Kubernetes Orchestration, Part 8: Managing Stateful Applications with an Operator
Author | 朱晋君
Source | 君哥聊技术 (ID: gh_1f109b82d301)
An operator is a Kubernetes extension that uses Custom Resources to manage applications and their components, following Kubernetes conventions. At its core, you write your own controller, paired with a CRD, to automate operational tasks that Kubernetes' built-in controllers and resource types cannot express on their own.
The automation needs that operators address fall roughly into the following categories:
- Deploying an application on demand
- Taking or restoring a backup of an application's state
- Upgrading application code together with associated changes such as databases or configuration
- Publishing a Service so that applications that do not support the Kubernetes API can discover it
- Simulating failures in the cluster to test its resilience
- Choosing a leader for a distributed application
The list above comes from the official Kubernetes documentation, and it shows that an operator can use custom resources to orchestrate stateful applications.
In this article I walk through the orchestration an operator provides by using etcd-operator to deploy an etcd cluster on Kubernetes.
Setting up the etcd cluster
Download the source code from:
https://github.com/coreos/etcd-operator
Before installing the operator, run the following script:
[root@master etcd-operator]# cd example/rbac/
[root@master rbac]# ./create_role.sh
Creating role with ROLE_NAME=etcd-operator, NAMESPACE=default
clusterrole.rbac.authorization.k8s.io/etcd-operator created
Creating role binding with ROLE_NAME=etcd-operator, ROLE_BINDING_NAME=etcd-operator, NAMESPACE=default
clusterrolebinding.rbac.authorization.k8s.io/etcd-operator created
As the output shows, this script sets up RBAC permissions for the operator: it creates a ClusterRole (etcd-operator) and a ClusterRoleBinding (etcd-operator). Both objects are cluster-scoped; the NAMESPACE=default in the output refers to the ServiceAccount that the binding targets. If you are not yet familiar with RBAC, see Kubernetes Orchestration, Part 6: RBAC Access Control.
With this in place, etcd-operator has permission to access the apiserver. To see exactly which permissions, inspect the etcd-operator ClusterRole:
[root@master rbac]# kubectl describe ClusterRole etcd-operator
Name: etcd-operator
Labels: <none>
Annotations: <none>
PolicyRule:
Resources Non-Resource URLs Resource Names Verbs
--------- ----------------- -------------- -----
endpoints [] [] [*]
events [] [] [*]
persistentvolumeclaims [] [] [*]
pods [] [] [*]
services [] [] [*]
customresourcedefinitions.apiextensions.k8s.io [] [] [*]
deployments.apps [] [] [*]
etcdbackups.etcd.database.coreos.com [] [] [*]
etcdclusters.etcd.database.coreos.com [] [] [*]
etcdrestores.etcd.database.coreos.com [] [] [*]
secrets [] [] [get]
As you can see, its permissions are very broad: apart from secrets, where it only has get, it has every verb on all the listed API objects. Now let's look at the ClusterRoleBinding:
[root@master rbac]# kubectl describe ClusterRoleBinding etcd-operator
Name: etcd-operator
Labels: <none>
Annotations: <none>
Role:
Kind: ClusterRole
Name: etcd-operator
Subjects:
Kind Name Namespace
---- ---- ---------
ServiceAccount default default
As you can see, the subject is the ServiceAccount named default in the default namespace.
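To spot-check what this grants, you can query the authorization layer directly with kubectl auth can-i, impersonating the bound ServiceAccount (a quick sanity check I am adding here, not part of the original setup):
# Expected "yes": the ClusterRole allows all verbs on pods
kubectl auth can-i create pods --as=system:serviceaccount:default:default
# Expected "no": secrets are restricted to get
kubectl auth can-i create secrets --as=system:serviceaccount:default:default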
With RBAC created, let's look at the etcd-operator YAML file (note: the YAML in the source repo does not match my Kubernetes version, v1.17.3, so I modified it and marked the changes):
[root@master example]# cat deployment.yaml
apiVersion: apps/v1 # modified
kind: Deployment
metadata:
  name: etcd-operator
spec:
  selector: # added this selector
    matchLabels:
      app: etcd-operator
  replicas: 1
  template:
    metadata:
      labels:
        app: etcd-operator # key changed from name to app
    spec:
      containers:
      - name: etcd-operator
        image: quay.io/coreos/etcd-operator:v0.9.4
        command:
        - etcd-operator
        # Uncomment to act for resources in all namespaces. More information in doc/user/clusterwide.md
        #- -cluster-wide
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
As you can see, it is simply a Deployment with replicas set to 1. Now create the object, i.e. create etcd-operator:
[root@master example]# kubectl create -f deployment.yaml
deployment.apps/etcd-operator created
After a short while, the pod enters the Running state:
[root@master rbac]# kubectl get pod
NAME READY STATUS RESTARTS AGE
etcd-operator-84cf6bc5d5-gfwzn 1/1 Running 0 105s
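If the pod does not come up cleanly, the operator's own logs are the first place to look (the log content will vary with your environment):
kubectl logs deployment/etcd-operator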
When the operator starts, it registers a CRD (Custom Resource Definition); this is exactly what makes it an operator. Let's look at that CRD:
[root@master rbac]# kubectl get crd
NAME CREATED AT
etcdclusters.etcd.database.coreos.com 2020-08-29T06:51:14Z
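The new API group can now be queried like any built-in one; kubectl api-resources lists the kind and the etcd short name registered by the CRD:
kubectl api-resources --api-group=etcd.database.coreos.com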
Next, look at the CRD's details:
[root@master rbac]# kubectl describe crd etcdclusters.etcd.database.coreos.com
Name: etcdclusters.etcd.database.coreos.com
Namespace:
Labels: <none>
Annotations: <none>
API Version: apiextensions.k8s.io/v1
Kind: CustomResourceDefinition
Metadata:
Creation Timestamp: 2020-08-29T06:51:14Z
Generation: 1
Resource Version: 3553
Self Link: /apis/apiextensions.k8s.io/v1/customresourcedefinitions/etcdclusters.etcd.database.coreos.com
UID: e71b69e9-1c55-4f38-a970-dad8484265ea
Spec:
Conversion:
Strategy: None
Group: etcd.database.coreos.com
Names:
Kind: EtcdCluster
List Kind: EtcdClusterList
Plural: etcdclusters
Short Names:
etcd
Singular: etcdcluster
Preserve Unknown Fields: true
Scope: Namespaced
Versions:
Name: v1beta2
Served: true
Storage: true
Status:
Accepted Names:
Kind: EtcdCluster
List Kind: EtcdClusterList
Plural: etcdclusters
Short Names:
etcd
Singular: etcdcluster
Conditions:
Last Transition Time: 2020-08-29T06:51:14Z
Message: no conflicts found
Reason: NoConflicts
Status: True
Type: NamesAccepted
Last Transition Time: 2020-08-29T06:51:14Z
Message: the initial names have been accepted
Reason: InitialNamesAccepted
Status: True
Type: Established
Stored Versions:
v1beta2
Events: <none>
The group defined in this CRD is etcd.database.coreos.com and the kind is EtcdCluster. With this CRD in place, the operator can act as the controller for custom resources of this type.
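Concretely, the apiserver now serves a brand-new REST endpoint for this group/version/kind. You can hit it directly; before any cluster is created it returns an empty EtcdClusterList (shown purely to illustrate the mapping):
kubectl get --raw /apis/etcd.database.coreos.com/v1beta2/namespaces/default/etcdclusters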
Now let's create the cluster. First, the YAML file:
[root@master example]# cat example-etcd-cluster.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
  ## Adding this annotation make this cluster managed by clusterwide operators
  ## namespaced operators ignore it
  # annotations:
  #   etcd.database.coreos.com/scope: clusterwide
spec:
  size: 2 # modified; the original value is 3
  version: "3.3.25" # modified; the original value is 3.2.13
The definition is very simple: the cluster size is 2, the etcd version is 3.3.25, and the kind is our custom type EtcdCluster, so this object is a concrete instance of the CRD, i.e. a CR (Custom Resource). Create the cluster:
[root@master example]# kubectl create -f example-etcd-cluster.yaml
etcdcluster.etcd.database.coreos.com/example-etcd-cluster created
After a short while, check the pods; the cluster's 2 members have been created:
NAME READY STATUS RESTARTS AGE
example-etcd-cluster-4t886mhnwv 1/1 Running 0 2m43s
example-etcd-cluster-jkclxffwf5 1/1 Running 0 2m52s
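Besides the pods, the operator creates Services for the cluster: by etcd-operator convention, a headless example-etcd-cluster Service for peer traffic and an example-etcd-cluster-client Service for clients. Assuming the operator puts the same etcd_cluster label on its Services as on the pods (a plain kubectl get svc works regardless), you can list them with:
kubectl get svc -l etcd_cluster=example-etcd-cluster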
Now exec into one of the pods and run some etcd operations; as the output shows, the cluster is already usable:
[root@master ~]# kubectl exec -it example-etcd-cluster-4t886mhnwv -- /bin/sh
/ # etcdctl set test "123456"
123456
/ # etcdctl get test
123456
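Note that etcdctl set/get above use etcd's legacy v2 API, the default for the etcdctl bundled in this image. To go through the v3 API instead (the same API the pods' liveness and readiness probes use, as we will see below), set ETCDCTL_API=3, where put/get replace set/get:
# Still inside the pod's shell
ETCDCTL_API=3 etcdctl put test3 "123456"
ETCDCTL_API=3 etcdctl get test3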
Notes:
1. My local environment is limited to a 2-node Kubernetes cluster, one master and one worker node, and by default pods are not scheduled onto the master. There are two ways around this: one is to tolerate the taint (Taint), as the DaemonSets I covered earlier do; the simpler one is to remove the taint with the following command:
# "master" below is the name of the master node
kubectl taint node master node-role.kubernetes.io/master-
To restore the taint on the master node, run:
kubectl taint node master node-role.kubernetes.io/master=:NoSchedule
2. When I first deployed the cluster, the pods kept failing to start. The docker logs showed that nslookup of the domain name below was failing against the DNS server 10.96.0.10; each pod in the etcd cluster has a check-dns init container that performs exactly this lookup:
example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc
I then recalled that the default service cluster CIDR configured on the kube-apiserver is:
--service-cluster-ip-range=10.96.0.0/12
Changing it to the address below fixed the problem:
# in /etc/kubernetes/manifests/kube-apiserver.yaml
- --service-cluster-ip-range=10.244.0.0/16
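To verify that cluster DNS resolves the per-member names, you can run the same lookup the check-dns container performs from a throwaway pod (a diagnostic step I am adding, not part of the original procedure):
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28.0-glibc -- nslookup example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc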
How the cluster is assembled
First, look at the details of the 2 pods:
[root@master example]# kubectl describe pod example-etcd-cluster-jkclxffwf5
Name: example-etcd-cluster-jkclxffwf5
Namespace: default
Priority: 0
Node: worker1/192.168.59.141
Start Time: Sat, 29 Aug 2020 03:03:57 -0400
Labels: app=etcd
etcd_cluster=example-etcd-cluster
etcd_node=example-etcd-cluster-jkclxffwf5
Annotations: etcd.version: 3.3.25
Status: Running
IP: 10.244.1.3
IPs:
IP: 10.244.1.3
Controlled By: EtcdCluster/example-etcd-cluster
Init Containers:
check-dns:
Container ID: docker://c45f53ff2aa08527fe969bb92dee474d1286b8279418c0493ea94c6fdac2635d
Image: busybox:1.28.0-glibc
Image ID: docker-pullable://busybox@sha256:0b55a30394294ab23b9afd58fab94e61a923f5834fba7ddbae7f8e0c11ba85e6
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
TIMEOUT_READY=0
while ( ! nslookup example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc )
do
# If TIMEOUT_READY is 0 we should never time out and exit
TIMEOUT_READY=$(( TIMEOUT_READY-1 ))
if [ $TIMEOUT_READY -eq 0 ];
then
echo "Timed out waiting for DNS entry"
exit 1
fi
sleep 1
done
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 29 Aug 2020 03:03:57 -0400
Finished: Sat, 29 Aug 2020 03:03:58 -0400
Ready: True
Restart Count: 0
Environment: <none>
Mounts: <none>
Containers:
etcd:
Container ID: docker://71f35e088edf6633c8f5edad81cc157b278644c19ab21107e9f3b1facd55f72d
Image: quay.io/coreos/etcd:v3.3.25
Image ID: docker-pullable://quay.io/coreos/etcd@sha256:ff9226afaecbe1683f797f84326d1494092ac41d688b8d68b69f7a6462d51dc9
Ports: 2380/TCP, 2379/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/usr/local/bin/etcd
--data-dir=/var/etcd/data
--name=example-etcd-cluster-jkclxffwf5
--initial-advertise-peer-urls=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380
--listen-peer-urls=http://0.0.0.0:2380
--listen-client-urls=http://0.0.0.0:2379
--advertise-client-urls=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2379
--initial-cluster=example-etcd-cluster-jkclxffwf5=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380
--initial-cluster-state=new
--initial-cluster-token=091b2149-aec1-428a-9fe4-c8d242d5be1b
State: Running
Started: Sat, 29 Aug 2020 03:03:59 -0400
Ready: True
Restart Count: 0
Liveness: exec [/bin/sh -ec ETCDCTL_API=3 etcdctl endpoint status] delay=10s timeout=10s period=60s #success=1 #failure=3
Readiness: exec [/bin/sh -ec ETCDCTL_API=3 etcdctl endpoint status] delay=1s timeout=5s period=5s #success=1 #failure=3
Environment: <none>
Mounts:
/var/etcd from etcd-data (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
etcd-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/example-etcd-cluster-jkclxffwf5 to worker1
Normal Pulled 15m kubelet, worker1 Container image "busybox:1.28.0-glibc" already present on machine
Normal Created 15m kubelet, worker1 Created container check-dns
Normal Started 15m kubelet, worker1 Started container check-dns
Normal Pulled 15m kubelet, worker1 Container image "quay.io/coreos/etcd:v3.3.25" already present on machine
Normal Created 15m kubelet, worker1 Created container etcd
Normal Started 15m kubelet, worker1 Started container etcd
This pod is the seed member of the etcd cluster, started with --initial-cluster-state=new; members added afterwards have this flag set to existing, as shown below:
[root@master example]# kubectl describe pod example-etcd-cluster-4t886mhnwv
Name: example-etcd-cluster-4t886mhnwv
Namespace: default
Priority: 0
Node: master/192.168.59.132
Start Time: Sat, 29 Aug 2020 03:04:05 -0400
Labels: app=etcd
etcd_cluster=example-etcd-cluster
etcd_node=example-etcd-cluster-4t886mhnwv
Annotations: etcd.version: 3.3.25
Status: Running
IP: 10.244.0.4
IPs:
IP: 10.244.0.4
# other output omitted
Containers:
etcd:
Container ID: docker://495b349704d5f4199b200913b4fb4d5d682258e486490e2ab64da1f7b55a5945
Image: quay.io/coreos/etcd:v3.3.25
Image ID: docker-pullable://quay.io/coreos/etcd@sha256:ff9226afaecbe1683f797f84326d1494092ac41d688b8d68b69f7a6462d51dc9
Ports: 2380/TCP, 2379/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/usr/local/bin/etcd
--data-dir=/var/etcd/data
--name=example-etcd-cluster-4t886mhnwv
--initial-advertise-peer-urls=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2380
--listen-peer-urls=http://0.0.0.0:2380
--listen-client-urls=http://0.0.0.0:2379
--advertise-client-urls=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2379
--initial-cluster=example-etcd-cluster-jkclxffwf5=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380,example-etcd-cluster-4t886mhnwv=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2380
--initial-cluster-state=existing
State: Running
Started: Sat, 29 Aug 2020 03:04:37 -0400
Ready: True
Restart Count: 0
Liveness: exec [/bin/sh -ec ETCDCTL_API=3 etcdctl endpoint status] delay=10s timeout=10s period=60s #success=1 #failure=3
Readiness: exec [/bin/sh -ec ETCDCTL_API=3 etcdctl endpoint status] delay=1s timeout=5s period=5s #success=1 #failure=3
# other output omitted
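From inside either pod you can confirm that the two members joined the same cluster; the member IDs will differ on every deployment:
[root@master ~]# kubectl exec -it example-etcd-cluster-4t886mhnwv -- /bin/sh
/ # ETCDCTL_API=3 etcdctl member list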
For contrast, here is how an etcd cluster is assembled without etcd operator, i.e. as a static cluster.
First, start one node with the initial-cluster-state parameter set to new and a fixed IP, say 10.244.1.3, so its initial-cluster parameter is master=http://10.244.1.3:2380.
Then start a second etcd node, say with IP 10.244.0.4, with initial-cluster-state set to existing, and join it to the first node. Once it joins, the cluster has 2 members, so the initial-cluster parameter becomes master=http://10.244.1.3:2380,worker1=http://10.244.0.4:2380.
Every further node joins the first node in the same way, and the static cluster is built up step by step.
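A minimal sketch of that manual bootstrap, assuming two hosts with the IPs above and the etcd binary on the PATH (the names master/worker1 are illustrative):
# On 10.244.1.3: the seed member bootstraps a brand-new cluster
etcd --name master --data-dir /var/etcd/data \
  --initial-advertise-peer-urls http://10.244.1.3:2380 --listen-peer-urls http://0.0.0.0:2380 \
  --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://10.244.1.3:2379 \
  --initial-cluster master=http://10.244.1.3:2380 --initial-cluster-state new
# Register the second member with the running cluster first
ETCDCTL_API=3 etcdctl --endpoints http://10.244.1.3:2379 member add worker1 --peer-urls=http://10.244.0.4:2380
# On 10.244.0.4: join with state "existing" and the full member list
etcd --name worker1 --data-dir /var/etcd/data \
  --initial-advertise-peer-urls http://10.244.0.4:2380 --listen-peer-urls http://0.0.0.0:2380 \
  --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://10.244.0.4:2379 \
  --initial-cluster master=http://10.244.1.3:2380,worker1=http://10.244.0.4:2380 --initial-cluster-state existing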
The advantage of etcd operator is that it automates exactly this static assembly. In the describe output of the two pods above, each pod has a Command; that command is how etcd operator builds the cluster. Here are the two commands again:
/usr/local/bin/etcd
--data-dir=/var/etcd/data
--name=example-etcd-cluster-jkclxffwf5
--initial-advertise-peer-urls=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380
--listen-peer-urls=http://0.0.0.0:2380
--listen-client-urls=http://0.0.0.0:2379
--advertise-client-urls=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2379
--initial-cluster=example-etcd-cluster-jkclxffwf5=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380
--initial-cluster-state=new
--initial-cluster-token=091b2149-aec1-428a-9fe4-c8d242d5be1b
/usr/local/bin/etcd
--data-dir=/var/etcd/data
--name=example-etcd-cluster-4t886mhnwv
--initial-advertise-peer-urls=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2380
--listen-peer-urls=http://0.0.0.0:2380
--listen-client-urls=http://0.0.0.0:2379
--advertise-client-urls=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2379
--initial-cluster=example-etcd-cluster-jkclxffwf5=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380,example-etcd-cluster-4t886mhnwv=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2380
--initial-cluster-state=existing
Driven by size=2, etcd operator assembles the cluster dynamically: it starts the first member with the first command above, and once that member is up, starts the second member with the second command, joining it to the first member's cluster; this repeats until the number of pods reaches size. From the initial-cluster fields you can also see that the operator-built cluster addresses members by DNS name rather than by fixed IP.
This cluster-building process is precisely the operator's control loop over the CRD: the operator reconciles the actual number of EtcdCluster member pods against the size declared in the EtcdCluster object, which you can watch in action below.
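For example, bump size on the live object and the operator should grow the cluster to match (a quick experiment of mine, not from the original walkthrough):
# Grow the cluster from 2 to 3 members, then watch the new pod appear
kubectl patch etcdcluster example-etcd-cluster --type merge -p '{"spec":{"size":3}}'
kubectl get pod -l etcd_cluster=example-etcd-cluster -w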
etcd operator defines not only cluster creation but also cluster backup and restore; the YAML files live in the two directories below, which I won't go into here:
/root/k8s/etcd-operator/example/etcd-backup-operator
/root/k8s/etcd-operator/example/etcd-restore-operator
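For reference, a backup request follows the same custom-resource pattern. Here is a minimal sketch based on the examples in that repo, assuming the etcd backup operator from the first directory is running, and that my-bucket/etcd.backup and the aws-secret Secret are placeholders you replace with your own S3 path and AWS credentials:
cat <<'EOF' | kubectl create -f -
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdBackup"
metadata:
  name: example-etcd-cluster-backup
spec:
  etcdEndpoints:
    - http://example-etcd-cluster-client:2379
  storageType: S3
  s3:
    path: my-bucket/etcd.backup
    awsSecret: aws-secret
EOF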
Summary
In essence, an operator creates a CRD and then runs a controller that drives the lifecycle of resources of that type. Unlike StatefulSet orchestration, which pins the topology by binding stable ordinal numbers to pods, the operator in this article does no such binding. The reason is that etcd operator's orchestration amounts to adding new members to the cluster and removing surplus ones; etcd maintains the cluster topology internally, so bound ordinals would add nothing.