Kubernetes Orchestration Techniques, Part 8: Managing Stateful Applications with an Operator

Author | 朱晋君
Source | 君哥聊技术 (ID: gh_1f109b82d301)

An operator is a Kubernetes extension that manages applications and their components through custom resources (Custom Resources), while following Kubernetes conventions. Its core is a controller that you write yourself to automate operational tasks against a custom resource, instead of relying only on Kubernetes' built-in controllers and resource types.

The kinds of automation you might implement with an operator roughly fall into the following categories:

  • Deploying an application on demand
  • Taking or restoring a backup of an application's state
  • Upgrading the application code together with its associated database or configuration
  • Publishing a Service so that applications that do not support the Kubernetes API can discover it
  • Simulating failures in the cluster to test its resilience
  • Choosing a leader for a distributed application

The list above comes from the official Kubernetes documentation, and it shows that an operator can use custom resources to orchestrate stateful applications.

In this article I will use etcd-operator to deploy an etcd cluster on Kubernetes, as a concrete example of an operator's orchestration capabilities.

Setting up the etcd cluster

Download the source code from:

https://github.com/coreos/etcd-operator
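
A typical way to fetch the code (assuming git is available on the machine):

git clone https://github.com/coreos/etcd-operator.git
cd etcd-operator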

Before installing the operator, run the following script first:

[root@master etcd-operator]# cd example/rbac/
[root@master rbac]# ./create_role.sh 
Creating role with ROLE_NAME=etcd-operator, NAMESPACE=default
clusterrole.rbac.authorization.k8s.io/etcd-operator created
Creating role binding with ROLE_NAME=etcd-operator, ROLE_BINDING_NAME=etcd-operator, NAMESPACE=default
clusterrolebinding.rbac.authorization.k8s.io/etcd-operator created

From this output we can see that the script sets up RBAC permissions for the operator: it creates a ClusterRole (etcd-operator) and a ClusterRoleBinding (etcd-operator), with the namespace parameter set to default. If you are not familiar with RBAC, see my earlier article "Kubernetes Orchestration Techniques, Part 6: RBAC Access Control".

With this in place, etcd-operator has permission to access the apiserver. To see exactly which permissions it has, let's describe the etcd-operator ClusterRole:

[root@master rbac]# kubectl describe ClusterRole etcd-operator
Name:         etcd-operator
Labels:       <none>
Annotations:  <none>
PolicyRule:
  Resources                                       Non-Resource URLs  Resource Names  Verbs
  ---------                                       -----------------  --------------  -----
  endpoints                                       []                 []              [*]
  events                                          []                 []              [*]
  persistentvolumeclaims                          []                 []              [*]
  pods                                            []                 []              [*]
  services                                        []                 []              [*]
  customresourcedefinitions.apiextensions.k8s.io  []                 []              [*]
  deployments.apps                                []                 []              [*]
  etcdbackups.etcd.database.coreos.com            []                 []              [*]
  etcdclusters.etcd.database.coreos.com           []                 []              [*]
  etcdrestores.etcd.database.coreos.com           []                 []              [*]
  secrets                                         []                 []              [get]

As you can see, the permissions are very broad: apart from secrets, where it only has get, it has every verb on all the other API objects listed. Next, let's look at the ClusterRoleBinding:

[root@master rbac]# kubectl describe ClusterRoleBinding etcd-operator
Name:         etcd-operator
Labels:       <none>
Annotations:  <none>
Role:
  Kind:  ClusterRole
  Name:  etcd-operator
Subjects:
  Kind            Name     Namespace
  ----            ----     ---------
  ServiceAccount  default  default

As you can see, the subject is a ServiceAccount (default, in the default namespace).
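
For reference, what create_role.sh applies is roughly equivalent to the following manifests (a sketch reconstructed from the describe output above; the actual templates live under example/rbac/ in the repo):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: etcd-operator
rules:
- apiGroups: ["etcd.database.coreos.com"]
  resources: ["etcdclusters", "etcdbackups", "etcdrestores"]
  verbs: ["*"]
- apiGroups: ["apiextensions.k8s.io"]
  resources: ["customresourcedefinitions"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["pods", "services", "endpoints", "persistentvolumeclaims", "events"]
  verbs: ["*"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: etcd-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: etcd-operator
subjects:
- kind: ServiceAccount
  name: default
  namespace: default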

With RBAC created, let's look at the yaml file for etcd-operator itself (note: the yaml in the source repo does not match my Kubernetes version, v1.17.3, so I modified it and marked the changes):

[root@master example]# cat deployment.yaml 
apiVersion: apps/v1 # modified
kind: Deployment
metadata:
  name: etcd-operator
spec:
  selector:  # added this selector block
    matchLabels:
      app: etcd-operator
  replicas: 1
  template:
    metadata:
      labels:
        app: etcd-operator # key changed from name to app
    spec:
      containers:
      - name: etcd-operator
        image: quay.io/coreos/etcd-operator:v0.9.4
        command:
        - etcd-operator
        # Uncomment to act for resources in all namespaces. More information in doc/user/clusterwide.md
        #- -cluster-wide
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name

As you can see, it is just a Deployment with replicas set to 1. Now let's create this object, that is, create etcd-operator:

[root@master example]# kubectl create -f deployment.yaml 
deployment.apps/etcd-operator created

Once created, after a little while you can see its pod enter the Running state:

[root@master rbac]# kubectl get pod
NAME                             READY   STATUS    RESTARTS   AGE
etcd-operator-84cf6bc5d5-gfwzn   1/1     Running   0          105s
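
If you want to follow what the operator does after it starts, you can tail its logs (the pod name is the one from the output above and will differ in your environment):

kubectl logs -f etcd-operator-84cf6bc5d5-gfwzn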

When the operator is created, the application creates a crd (Custom Resource Definition); this is precisely the defining trait of an operator. Let's check the crd:

[root@master rbac]# kubectl get crd
NAME                                    CREATED AT
etcdclusters.etcd.database.coreos.com   2020-08-29T06:51:14Z

Next, let's look at the details of this crd:

[root@master rbac]# kubectl describe crd etcdclusters.etcd.database.coreos.com
Name:         etcdclusters.etcd.database.coreos.com
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  apiextensions.k8s.io/v1
Kind:         CustomResourceDefinition
Metadata:
  Creation Timestamp:  2020-08-29T06:51:14Z
  Generation:          1
  Resource Version:    3553
  Self Link:           /apis/apiextensions.k8s.io/v1/customresourcedefinitions/etcdclusters.etcd.database.coreos.com
  UID:                 e71b69e9-1c55-4f38-a970-dad8484265ea
Spec:
  Conversion:
    Strategy:  None
  Group:       etcd.database.coreos.com
  Names:
    Kind:       EtcdCluster
    List Kind:  EtcdClusterList
    Plural:     etcdclusters
    Short Names:
      etcd
    Singular:               etcdcluster
  Preserve Unknown Fields:  true
  Scope:                    Namespaced
  Versions:
    Name:     v1beta2
    Served:   true
    Storage:  true
Status:
  Accepted Names:
    Kind:       EtcdCluster
    List Kind:  EtcdClusterList
    Plural:     etcdclusters
    Short Names:
      etcd
    Singular:  etcdcluster
  Conditions:
    Last Transition Time:  2020-08-29T06:51:14Z
    Message:               no conflicts found
    Reason:                NoConflicts
    Status:                True
    Type:                  NamesAccepted
    Last Transition Time:  2020-08-29T06:51:14Z
    Message:               the initial names have been accepted
    Reason:                InitialNamesAccepted
    Status:                True
    Type:                  Established
  Stored Versions:
    v1beta2
Events:  <none>

The Group defined in this crd is etcd.database.coreos.com and the Kind is EtcdCluster. With this crd registered, the operator can act as a controller for resources of this type.
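
In other words, what the operator registered is roughly the following CustomResourceDefinition (a sketch reconstructed from the describe output above; the operator actually registers it programmatically through the apiextensions API, shown here in the older v1beta1 form that does not require a schema):

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: etcdclusters.etcd.database.coreos.com
spec:
  group: etcd.database.coreos.com
  version: v1beta2
  scope: Namespaced
  names:
    kind: EtcdCluster
    listKind: EtcdClusterList
    plural: etcdclusters
    singular: etcdcluster
    shortNames:
    - etcd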

Now let's create the cluster. First, take a look at the yaml file:

[root@master example]# cat example-etcd-cluster.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
  ## Adding this annotation make this cluster managed by clusterwide operators
  ## namespaced operators ignore it
  # annotations:
  #   etcd.database.coreos.com/scope: clusterwide
spec:
  size: 2 # modified, the original value was 3
  version: "3.3.25" # modified, the original value was 3.2.13

The definition is very simple: the cluster size is 2, the etcd version is 3.3.25, and the kind is our custom resource type EtcdCluster. In other words, this is a concrete instance of the crd, i.e. a CR (Custom Resource). Create the cluster:

[root@master example]# kubectl create -f example-etcd-cluster.yaml
etcdcluster.etcd.database.coreos.com/example-etcd-cluster created

After a short while, check the pods; you can see the two members of the cluster have been created:

NAME                              READY   STATUS    RESTARTS   AGE
example-etcd-cluster-4t886mhnwv   1/1     Running   0          2m43s
example-etcd-cluster-jkclxffwf5   1/1     Running   0          2m52s
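
Besides the pods, the operator also creates Services for the cluster; the per-pod DNS names you will see later (for example example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc) resolve through a headless service. Assuming etcd-operator's default behavior, you should see a peer service named example-etcd-cluster and a client service named example-etcd-cluster-client:

kubectl get svc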

Now let's exec into one of the pods and run a few etcd data operations. As the result below shows, the cluster is already usable:

[root@master ~]# kubectl exec -it example-etcd-cluster-4t886mhnwv -- /bin/sh
/ # etcdctl set test "123456"
123456
/ # etcdctl get test
123456
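
The set/get commands above go through etcd's v2 API, which is what etcdctl defaults to in this image. To exercise the v3 API instead, set ETCDCTL_API=3 (the same variable used by the liveness and readiness probes shown later), for example:

/ # ETCDCTL_API=3 etcdctl put test3 "654321"
OK
/ # ETCDCTL_API=3 etcdctl get test3
test3
654321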

Notes:

1. Because of the limitations of my local environment, the Kubernetes cluster has only two nodes, one master and one worker, and by default no pods are scheduled onto the master node because of its taint. There are two ways to deal with this: one is the DaemonSet approach I described earlier (a DaemonSet tolerates the taint rather than removing it); the other, simpler one is to remove the taint with the following command:

# "master" below is the name of the master node

kubectl taint node master node-role.kubernetes.io/master-

To restore the taint on the master node, use the following command (a taint needs an effect, so NoSchedule is specified here):

kubectl taint node master node-role.kubernetes.io/master=:NoSchedule
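
You can check whether the taint is currently present on the node with:

kubectl describe node master | grep -i taint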

2. When I first deployed the cluster, the pods kept failing to start. Looking at the docker logs, I found that nslookup of the domain name below was failing, with the IP address 10.96.0.10 shown in the error. The lookup comes from the check-dns init container that every pod in the etcd cluster runs:

example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc

That reminded me that the default service cluster IP range configured for kube-apiserver was:

--service-cluster-ip-range=10.96.0.0/12

I changed it to the following and the problem went away:

# in the file /etc/kubernetes/manifests/kube-apiserver.yaml
- --service-cluster-ip-range=10.244.0.0/16
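
If you hit similar DNS failures, a quick way to see which clusterIP the DNS addon is actually serving on (and compare it with the nameserver in your pods' /etc/resolv.conf) is the command below; with kubeadm the DNS Service is named kube-dns even when CoreDNS is used:

kubectl get svc -n kube-system kube-dns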

How the cluster works

First, let's look at the details of the two pods:

[root@master example]# kubectl describe pod example-etcd-cluster-jkclxffwf5
Name:         example-etcd-cluster-jkclxffwf5
Namespace:    default
Priority:     0
Node:         worker1/192.168.59.141
Start Time:   Sat, 29 Aug 2020 03:03:57 -0400
Labels:       app=etcd
              etcd_cluster=example-etcd-cluster
              etcd_node=example-etcd-cluster-jkclxffwf5
Annotations:  etcd.version: 3.3.25
Status:       Running
IP:           10.244.1.3
IPs:
  IP:           10.244.1.3
Controlled By:  EtcdCluster/example-etcd-cluster
Init Containers:
  check-dns:
    Container ID:  docker://c45f53ff2aa08527fe969bb92dee474d1286b8279418c0493ea94c6fdac2635d
    Image:         busybox:1.28.0-glibc
    Image ID:      docker-pullable://busybox@sha256:0b55a30394294ab23b9afd58fab94e61a923f5834fba7ddbae7f8e0c11ba85e6
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      
                TIMEOUT_READY=0
                while ( ! nslookup example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc )
                do
                  # If TIMEOUT_READY is 0 we should never time out and exit 
                  TIMEOUT_READY=$(( TIMEOUT_READY-1 ))
                              if [ $TIMEOUT_READY -eq 0 ];
                                  then
                                      echo "Timed out waiting for DNS entry"
                                      exit 1
                                  fi
                              sleep 1
                            done
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 29 Aug 2020 03:03:57 -0400
      Finished:     Sat, 29 Aug 2020 03:03:58 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:         <none>
Containers:
  etcd:
    Container ID:  docker://71f35e088edf6633c8f5edad81cc157b278644c19ab21107e9f3b1facd55f72d
    Image:         quay.io/coreos/etcd:v3.3.25
    Image ID:      docker-pullable://quay.io/coreos/etcd@sha256:ff9226afaecbe1683f797f84326d1494092ac41d688b8d68b69f7a6462d51dc9
    Ports:         2380/TCP, 2379/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /usr/local/bin/etcd
      --data-dir=/var/etcd/data
      --name=example-etcd-cluster-jkclxffwf5
      --initial-advertise-peer-urls=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380
      --listen-peer-urls=http://0.0.0.0:2380
      --listen-client-urls=http://0.0.0.0:2379
      --advertise-client-urls=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2379
      --initial-cluster=example-etcd-cluster-jkclxffwf5=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380
      --initial-cluster-state=new
      --initial-cluster-token=091b2149-aec1-428a-9fe4-c8d242d5be1b
    State:          Running
      Started:      Sat, 29 Aug 2020 03:03:59 -0400
    Ready:          True
    Restart Count:  0
    Liveness:       exec [/bin/sh -ec ETCDCTL_API=3 etcdctl endpoint status] delay=10s timeout=10s period=60s #success=1 #failure=3
    Readiness:      exec [/bin/sh -ec ETCDCTL_API=3 etcdctl endpoint status] delay=1s timeout=5s period=5s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /var/etcd from etcd-data (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  etcd-data:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age        From               Message
  ----    ------     ----       ----               -------
  Normal  Scheduled  <unknown>  default-scheduler  Successfully assigned default/example-etcd-cluster-jkclxffwf5 to worker1
  Normal  Pulled     15m        kubelet, worker1   Container image "busybox:1.28.0-glibc" already present on machine
  Normal  Created    15m        kubelet, worker1   Created container check-dns
  Normal  Started    15m        kubelet, worker1   Started container check-dns
  Normal  Pulled     15m        kubelet, worker1   Container image "quay.io/coreos/etcd:v3.3.25" already present on machine
  Normal  Created    15m        kubelet, worker1   Created container etcd
  Normal  Started    15m        kubelet, worker1   Started container etcd

This pod is the first member of the etcd cluster, the one that bootstrapped it, because it was started with --initial-cluster-state=new, whereas the member that joined afterwards has that flag set to existing:

[root@master example]# kubectl describe pod example-etcd-cluster-4t886mhnwv
Name:         example-etcd-cluster-4t886mhnwv
Namespace:    default
Priority:     0
Node:         master/192.168.59.132
Start Time:   Sat, 29 Aug 2020 03:04:05 -0400
Labels:       app=etcd
              etcd_cluster=example-etcd-cluster
              etcd_node=example-etcd-cluster-4t886mhnwv
Annotations:  etcd.version: 3.3.25
Status:       Running
IP:           10.244.0.4
IPs:
  IP:           10.244.0.4
  # other output omitted
Containers:
  etcd:
    Container ID:  docker://495b349704d5f4199b200913b4fb4d5d682258e486490e2ab64da1f7b55a5945
    Image:         quay.io/coreos/etcd:v3.3.25
    Image ID:      docker-pullable://quay.io/coreos/etcd@sha256:ff9226afaecbe1683f797f84326d1494092ac41d688b8d68b69f7a6462d51dc9
    Ports:         2380/TCP, 2379/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /usr/local/bin/etcd
      --data-dir=/var/etcd/data
      --name=example-etcd-cluster-4t886mhnwv
      --initial-advertise-peer-urls=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2380
      --listen-peer-urls=http://0.0.0.0:2380
      --listen-client-urls=http://0.0.0.0:2379
      --advertise-client-urls=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2379
      --initial-cluster=example-etcd-cluster-jkclxffwf5=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380,example-etcd-cluster-4t886mhnwv=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2380
      --initial-cluster-state=existing
    State:          Running
      Started:      Sat, 29 Aug 2020 03:04:37 -0400
    Ready:          True
    Restart Count:  0
    Liveness:       exec [/bin/sh -ec ETCDCTL_API=3 etcdctl endpoint status] delay=10s timeout=10s period=60s #success=1 #failure=3
    Readiness:      exec [/bin/sh -ec ETCDCTL_API=3 etcdctl endpoint status] delay=1s timeout=5s period=5s #success=1 #failure=3
# other output omitted

For comparison, here is how an etcd cluster is assembled without etcd operator, that is, as a manually bootstrapped, static cluster.

First you start one node with its initial-cluster-state parameter set to new and a fixed IP address, say 10.244.1.3, so its initial-cluster parameter is master=http://10.244.1.3:2380.

Then you start another etcd node, say with IP 10.244.0.4, whose initial-cluster-state is existing, and have it join the first node. Once it has joined, the cluster has two members, so the initial-cluster parameter becomes master=http://10.244.1.3:2380,worker1=http://10.244.0.4:2380.

After that, any further nodes keep joining the cluster in the same way, and the static cluster is built up step by step; a sketch of this manual process is shown below.
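
A minimal sketch of that manual process, using the two IPs from the example above (in practice you would also have to announce the second member with etcdctl member add on the first node before starting it with --initial-cluster-state=existing, which is exactly one of the steps the operator automates):

# first member, bootstrapping a new cluster on 10.244.1.3
etcd --name master \
  --data-dir=/var/etcd/data \
  --initial-advertise-peer-urls=http://10.244.1.3:2380 \
  --listen-peer-urls=http://0.0.0.0:2380 \
  --listen-client-urls=http://0.0.0.0:2379 \
  --advertise-client-urls=http://10.244.1.3:2379 \
  --initial-cluster=master=http://10.244.1.3:2380 \
  --initial-cluster-state=new

# announce the second member to the running cluster, then start it on 10.244.0.4
etcdctl --endpoints=http://10.244.1.3:2379 member add worker1 http://10.244.0.4:2380
etcd --name worker1 \
  --data-dir=/var/etcd/data \
  --initial-advertise-peer-urls=http://10.244.0.4:2380 \
  --listen-peer-urls=http://0.0.0.0:2380 \
  --listen-client-urls=http://0.0.0.0:2379 \
  --advertise-client-urls=http://10.244.0.4:2379 \
  --initial-cluster=master=http://10.244.1.3:2380,worker1=http://10.244.0.4:2380 \
  --initial-cluster-state=existing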

The advantage of using etcd operator is that it automates this static assembly process. In the describe output of the two pods above, each one has a Command, and those commands are exactly how etcd operator assembles the cluster automatically. Here are the two commands again:

/usr/local/bin/etcd
      --data-dir=/var/etcd/data
      --name=example-etcd-cluster-jkclxffwf5
      --initial-advertise-peer-urls=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380
      --listen-peer-urls=http://0.0.0.0:2380
      --listen-client-urls=http://0.0.0.0:2379
      --advertise-client-urls=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2379
      --initial-cluster=example-etcd-cluster-jkclxffwf5=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380
      --initial-cluster-state=new
      --initial-cluster-token=091b2149-aec1-428a-9fe4-c8d242d5be1b
    
/usr/local/bin/etcd
      --data-dir=/var/etcd/data
      --name=example-etcd-cluster-4t886mhnwv
      --initial-advertise-peer-urls=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2380
      --listen-peer-urls=http://0.0.0.0:2380
      --listen-client-urls=http://0.0.0.0:2379
      --advertise-client-urls=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2379
      --initial-cluster=example-etcd-cluster-jkclxffwf5=http://example-etcd-cluster-jkclxffwf5.example-etcd-cluster.default.svc:2380,example-etcd-cluster-4t886mhnwv=http://example-etcd-cluster-4t886mhnwv.example-etcd-cluster.default.svc:2380
      --initial-cluster-state=existing

Based on the parameter size: 2, etcd operator assembles the etcd cluster dynamically: it first starts one member with the first command above, and once that member is up it starts the second member with the second command and joins it to the first member's cluster, repeating this process until the number of pods reaches size. From the initial-cluster field we can also see that the cluster assembled by the operator uses DNS names rather than fixed IPs.

This process of etcd operator creating the cluster is exactly the operator's control loop over the CRD: the operator keeps the number of EtcdCluster pods consistent with the size declared in the EtcdCluster object. The scaling sketch below shows this reconciliation in action.
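
For example, to scale the cluster from 2 to 3 members you only need to change spec.size in the EtcdCluster object; the operator notices the difference and starts another member that joins the existing cluster (a sketch, the generated pod name will differ):

# bump spec.size from 2 to 3
kubectl patch etcdcluster example-etcd-cluster --type=merge -p '{"spec": {"size": 3}}'
# watch the operator bring up the new member
kubectl get pod -l etcd_cluster=example-etcd-cluster -w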

etcd operator not only defines how the cluster is created, it also defines cluster backup and failure recovery; the yaml files are in the two directories below, which I will not go into here:

/root/k8s/etcd-operator/example/etcd-backup-operator
/root/k8s/etcd-operator/example/etcd-restore-operator

Summary

The essence of an operator is to create a CRD and then write a controller that drives objects of that CRD. Unlike StatefulSet orchestration, which pins the topology by binding a stable ordinal to each pod, the operator in this article does not do that: etcd operator's orchestration boils down to adding new members to the cluster and removing surplus ones, and etcd maintains the membership topology internally, so binding ordinals would serve no purpose.
