A Brief Look at the flannel Network Plugin in Kubernetes
Author | Zhu Jinjun (朱晋君)
Source | 君哥聊技术 (ID: gh_1f109b82d301)
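Container networking comes down to two requirements: containers on the same host can reach each other, and containers on different hosts can reach each other as well.
Over the course of its development, Kubernetes did not implement a network specification of its own and instead focused on its core orchestration features. One important reason is that the CNI network specification, initiated by CoreOS, already existed at the time, and the flannel model was initially sufficient for Kubernetes. When more complex problems came up later, Calico and Weave largely solved them.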
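CNI network plugins have become the de facto networking standard in the container world. CNI has two main parts:
1. The CNI plugin configures the network for a container.
2. The IPAM plugin assigns an IP address to the container; the main implementations are host-local and dhcp.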
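As a rough illustration of what a host-local style IPAM plugin has to do, the sketch below hands out addresses from a single subnet in order. The class name HostLocalAllocator, the container IDs, and the choice of the first host address as the gateway are assumptions made for this example; the real plugin keeps its state on the host's disk rather than in memory.

```python
import ipaddress


class HostLocalAllocator:
    """Toy allocator (hypothetical): first free address in one subnet wins."""

    def __init__(self, subnet: str):
        self.network = ipaddress.ip_network(subnet)
        hosts = iter(self.network.hosts())   # usable addresses, no network/broadcast
        self.gateway = next(hosts)           # assumption: first address goes to the bridge
        self._free = list(hosts)             # remaining addresses, in ascending order
        self._leases = {}                    # container id -> leased address

    def allocate(self, container_id: str):
        if container_id in self._leases:     # asking twice returns the same lease
            return self._leases[container_id]
        if not self._free:
            raise RuntimeError("subnet exhausted")
        addr = self._free.pop(0)
        self._leases[container_id] = addr
        return addr

    def release(self, container_id: str) -> None:
        addr = self._leases.pop(container_id, None)
        if addr is not None:
            self._free.append(addr)          # address becomes available again


allocator = HostLocalAllocator("10.244.1.0/24")
first = allocator.allocate("pod-a")
second = allocator.allocate("pod-b")
print(allocator.gateway, first, second)
```

The point of the sketch is only that address assignment is local to one host and one subnet, which is why no coordination between hosts is needed at this step.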
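flannel provides a virtual network for containers by assigning a subnet to each host. It is based on Linux TUN/TAP, uses UDP to encapsulate IP packets to build the virtual network, and stores the network allocations in etcd.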
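The one-subnet-per-host idea can be sketched with Python's ipaddress module. The cluster range 10.244.0.0/16 and the /24 size per host are assumptions inferred from the pod addresses in this article, and the function name assign_host_subnets is hypothetical; flannel itself records the leases in a shared store, not in a local dictionary.

```python
import ipaddress


def assign_host_subnets(cluster_cidr, hosts, new_prefix=24):
    """Give each host the next unused /new_prefix block of the cluster range."""
    pool = ipaddress.ip_network(cluster_cidr).subnets(new_prefix=new_prefix)
    return {host: next(pool) for host in hosts}


# Host addresses are the two nodes used in this article.
leases = assign_host_subnets("10.244.0.0/16", ["192.168.59.132", "192.168.59.138"])
for host, subnet in leases.items():
    print(host, "->", subnet)
```

With these inputs the master gets 10.244.0.0/24 and the worker gets 10.244.1.0/24, which matches the cni0 addresses shown later in the article.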
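Let's revisit the system layout from the previous article, in which we deployed a Spring Boot application on Kubernetes; see the figure below.
The master and worker nodes are deployed on two virtual machines, vmware1 and vmware2, which serve as the hosts. The master node's IP is 192.168.59.132 and the worker node's IP is 192.168.59.138. After the Spring Boot application starts, the pod status is as follows: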
[root@master manifests]# kubectl get pod -l app -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
springboot-mybatis-deployment-5b78f66997-6kl96 1/1 Running 0 6m15s 10.244.1.5 worker1 <none> <none>
springboot-mybatis-deployment-5b78f66997-9hdl8 1/1 Running 0 6m15s 10.244.1.4 worker1 <none> <none>
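Communication between containers on the same host
After the pods are created, a bridge is created on the host. Running ifconfig on host 192.168.59.138 shows the cni0 bridge; every device attached to this bridge can communicate through cni0. The output also shows two interfaces, veth0f38c044 and vetha234d8db.
The output is as follows: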
[root@worker1 kubernetes]# ifconfig
cni0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.244.1.1 netmask 255.255.255.0 broadcast 0.0.0.0
inet6 fe80::6483:ff:fedb:8c82 prefixlen 64 scopeid 0x20<link>
ether 66:83:00:db:8c:82 txqueuelen 1000 (Ethernet)
RX packets 34 bytes 2136 (2.0 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 25 bytes 3837 (3.7 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 0.0.0.0
ether 02:42:1d:56:39:7d txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.59.138 netmask 255.255.255.0 broadcast 192.168.59.255
inet6 fe80::3cfd:163:46da:952 prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:9f:fe:5c txqueuelen 1000 (Ethernet)
RX packets 20796 bytes 6814956 (6.4 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 14580 bytes 1441166 (1.3 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.244.1.0 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::f4d6:9dff:fe6d:21c6 prefixlen 64 scopeid 0x20<link>
ether f6:d6:9d:6d:21:c6 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 25 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 622 bytes 38464 (37.5 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 622 bytes 38464 (37.5 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth0f38c044: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet6 fe80::2497:75ff:fe91:8d82 prefixlen 64 scopeid 0x20<link>
ether 26:97:75:91:8d:82 txqueuelen 0 (Ethernet)
RX packets 9 bytes 698 (698.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 25 bytes 1962 (1.9 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vetha234d8db: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet6 fe80::e46f:9aff:fe68:9d6c prefixlen 64 scopeid 0x20<link>
ether e6:6f:9a:68:9d:6c txqueuelen 0 (Ethernet)
RX packets 9 bytes 698 (698.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 23 bytes 1830 (1.7 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
ether 52:54:00:c8:c6:2b txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Running brctl show confirms that the two interfaces veth0f38c044 and vetha234d8db are plugged into the cni0 bridge:
[root@worker1 kubernetes]# brctl show
bridge name bridge id STP enabled interfaces
cni0 8000.668300db8c82 no veth0f38c044
vetha234d8db
docker0 8000.02421d56397d no
virbr0 8000.525400c8c62b yes virbr0-nic
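Looking at the network from inside the pods, each of the two pods has an eth0 interface, and each eth0 is the peer of one of the host-side interfaces above, veth0f38c044 and vetha234d8db.
Enter the first pod: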
kubectl exec -it springboot-mybatis-deployment-5b78f66997-6kl96 -- /bin/sh
/ # ifconfig
eth0 Link encap:Ethernet HWaddr 22:85:63:4A:FB:92
inet addr:10.244.1.5 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::2085:63ff:fe4a:fb92/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:23 errors:0 dropped:0 overruns:0 frame:0
TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1830 (1.7 KiB) TX bytes:698 (698.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Enter the second pod:
kubectl exec -it springboot-mybatis-deployment-5b78f66997-9hdl8 -- /bin/sh
/ # ifconfig
eth0 Link encap:Ethernet HWaddr 92:23:79:4B:57:08
inet addr:10.244.1.4 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::9023:79ff:fe4b:5708/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:25 errors:0 dropped:0 overruns:0 frame:0
TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1962 (1.9 KiB) TX bytes:698 (698.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
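Each of these two pairs of virtual interfaces is a veth pair device. A veth pair always comes as two ends, and data sent into one end comes out of the other. One end of each pair is attached to the cni0 bridge and the other end sits inside one of the two Spring Boot containers, so the two containers can ping each other.
Note: On Linux, a bridge works at the data link layer, where frames are delivered by MAC address. So when 10.244.1.4 pings 10.244.1.5, 10.244.1.4 first sends an ARP broadcast to learn the MAC address of 10.244.1.5, and cni0 floods that broadcast to its attached ports. The packet itself then travels through veth0f38c044 to cni0, and cni0 forwards it to the port that leads to that MAC address. Once the data reaches vetha234d8db, it also appears on eth0 of 10.244.1.5, which completes delivery.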
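The forwarding behaviour described in the note can be mimicked with a toy learning bridge. This is a simplified model, not the kernel's bridge code: the class LearningBridge is hypothetical, and the host's own cni0 port is left out. The MAC addresses and port names are taken from the ifconfig output above.

```python
class LearningBridge:
    """Toy L2 bridge: learns source MACs per port, floods unknown or broadcast frames."""

    BROADCAST = "ff:ff:ff:ff:ff:ff"

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}   # MAC address -> port it was last seen on

    def forward(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out of."""
        self.mac_table[src_mac] = in_port              # learn where the sender lives
        out_port = self.mac_table.get(dst_mac)
        if dst_mac == self.BROADCAST or out_port is None:
            return [p for p in self.ports if p != in_port]   # flood to all other ports
        if out_port == in_port:
            return []                                  # destination is on the same port
        return [out_port]


POD_4_MAC = "92:23:79:4b:57:08"   # eth0 of 10.244.1.4
POD_5_MAC = "22:85:63:4a:fb:92"   # eth0 of 10.244.1.5

cni0 = LearningBridge(["veth0f38c044", "vetha234d8db"])
arp_request = cni0.forward("veth0f38c044", POD_4_MAC, LearningBridge.BROADCAST)
arp_reply = cni0.forward("vetha234d8db", POD_5_MAC, POD_4_MAC)
echo_request = cni0.forward("veth0f38c044", POD_4_MAC, POD_5_MAC)
print(arp_request, arp_reply, echo_request)
```

After the ARP exchange the bridge knows both MAC addresses, so the echo request is sent only to vetha234d8db instead of being flooded.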
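The reply travels the same path in reverse. The whole flow is summarized in the figure below:
Communication between containers on different hosts
Let's look at the network information of host 192.168.59.132: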
[root@master k8s]# ifconfig
cni0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.244.0.1 netmask 255.255.255.0 broadcast 0.0.0.0
inet6 fe80::144f:37ff:fe05:b44c prefixlen 64 scopeid 0x20<link>
ether 16:4f:37:05:b4:4c txqueuelen 1000 (Ethernet)
RX packets 125 bytes 12199 (11.9 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 132 bytes 41052 (40.0 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 0.0.0.0
ether 02:42:7c:31:8d:23 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.59.132 netmask 255.255.255.0 broadcast 192.168.59.255
inet6 fe80::59a7:26b1:56a7:4b00 prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:69:d6:60 txqueuelen 1000 (Ethernet)
RX packets 2199 bytes 237016 (231.4 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1741 bytes 684756 (668.7 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.244.0.0 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::c43c:eaff:fe58:99c9 prefixlen 64 scopeid 0x20<link>
ether c6:3c:ea:58:99:c9 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 26 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 91824 bytes 22983724 (21.9 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 91824 bytes 22983724 (21.9 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth7de5bd16: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet6 fe80::48e3:d9ff:feba:665e prefixlen 64 scopeid 0x20<link>
ether 4a:e3:d9:ba:66:5e txqueuelen 0 (Ethernet)
RX packets 65 bytes 7141 (6.9 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 80 bytes 20845 (20.3 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vethc1c4c291: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet6 fe80::78f2:cfff:fea7:9cdf prefixlen 64 scopeid 0x20<link>
ether 7a:f2:cf:a7:9c:df txqueuelen 0 (Ethernet)
RX packets 62 bytes 6988 (6.8 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 103 bytes 24534 (23.9 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
ether 52:54:00:42:d3:29 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
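In the cluster built earlier, both Spring Boot pods were scheduled onto 192.168.59.138. Suppose we now create a pod on the master node and, after the application is deployed, its IP address is 10.244.0.2. If the container at 10.244.1.4 wants to reach 10.244.0.2, the destination 10.244.0.2 is not in the subnet of the cni0 bridge on host 192.168.59.138, so the two cannot communicate directly through cni0 the way containers on the same host do.
A different routing rule has to be selected. The routing rules on 192.168.59.138 are as follows: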
[root@worker1 ~]# ip route
default via 192.168.59.2 dev ens33 proto dhcp metric 100
10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.59.0/24 dev ens33 proto kernel scope link src 192.168.59.139 metric 100
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1
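To leave this host, the packet has to match the second rule and go out through the flannel.1 device. In flannel's UDP mode, this role is played by a TUN device working at the network layer (layer 3), conventionally named flannel0. It is backed by the flanneld process on the host: its job is to pass IP packets between kernel space and user space, where flanneld handles them.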
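How the kernel picks the second rule can be approximated with a longest-prefix match over the routing table. The prefixes and device names below are copied from the ip route output of 192.168.59.138; the lookup function is a simplified sketch and ignores metrics, gateways, and policy routing.

```python
import ipaddress

# (prefix, output device), taken from the worker's routing table above.
ROUTES = [
    ("0.0.0.0/0", "ens33"),          # default route
    ("10.244.0.0/24", "flannel.1"),
    ("10.244.1.0/24", "cni0"),
    ("172.17.0.0/16", "docker0"),
    ("192.168.59.0/24", "ens33"),
    ("192.168.122.0/24", "virbr0"),
]


def lookup(dst, routes=ROUTES):
    """Return the device of the most specific route that contains dst."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(prefix), dev)
        for prefix, dev in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    # The default route always matches, so matches is never empty here.
    _, dev = max(matches, key=lambda match: match[0].prefixlen)
    return dev


print(lookup("10.244.0.2"))   # pod on the other host
print(lookup("10.244.1.5"))   # pod on this host
```

The remote pod address resolves to flannel.1, while the local pod address resolves to cni0, which is the split between the two sections of this article.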
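As mentioned earlier, the idea behind flannel is to create a subnet for each host. With these subnets, the next hop can be determined; in this cluster the two subnets are 10.244.1.0/24 and 10.244.0.0/24. So when a packet from container 10.244.1.4 has to be sent to container 10.244.0.2 on the other host, the subnet records maintained by the flanneld process identify the host that owns the destination subnet.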
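The subnet-to-host step amounts to a small lookup table. The mapping below is written by hand from the addresses in this article, and the name host_for is hypothetical; in a real cluster flanneld obtains these records from its shared store.

```python
import ipaddress

# Subnet owned by each host, as observed on the two nodes in this article.
SUBNET_TO_HOST = {
    ipaddress.ip_network("10.244.0.0/24"): "192.168.59.132",
    ipaddress.ip_network("10.244.1.0/24"): "192.168.59.138",
}


def host_for(container_ip):
    """Return the host that owns the subnet containing container_ip, or None."""
    addr = ipaddress.ip_address(container_ip)
    for subnet, host in SUBNET_TO_HOST.items():
        if addr in subnet:
            return host
    return None


print(host_for("10.244.0.2"))
```

The result is the outer destination address that the encapsulated packet is sent to.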
Now let's look at the routing rules on host 192.168.59.132:
[root@master k8s]# ip route
default via 192.168.59.2 dev ens33 proto dhcp metric 100
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.59.0/24 dev ens33 proto kernel scope link src 192.168.59.132 metric 100
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1
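When the packet from 10.244.1.4 arrives at host 192.168.59.132, it matches the second routing rule above (10.244.0.0/24 dev cni0) and is handed to the cni0 bridge on that host. From there, delivery works the same way as the same-host communication in the previous section, and the packet reaches the destination container.
In the process above, flanneld performs UDP encapsulation before the packet leaves the source container's host, and flanneld on the destination container's host decapsulates the packet when it arrives. The problem is that the data is copied too many times: from the Docker container to the cni0 bridge, from the cni0 bridge to the flanneld process, and from the flanneld process to the host's ens33 interface.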
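To solve this problem, flannel introduced VXLAN mode, in which the encapsulation and decapsulation above are done entirely in kernel space. Its approach is to build a layer 2 network on top of the existing layer 3 network by introducing a VTEP device. Looking back at the network of our two hosts, flannel.1 is exactly such a VTEP device: it has both an IP address and a MAC address, and it encapsulates and decapsulates layer 2 frames, which removes the copies between kernel space and user space. The whole communication process is shown in the figure below: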
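As a side calculation on VXLAN mode, the sketch below builds the 8-byte VXLAN header and works out why cni0 and flannel.1 report an MTU of 1450 in the ifconfig output. The header layout follows RFC 7348; the VNI value 1 is an assumption suggested by the device name flannel.1, and the 1500-byte underlay MTU is taken from ens33. The function names are hypothetical.

```python
import struct

# Per-packet overhead of VXLAN over IPv4, in bytes.
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14


def vxlan_header(vni):
    """Pack a VXLAN header: flags byte 0x08 (VNI valid), 24-bit VNI, rest reserved."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    flags_word = 0x08 << 24          # I flag in the first byte, three reserved bytes
    vni_word = vni << 8              # VNI in the top 24 bits, one reserved byte
    return struct.pack("!II", flags_word, vni_word)


def inner_mtu(underlay_mtu=1500):
    """Largest inner IP packet that still fits in one underlay packet."""
    return underlay_mtu - (OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET)


header = vxlan_header(1)
print(header.hex())   # 0800000000000100
print(inner_mtu())    # 1450
```

The 50 bytes of overhead are the outer IPv4 and UDP headers, the VXLAN header, and the inner Ethernet header that travels inside the tunnel.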
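Finally, how does the flannel network plugin set up networking for a container? In the previous article on deploying a Spring Boot application on a Kubernetes cluster, we created the pods with kubectl apply -f springboot-mybatis.yaml. The first container created in each pod is the infra container, whose purpose is to hold the network namespace. Once it is created, Kubernetes calls the CNI network plugin to configure networking for it, and the other containers in the pod share the infra container's network namespace.
Summary
Kubernetes' choice to manage networking through CNI plugins has some historical reasons behind it, but integrating CNI makes network configuration very convenient and lets Kubernetes itself focus on orchestration. My own knowledge is limited, so this is as far as I can go; if anything here is wrong, corrections are welcome.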