
Troubleshooting an abnormal Calico network component state in a Kubernetes cluster

16 Nov
Author: admin | Category: Container Virtualization

On a newly added k8s node, the components start successfully, but calico fails to start and the node cannot join the cluster.

After adding new master and node nodes and starting the related components, the calico pod stayed in CrashLoopBackOff. The pod's error events were as follows.
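The events below were pulled by describing the failing pod (the pod name is taken from the scheduler event):

# Inspect the lifecycle events of the failing calico pod
kubectl describe pod calico-node-89p55 -n kube-system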

Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  12m                     default-scheduler  Successfully assigned kube-system/calico-node-89p55 to k8s-20-53
  Normal   Pulled     12m                     kubelet            Container image "calico/cni:v3.15.1" already present on machine
  Normal   Started    12m                     kubelet            Started container upgrade-ipam
  Normal   Created    12m                     kubelet            Created container upgrade-ipam
  Normal   Started    12m                     kubelet            Started container install-cni
  Normal   Pulled     12m                     kubelet            Container image "calico/cni:v3.15.1" already present on machine
  Normal   Created    12m                     kubelet            Created container install-cni
  Normal   Started    12m                     kubelet            Started container flexvol-driver
  Normal   Pulled     12m                     kubelet            Container image "calico/pod2daemon-flexvol:v3.15.1" already present on machine
  Normal   Created    12m                     kubelet            Created container flexvol-driver
  Normal   Pulled     12m                     kubelet            Container image "calico/node:v3.15.1" already present on machine
  Normal   Created    12m                     kubelet            Created container calico-node
  Normal   Started    12m                     kubelet            Started container calico-node
  Warning  Unhealthy  12m (x6 over 12m)       kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
  Normal   Killing    11m                     kubelet            Container calico-node failed liveness probe, will be restarted
  Warning  Unhealthy  7m50s (x25 over 12m)    kubelet            Liveness probe failed: calico/node is not ready: bird/confd is not live: exit status 1
  Warning  BackOff    2m49s (x12 over 5m58s)  kubelet            Back-off restarting failed container
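The failing probes in these events are the exec probes defined in the standard calico-node manifest, which ask BIRD and Felix whether they are healthy. As a quick check they can be run by hand inside the container to reproduce the error, though this only works in the window between restarts while the container is up:

# Run the readiness check manually inside the calico-node container
kubectl exec -n kube-system calico-node-89p55 -c calico-node -- /bin/calico-node -bird-ready -felix-ready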

在这里插入图片描述

The calico pod logs are shown below; the connection to the apiserver keeps timing out:

[root@k8s-20-49 ~]# kubectl logs -f calico-node-89p55 -n kube-system
2021-09-16 00:56:52.987 [INFO][9] startup/startup.go 299: Early log level set to info
2021-09-16 00:56:52.987 [INFO][9] startup/startup.go 315: Using NODENAME environment for node name
2021-09-16 00:56:52.987 [INFO][9] startup/startup.go 327: Determined node name: k8s-20-53
2021-09-16 00:56:52.991 [INFO][9] startup/startup.go 359: Checking datastore connection
2021-09-16 00:57:22.992 [INFO][9] startup/startup.go 374: Hit error connecting to datastore - retry error=Get https://10.0.0.1:443/api/v1/nodes/foo: dial tcp 10.0.0.1:443: i/o timeout
2021-09-16 00:57:53.993 [INFO][9] startup/startup.go 374: Hit error connecting to datastore - retry error=Get https://10.0.0.1:443/api/v1/nodes/foo: dial tcp 10.0.0.1:443: i/o timeout
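10.0.0.1:443 is the ClusterIP of the default kubernetes Service, and traffic to it only flows if kube-proxy has programmed the corresponding iptables/IPVS rules on the node. Here is a minimal connectivity sketch to run on the problem node (the master IP placeholder and the default secure port 6443 are assumptions; substitute your own):

# The ClusterIP path calico uses: anything but a timeout (even a 403) means routing works
curl -k --connect-timeout 5 https://10.0.0.1:443/version

# Check whether kube-proxy has written rules for the kubernetes Service
iptables -t nat -L KUBE-SERVICES -n | grep 10.0.0.1
ipvsadm -Ln | grep -A1 '10.0.0.1'   # if kube-proxy runs in IPVS mode

# The apiserver's real address should always be reachable directly
curl -k --connect-timeout 5 https://<master-ip>:6443/version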

Oddly, even though calico was not actually running successfully, the pod status still showed Running.
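This can be confirmed without the screenshot; Running only means the container process was started, while the failed readiness probe shows up in the READY column (0/1):

# RESTARTS keeps climbing while STATUS alternates between Running and CrashLoopBackOff
kubectl get pods -n kube-system -o wide | grep calico-node-89p55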

Next I checked the cluster's nodes and found that the problematic new node was, likewise, reported as Ready.
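Checking is straightforward:

kubectl get nodes -o wide

The node likely shows Ready because the install-cni init container had already written the CNI configuration to the node, which is all kubelet's network-readiness check looks at; kubelet has no visibility into BIRD being unhealthy inside calico-node.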

Analyzing these two error logs together with other reliable information, and after many rounds of troubleshooting and retries (reinstalling, searching the various error messages online), I narrowed the problem down to three possible causes (command sketches for each follow the list):

  • 1. The new node's IP address was not included in the hosts field (the SANs) of the apiserver certificate, causing calico to error out and the node to be unable to join the cluster.
  • 2. kubelet was started before kube-proxy was deployed or started on the new node, and the master approved the node into the cluster anyway; without kube-proxy, the Service ClusterIP (the 10.0.0.1:443 seen above) is never programmed into iptables/IPVS on the node, so calico cannot reach the apiserver, fails to start, and the node cannot properly join.
  • 3. The new node's IP is in the apiserver certificate and kube-proxy is running everywhere, yet calico still reports the error above; in that case the fix was to restart the kube-apiserver, kubelet, and kube-proxy services on all master and node machines.
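Below is a sketch of checks and fixes for each cause, assuming a binary (non-kubeadm) deployment with systemd-managed services; the certificate path /etc/kubernetes/ssl/kube-apiserver.pem is an assumption, adjust it to your layout:

# 1. Verify the new node's IP appears in the apiserver certificate SANs;
#    if it is missing, regenerate the cert with the IP added to the hosts
#    field and restart kube-apiserver
openssl x509 -in /etc/kubernetes/ssl/kube-apiserver.pem -noout -text \
  | grep -A1 'Subject Alternative Name'

# 2. Make sure kube-proxy is deployed and running on the new node
systemctl status kube-proxy
systemctl enable --now kube-proxy

# 3. If both checks pass and calico still fails, restart the services
systemctl restart kube-apiserver          # on every master
systemctl restart kubelet kube-proxy      # on every master and node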

After working through the three scenarios above, the new kubernetes node was fully ready.
