Problem description
After an unexpected power-off, the Kubernetes cluster went down. Every kubectl command failed with: The connection to the server ip:6443 was refused – did you specify the right host or port. Restarting kubelet did not help, and etcd failed while reading its data because the data file was corrupted.
Troubleshooting
First, check whether the kubelet service is running and whether it is reporting errors.
[root@pengfei-master1 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since 三 2023-07-19 17:47:06 CST; 35min ago
     Docs: https://kubernetes.io/docs/
 Main PID: 3087 (kubelet)
    Tasks: 14
   Memory: 46.8M
   CGroup: /system.slice/kubelet.service
           └─3087 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/con...
7月 19 18:22:36 pengfei-master1 kubelet[3087]: E0719 18:22:36.296785 3087 kubelet.go:2448] "Error getting node" err="node \"pengfei-master1\" not found"
7月 19 18:22:37 pengfei-master1 kubelet[3087]: E0719 18:22:37.105683 3087 kubelet.go:2448] "Error getting node" err="node \"pengfei-master1\" not found"
7月 19 18:22:37 pengfei-master1 kubelet[3087]: E0719 18:22:37.149596 3087 eviction_manager.go:256] "Eviction manager: failed to get summary stats" err="fa...not found"
kubelet is running, but it keeps reporting that it cannot find the master node. Checking the kubelet journal shows the same errors.
[root@pengfei-master1 ~]# journalctl -u kubelet
7月 19 17:34:00 pengfei-master1 kubelet[3480]: E0719 17:34:00.711540 3480 kubelet.go:2448] "Error getting node" err="node \"pengfei-master1\" not found"
7月 19 17:34:00 pengfei-master1 kubelet[3480]: E0719 17:34:00.813511 3480 kubelet.go:2448] "Error getting node" err="node \"pengfei-master1\" not found"
Next, confirm whether the api-server is down; if it is, check its logs.
Check whether the apiserver is down
Note: if you are using docker, run docker ps -a | grep kube-apiserver instead.
[root@pengfei-master1 ~]# crictl ps -a | grep kube-apiserver
I0719 18:02:28.083463 4551 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/run/containerd/containerd.sock" URL="unix:///run/containerd/containerd.sock"
e6284e624f40a   4d2edfd10d3e3   2 minutes ago   Exited   kube-apiserver   62   0b9b24371d25f   kube-apiserver-pengfei-master1
As you can see, the apiserver container has exited.
Check whether etcd is down
Note: if you are using docker, run docker ps -a | grep etcd instead.
[root@pengfei-master1 ~]# crictl ps -a | grep etcd
I0719 18:03:10.350597 4563 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/run/containerd/containerd.sock" URL="unix:///run/containerd/containerd.sock"
9bc45fb63f604   a8a176a5d5d69   5 minutes ago   Exited   etcd   90   614e62c0c8ed0   etcd-pengfei-master1
etcd has exited as well.
Next, look at the etcd logs to find out why it exited.
The logs live under /var/log/pods/kube-system_etcd (use tab completion here — the directory name contains the pod UID, so it differs per machine; open it, check the container log for errors, and handle the problem the log points to). On my machine the path is /var/log/pods/kube-system_etcd-pengfei-master1_df91a367268810494a84463207726090/etcd
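For example, to find the directory and see the most recent errors (the pod UID in the path differs per cluster, so the wildcard below is only an illustration):
# List the etcd pod log directories (the UID part varies per cluster)
ls -d /var/log/pods/kube-system_etcd-*
# Tail the newest etcd container log to see why it keeps exiting
tail -n 50 $(ls -t /var/log/pods/kube-system_etcd-*/etcd/*.log | head -n 1)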
The error in the log:
2023-07-19T18:03:17.046456705+08:00 stderr F panic: freepages: failed to get all reachable pages (page 700: multiple references)
2023-07-19T18:03:17.046481181+08:00 stderr F
2023-07-19T18:03:17.046483923+08:00 stderr F goroutine 78 [running]:
2023-07-19T18:03:17.046485818+08:00 stderr F go.etcd.io/bbolt.(*DB).freepages.func2(0xc00007c5a0)
2023-07-19T18:03:17.046487431+08:00 stderr F     /go/pkg/mod/go.etcd.io/[email protected]/db.go:1056 +0xe9
2023-07-19T18:03:17.046524506+08:00 stderr F created by go.etcd.io/bbolt.(*DB).freepages
2023-07-19T18:03:17.046528444+08:00 stderr F     /go/pkg/mod/go.etcd.io/[email protected]/db.go:1054 +0x1cd
etcd hit an error while reading its data and failed to start; as a result, the api-server cannot start either.
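If you want to double-check that the boltdb file itself is damaged, the bbolt command-line tool from go.etcd.io/bbolt can inspect it directly. A minimal sketch, assuming a Go toolchain is installed and etcd uses the default kubeadm data directory /var/lib/etcd:
# Install the bbolt CLI (requires Go)
go install go.etcd.io/bbolt/cmd/bbolt@latest
# Check the etcd boltdb file for consistency errors
~/go/bin/bbolt check /var/lib/etcd/member/snap/db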
The etcd data file is corrupted, so the data would have to be restored from a backup. This is a lab environment and I had never set up etcd backups, so the only option left was to reset the cluster.
Note: if you run etcd in production, make sure it is highly available and backed up regularly, otherwise a failure like this is a disaster.
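For reference, a basic periodic backup is just an etcdctl snapshot taken on the master, e.g. from a cron job. A minimal sketch, assuming the standard kubeadm certificate paths under /etc/kubernetes/pki/etcd and a hypothetical backup directory /backup/etcd:
# Take a timestamped snapshot of etcd (run on the master / etcd node)
mkdir -p /backup/etcd
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd/snapshot-$(date +%Y%m%d-%H%M%S).db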
Reset the k8s cluster
Run this on every machine:
kubeadm reset
Delete $HOME/.kube:
rm -rf $HOME/.kube
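kubeadm reset also warns that it does not clean up CNI configuration or iptables rules; if the reinstall misbehaves, you may additionally want to clear that leftover state. A hedged sketch — review it against your own network setup before running:
# Remove leftover CNI configuration from the old cluster
rm -rf /etc/cni/net.d
# Flush iptables rules left behind by kube-proxy and the old CNI
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X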
Initialize the cluster
Related article: Installing a Kubernetes 1.25.0 cluster based on containerd and the calico network plugin
Run this on the master node only:
kubeadm init --config=kubeadm.yaml --ignore-preflight-errors=SystemVerification
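The kubeadm.yaml used here comes from the installation article linked above and is not reproduced in this post. Purely as an illustration, a minimal config for this kind of setup might look roughly like the sketch below; the advertise address, Kubernetes version, CRI socket and subnets are assumptions that must match your own environment and the calico configuration:
cat > kubeadm.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.5.132   # master IP, as used in the join commands below
nodeRegistration:
  criSocket: unix:///run/containerd/containerd.sock
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.25.0          # assumed; matches the linked install article
networking:
  podSubnet: 10.244.0.0/16          # must match the calico IP pool you deploy
  serviceSubnet: 10.96.0.0/12
EOF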
Create the required files:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Rejoin the worker nodes to the cluster
Print the command for joining a worker node to the cluster:
kubeadm token create --print-join-command
On worker node node1, run:
kubeadm join 192.168.5.132:6443 --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:843aae90033aa0f5a3bff1fc8fc977aeea2f423e50b5b991dfb0f5c9971a3c1b
On worker node node2, run:
kubeadm join 192.168.5.132:6443 --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:843aae90033aa0f5a3bff1fc8fc977aeea2f423e50b5b991dfb0f5c9971a3c1b
Check the node status:
kubectl get node
At this point the cluster nodes are back to normal.
However, looking at the system components in kube-system, calico and coredns are not running yet.
Reinstall calico
# Download the calico yaml manifest; it can be applied as-is, no changes needed
wget https://docs.projectcalico.org/manifests/calico.yaml
# Install calico
kubectl apply -f calico.yaml
# Check the system components again
kubectl get pod -n kube-system
Test calico and coredns
Run the following commands:
# Start a busybox pod for testing and open a shell in it
kubectl run busybox --image docker.io/library/busybox:1.28 --image-pull-policy=IfNotPresent --restart=Never --rm -it -- sh
# Inside the pod: test the calico pod network
ping www.baidu.com
# Inside the pod: test dns resolution
nslookup kubernetes.default.svc.cluster.local
Only now is the cluster fully back to normal.
If you want to be able to run kubectl on every node, simply copy the master's $HOME/.kube directory to the other nodes.
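For example, from the master (node1 below is a hypothetical worker hostname — replace it with your own):
# Copy the kubeconfig directory from the master to a worker node over SSH
scp -r $HOME/.kube node1:$HOME/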
etcd backup and restore
This will be covered in detail in the next article.