20. Kubernetes Basics - 36 - kubernetes-ha-containerd-06-dns-operations
The Complete Guide to CoreDNS Service Discovery and Cluster Availability Verification
Applicable scenarios: Kubernetes service discovery, cluster operations, node management
Author: Cloud-Native Architect | Updated: March 2026
Abstract
This article takes a deep look at deploying CoreDNS, the service discovery system of a Kubernetes cluster, at methods for verifying cluster availability, and at full-lifecycle node management. It dissects DNS resolution principles, the CoreDNS plugin architecture, health-check mechanics, cluster scaling strategies, and operational best practices for production. By the end, readers should have the core techniques and hands-on skills for operating a K8s cluster.
Keywords: Kubernetes; CoreDNS; service discovery; cluster verification; node management; scaling; operations
1. CoreDNS Deployment in Depth
1.1 Why DNS Matters in K8s
Kubernetes DNS service discovery is the foundation of in-cluster communication:
- Service discovery: reach a Service by its DNS name
- Pod discovery: locate individual Pods directly through a Headless Service
- External resolution: proxy queries for external domain names
┌──────────────────────────────────────────────────────────┐
│              Kubernetes DNS Resolution Flow               │
│                                                           │
│  Pod A wants to reach Pod B                               │
│        │                                                  │
│        ▼                                                  │
│  ┌───────────────────────────────────────────┐            │
│  │ Query: pod-b.default.svc.cluster.local    │            │
│  └────────────────────┬──────────────────────┘            │
│                       │ DNS query                         │
│                       ▼                                   │
│  ┌──────────────────┐                                     │
│  │     CoreDNS      │  (10.96.0.10)                       │
│  │                  │                                     │
│  │  ┌────────────┐  │                                     │
│  │  │ kubernetes │◄─┼── looks up Services/Endpoints       │
│  │  │   plugin   │  │                                     │
│  │  └────────────┘  │                                     │
│  │                  │                                     │
│  │  ┌────────────┐  │                                     │
│  │  │  forward   │◄─┼── forwards external domains         │
│  │  │   plugin   │  │                                     │
│  │  └────────────┘  │                                     │
│  └────────┬─────────┘                                     │
│           │ DNS response                                  │
│           ▼                                               │
│  Pod A receives Pod B's IP: 10.244.2.5                    │
└──────────────────────────────────────────────────────────┘
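The name forms appearing in this flow can be exercised from any Pod. A quick sketch (pod-b and my-headless are placeholders — the second lookup only resolves if such a Headless Service actually exists in your cluster):
# Service form: <service>.<namespace>.svc.cluster.local
kubectl run dns-check --image=busybox --rm -it --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
# Pod-behind-Headless-Service form: <pod-hostname>.<service>.<namespace>.svc.cluster.local
kubectl run dns-check --image=busybox --rm -it --restart=Never -- \
  nslookup pod-b.my-headless.default.svc.cluster.local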
1.2 CoreDNS Architecture
1.2.1 Plugin Architecture
CoreDNS is built as a chain of plugins, each handling one specific DNS concern:
| Plugin | Function | Example configuration |
|---|---|---|
| kubernetes | K8s service discovery | kubernetes cluster.local in-addr.arpa ip6.arpa |
| forward | Forward external queries | forward . /etc/resolv.conf |
| cache | DNS caching | cache 30 |
| loop | Detect forwarding loops | loop |
| reload | Hot-reload configuration | reload |
| health | Health checks | health |
| prometheus | Metrics | prometheus :9153 |
1.2.2 Deployment Architecture
┌──────────────────────────────────────────────────────────┐
│            CoreDNS High-Availability Deployment          │
│                                                           │
│  ┌──────────────────┐      ┌──────────────────┐           │
│  │  CoreDNS Pod 1   │      │  CoreDNS Pod 2   │           │
│  │  (10.244.1.5)    │      │  (10.244.2.8)    │           │
│  │                  │      │                  │           │
│  │  ┌────────────┐  │      │  ┌────────────┐  │           │
│  │  │  CoreDNS   │  │      │  │  CoreDNS   │  │           │
│  │  │  Process   │  │      │  │  Process   │  │           │
│  │  └────────────┘  │      │  └────────────┘  │           │
│  └────────┬─────────┘      └────────┬─────────┘           │
│           │                         │                     │
│           └──────────┬──────────────┘                     │
│                      │                                    │
│              ┌───────▼───────┐                            │
│              │    Service    │                            │
│              │   ClusterIP   │                            │
│              │  10.96.0.10   │                            │
│              └───────────────┘                            │
└──────────────────────────────────────────────────────────┘
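After the deployment in 1.3 below, the two replicas behind the ClusterIP can be confirmed from the Service's Endpoints (the IPs will differ in your cluster):
kubectl get endpoints kube-dns -n kube-system
# NAME       ENDPOINTS                                               AGE
# kube-dns   10.244.1.5:53,10.244.2.8:53,10.244.1.5:53 + 3 more...   2m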
1.3 Deploying CoreDNS
1.3.1 Service Account
apiVersion: v1
kind: ServiceAccount
metadata:
name: coredns
namespace: kube-system
labels:
k8s-app: kube-dns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
k8s-app: kube-dns
name: system:coredns
rules:
- apiGroups:
- ""
resources:
- endpoints
- services
- pods
- namespaces
verbs:
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
annotations:
rbac.authorization.kubernetes.io/autoupdate: "true"
labels:
k8s-app: kube-dns
name: system:coredns
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:coredns
subjects:
- kind: ServiceAccount
name: coredns
namespace: kube-system
1.3.2 ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: coredns
namespace: kube-system
labels:
io.kubernetes.plugin: kubernetes
data:
Corefile: |
.:53 {
errors
health {
lameduck 5s
}
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
prometheus :9153
forward . /etc/resolv.conf {
max_concurrent 1000
}
cache 30
loop
reload
loadbalance
}
Configuration breakdown:
.:53 {
    # Error logging
    errors
    # Health check (lameduck: drain gracefully for 5 seconds on shutdown)
    health {
        lameduck 5s
    }
    # Readiness check
    ready
    # K8s service discovery (authoritative for cluster.local)
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure                      # insecure mode (answer Pod records without verification)
        fallthrough in-addr.arpa ip6.arpa  # fall through to later plugins if reverse lookup misses
        ttl 30                             # DNS record TTL of 30 seconds
    }
    # Prometheus metrics
    prometheus :9153
    # Forward external domains (using the node's resolv.conf)
    forward . /etc/resolv.conf {
        max_concurrent 1000                # at most 1000 in-flight upstream queries
    }
    # DNS cache (30 seconds)
    cache 30
    # Detect forwarding loops
    loop
    # Hot-reload the Corefile on change
    reload
    # Round-robin records when a name has multiple endpoints
    loadbalance
}
1.3.3 Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: coredns
namespace: kube-system
labels:
k8s-app: kube-dns
kubernetes.io/name: "CoreDNS"
spec:
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
selector:
matchLabels:
k8s-app: kube-dns
template:
metadata:
labels:
k8s-app: kube-dns
spec:
serviceAccountName: coredns
priorityClassName: system-cluster-critical
tolerations:
- key: "CriticalAddonsOnly"
operator: "Exists"
containers:
- name: coredns
image: registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.10.1
imagePullPolicy: IfNotPresent
resources:
limits:
memory: 170Mi
requests:
cpu: 100m
memory: 70Mi
args: [ "-conf", "/etc/coredns/Corefile" ]
volumeMounts:
- name: config-volume
mountPath: /etc/coredns
readOnly: true
ports:
- containerPort: 53
name: dns
protocol: UDP
- containerPort: 53
name: dns-tcp
protocol: TCP
- containerPort: 9153
name: metrics
protocol: TCP
livenessProbe:
httpGet:
path: /health
port: 8080
scheme: HTTP
initialDelaySeconds: 60
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 5
readinessProbe:
httpGet:
path: /ready
port: 8181
scheme: HTTP
securityContext:
allowPrivilegeEscalation: false
capabilities:
add:
- NET_BIND_SERVICE
drop:
- all
readOnlyRootFilesystem: true
dnsPolicy: Default
volumes:
- name: config-volume
configMap:
name: coredns
items:
- key: Corefile
path: Corefile
Key settings:
replicas: 2                                  # high availability: at least 2 replicas
priorityClassName: system-cluster-critical   # high scheduling priority
tolerations:                                 # tolerate the CriticalAddonsOnly taint
- key: "CriticalAddonsOnly"
  operator: "Exists"
resources:                                   # resource limits
  limits:
    memory: 170Mi
  requests:
    cpu: 100m
    memory: 70Mi
livenessProbe:                               # liveness probe (backed by the health plugin)
  httpGet:
    path: /health
    port: 8080
readinessProbe:                              # readiness probe (backed by the ready plugin)
  httpGet:
    path: /ready
    port: 8181
1.3.4 Service
apiVersion: v1
kind: Service
metadata:
name: kube-dns
namespace: kube-system
annotations:
prometheus.io/port: "9153"
prometheus.io/scrape: "true"
labels:
k8s-app: kube-dns
kubernetes.io/cluster-service: "true"
kubernetes.io/name: "CoreDNS"
spec:
selector:
k8s-app: kube-dns
  clusterIP: 10.96.0.10  # fixed DNS Service IP (must match kubelet's clusterDNS setting)
ports:
- name: dns
port: 53
protocol: UDP
- name: dns-tcp
port: 53
protocol: TCP
- name: metrics
port: 9153
protocol: TCP
1.3.5 Deployment Steps
# Apply the manifests
kubectl apply -f coredns-sa.yaml
kubectl apply -f coredns-configmap.yaml
kubectl apply -f coredns-deployment.yaml
kubectl apply -f coredns-service.yaml
# Verify the deployment
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Output:
# NAME                       READY   STATUS    RESTARTS   AGE
# coredns-5d5f4d6c5f-abc12   1/1     Running   0          2m
# coredns-5d5f4d6c5f-def34   1/1     Running   0          2m
# Inspect the Service
kubectl get svc -n kube-system kube-dns
# Output:
# NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
# kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   2m
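If a Pod stays Pending or crash-loops here, the container logs are the first place to look; a quick check:
kubectl logs -n kube-system -l k8s-app=kube-dns
# On a healthy start CoreDNS prints its version and listening zone, e.g.:
# .:53
# CoreDNS-1.10.1
# A "Loop ... detected" line instead means the loop plugin found a forwarding
# loop in the node's /etc/resolv.conf (commonly systemd-resolved's 127.0.0.53)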
1.4 Performance Tuning
1.4.1 Cache Tuning
# Corefile configuration
cache 30 {
    success 9984 30   # cache up to 9984 successful responses for 30 seconds
    denial 9984 30    # cache up to 9984 negative responses for 30 seconds
}
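Whether the cache is actually absorbing load can be confirmed against the prometheus plugin's endpoint. A minimal sketch, assuming the metrics port 9153 from the Deployment above:
# Forward the metrics port of one CoreDNS replica and inspect the cache counters
kubectl -n kube-system port-forward deployment/coredns 9153:9153 &
sleep 2
curl -s http://127.0.0.1:9153/metrics | grep coredns_cache
# coredns_cache_hits_total rising much faster than coredns_cache_misses_total
# indicates the 30-second cache is doing its job
kill %1  # stop the background port-forward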
1.4.2 Concurrency Tuning
# Raise the forwarding concurrency
forward . /etc/resolv.conf {
    max_concurrent 2000   # at most 2000 in-flight upstream queries
}
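Per-instance tuning aside, CoreDNS also scales horizontally; raising the replica count spreads query load across more Pods (4 is an arbitrary example, size it to your cluster):
kubectl -n kube-system scale deployment coredns --replicas=4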
2. Cluster Availability Verification
2.1 Component Health Checks
2.1.1 Checking Control-Plane Components
# Check component status (deprecated since v1.19, but still a quick sanity check)
kubectl get componentstatuses
# Output:
# NAME                 STATUS    MESSAGE              ERROR
# controller-manager   Healthy   ok
# scheduler            Healthy   ok
# etcd-0               Healthy   {"health":"true"}
# etcd-1               Healthy   {"health":"true"}
# etcd-2               Healthy   {"health":"true"}
# Check API Server health
curl -k https://192.168.1.100:6443/healthz
# Output: ok
# Check etcd health
etcdctl endpoint health \
  --endpoints=https://192.168.1.20:2379,https://192.168.1.21:2379,https://192.168.1.22:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.pem \
  --cert=/etc/kubernetes/pki/etcd/server.pem \
  --key=/etc/kubernetes/pki/etcd/server-key.pem
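Because componentstatuses is deprecated, the API server's own aggregated health endpoints are the more future-proof check; they break the result down per internal component:
# Per-check health detail straight from the API server
kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'
# Output (abridged):
# [+] ping ok
# [+] etcd ok
# ...
# readyz check passed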
2.1.2 Checking Node Status
# List all nodes
kubectl get nodes
# Output:
# NAME        STATUS   ROLES           AGE   VERSION
# master-01   Ready    control-plane   10d   v1.28.0
# master-02   Ready    control-plane   10d   v1.28.0
# master-03   Ready    control-plane   10d   v1.28.0
# worker-01   Ready    <none>          10d   v1.28.0
# worker-02   Ready    <none>          10d   v1.28.0
# worker-03   Ready    <none>          10d   v1.28.0
# Inspect a node in detail
kubectl describe node worker-01
# Check node resource usage (requires metrics-server)
kubectl top nodes
2.2 Network Connectivity Tests
2.2.1 Cross-Node Pod Communication
# Create test Pods
kubectl run test-pod-1 --image=busybox --command -- sleep 3600
kubectl run test-pod-2 --image=busybox --command -- sleep 3600
# Get the Pod IPs (and the nodes they landed on)
kubectl get pods -o wide
# Test cross-node communication
kubectl exec test-pod-1 -- ping -c 3 <test-pod-2-ip>
# Output:
# 3 packets transmitted, 3 received, 0% packet loss
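Note that plain kubectl run leaves placement to the scheduler, so both test Pods can land on the same node, proving nothing about cross-node traffic. To force genuinely cross-node placement, pin each Pod explicitly (the node names are assumptions from this cluster's naming):
kubectl run test-pod-1 --image=busybox \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"worker-01"}}' \
  --command -- sleep 3600
kubectl run test-pod-2 --image=busybox \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"worker-02"}}' \
  --command -- sleep 3600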
2.2.2 Service Access Test
# Create a test Pod and expose it as a Service
kubectl run nginx --image=nginx
kubectl expose pod nginx --port=80 --name=nginx-test
# Get the ClusterIP
kubectl get svc nginx-test
# Test from inside a Pod
kubectl run busybox --image=busybox --rm -it --restart=Never -- \
  wget -O- http://nginx-test.default.svc.cluster.local
# Output:
# Connecting to nginx-test.default.svc.cluster.local (10.96.100.10:80)
# writing to stdout ... (the nginx welcome page)
2.2.3 DNS Resolution Test
# Test in-cluster DNS resolution
kubectl run busybox --image=busybox --rm -it --restart=Never -- \
  nslookup kubernetes.default
# Output:
# Server:    10.96.0.10
# Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
#
# Name:      kubernetes.default
# Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
# Test external domain resolution
kubectl run busybox --image=busybox --rm -it --restart=Never -- \
  nslookup www.baidu.com
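When results differ between replicas, it helps to query one CoreDNS instance directly instead of the Service VIP; busybox's nslookup takes the server as a second argument (the Pod IP below is a placeholder — take a real one from kubectl get pods -o wide):
kubectl run dns-direct --image=busybox --rm -it --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local 10.244.1.5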
2.3 High-Availability Verification
2.3.1 Simulating a Master Node Failure
# Stop the API server on one master (this series runs it as a systemd service)
ssh master-02 "systemctl stop kube-apiserver"
# Verify the cluster is still usable
kubectl get nodes
kubectl get pods -A
# Verify API Server response time
time kubectl get pods
# Output: normal response time (<1 second)
# Restore the node
ssh master-02 "systemctl start kube-apiserver"
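One-off commands can miss a brief outage; polling the API once a second for the duration of the failover gives a much clearer availability picture. A minimal sketch:
# Run in a second terminal while stopping/starting kube-apiserver
while true; do
  if kubectl get nodes >/dev/null 2>&1; then
    echo "$(date +%T) API OK"
  else
    echo "$(date +%T) API FAILED"
  fi
  sleep 1
done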
2.3.2 Simulating a CoreDNS Failure
# Delete one CoreDNS Pod (pick a name from the listing; deleting by label
# would remove all replicas at once and defeat the purpose of the test)
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl delete pod -n kube-system coredns-5d5f4d6c5f-abc12
# Verify DNS resolution is unaffected (served by the surviving replica)
kubectl run busybox --image=busybox --rm -it --restart=Never -- \
  nslookup kubernetes.default
3. Full-Lifecycle Node Management
3.1 Adding a Worker Node
3.1.1 Preparing the New Node
# Run on the new node
# 1. System tuning
cat > /etc/sysctl.d/99-kubernetes.conf <<EOF
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_forward = 1
fs.file-max = 2097152
EOF
sysctl --system
# 2. Disable swap
swapoff -a
sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
# 3. Install containerd
# (see document 4 in this series)
# 4. Install kubelet, kubeadm, kubectl
wget https://dl.k8s.io/release/v1.28.0/bin/linux/amd64/kubelet
wget https://dl.k8s.io/release/v1.28.0/bin/linux/amd64/kubeadm
wget https://dl.k8s.io/release/v1.28.0/bin/linux/amd64/kubectl
chmod +x kubelet kubeadm kubectl
mv kubelet kubeadm kubectl /usr/local/bin/
3.1.2 Generating a Join Token
# Run on a master node
# Generate a token together with the full join command
kubeadm token create --print-join-command
# Output:
# kubeadm join 192.168.1.100:6443 --token abcdef.0123456789abcdef \
#     --discovery-token-ca-cert-hash sha256:1234567890abcdef...
# If the token has expired, generate a new one
kubeadm token create
# Compute the CA certificate hash by hand
# (the file is ca.crt on standard kubeadm installs; adjust if your CA is ca.pem)
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | \
  openssl rsa -pubin -outform der 2>/dev/null | \
  openssl dgst -sha256 -hex | sed 's/^.* //'
3.1.3 Joining the Cluster
# Run the join command on the new node
kubeadm join 192.168.1.100:6443 \
  --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:1234567890abcdef...
# Output:
# This node has joined the cluster:
# * Certificate signing request was sent to apiserver and a response was received.
# * The Kubelet was informed of the new secure connection details.
# Verify from a master node
kubectl get nodes
# The new node should show up as Ready
3.2 Node Scaling
3.2.1 Scaling Out
# Batch node-join script (printf pads the index so worker-04 .. worker-10 resolve correctly)
for i in $(seq 4 10); do
  node=worker-$(printf "%02d" "$i")
  ssh "$node" "kubeadm join 192.168.1.100:6443 \
    --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:1234567890abcdef..."
done
# Verify
kubectl get nodes
3.2.2 Scaling In
# 1. Evict the Pods running on the node
kubectl drain worker-10 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force
# 2. Remove the node from the cluster
kubectl delete node worker-10
# 3. Reset kubeadm state on the node itself
kubeadm reset -f
# 4. Clean up leftover configuration
rm -rf /etc/kubernetes/
rm -rf /var/lib/kubelet/
rm -rf /var/lib/etcd/
3.3 Node Maintenance
3.3.1 Marking a Node Unschedulable
# Mark the node unschedulable (SchedulingDisabled)
kubectl cordon worker-01
# Check node status
kubectl get nodes
# Output:
# NAME        STATUS                     ROLES    AGE   VERSION
# worker-01   Ready,SchedulingDisabled   <none>   10d   v1.28.0
# Make the node schedulable again
kubectl uncordon worker-01
3.3.2 Maintenance Mode
# Evict Pods and mark the node for maintenance
kubectl drain worker-01 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force
# Perform the maintenance work (e.g. kernel upgrade, hardware swap)
ssh worker-01 "apt-get update && apt-get upgrade -y"
# Bring the node back
kubectl uncordon worker-01
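Before touching the hardware, it is worth confirming the drain actually emptied the node; only DaemonSet-managed Pods should remain:
kubectl get pods -A -o wide --field-selector spec.nodeName=worker-01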
3.4 Node Failure Recovery
3.4.1 A Node Stuck in NotReady
# Check node status
kubectl get nodes
# Output:
# NAME        STATUS     ROLES    AGE   VERSION
# worker-01   NotReady   <none>   10d   v1.28.0
# Troubleshooting steps:
# 1. Inspect the node in detail
kubectl describe node worker-01
# 2. Check kubelet status
ssh worker-01 "systemctl status kubelet"
# 3. Check kubelet logs
ssh worker-01 "journalctl -u kubelet -f"
# 4. Check network connectivity to the control plane
ssh worker-01 "ping 192.168.1.100"
# 5. Check the kubelet certificates
ssh worker-01 "ls -la /var/lib/kubelet/pki/"
# Remediation:
# 1. Restart kubelet
ssh worker-01 "systemctl restart kubelet"
# 2. If the certificates have expired, rejoin the cluster
ssh worker-01 "kubeadm reset -f"
# then run kubeadm join again
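Since this series runs on containerd, the container runtime is another frequent cause of NotReady and belongs in the checklist (the socket path assumes containerd's default):
# 6. Check the container runtime
ssh worker-01 "systemctl status containerd"
ssh worker-01 "crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps"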
3.4.2 Force-Deleting a Failed Node
# If the node is permanently lost
kubectl delete node worker-01 --force --grace-period=0
# Remove the node's record from etcd (use with extreme caution — kubectl delete
# node is normally sufficient; pass the same cert flags as in 2.1.1)
etcdctl del /registry/minions/worker-01 \
  --cacert=/etc/kubernetes/pki/etcd/ca.pem \
  --cert=/etc/kubernetes/pki/etcd/server.pem \
  --key=/etc/kubernetes/pki/etcd/server-key.pem
4. Production Best Practices
4.1 Monitoring and Alerting
4.1.1 Prometheus Metrics
# prometheus.yml configuration
scrape_configs:
- job_name: 'coredns'
static_configs:
- targets: ['coredns:9153']
- job_name: 'kubernetes-nodes'
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
Key monitoring metrics:
# CoreDNS cache hit rate
coredns_cache_hits_total
coredns_cache_misses_total
# DNS request latency
coredns_dns_request_duration_seconds
# Request rate
rate(coredns_dns_requests_total[5m])
# Node metrics (node_exporter)
node_cpu_seconds_total
node_memory_MemAvailable_bytes
node_filesystem_avail_bytes
node_network_receive_bytes_total
# Cluster metrics (kube-state-metrics)
kube_node_status_condition
kube_pod_status_phase
kube_deployment_status_replicas_available
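From the two cache counters, a derived cache hit ratio is usually more readable on a dashboard than the raw series; a sketch in PromQL:
# Fraction of responses served from cache over the last 5 minutes
sum(rate(coredns_cache_hits_total[5m]))
  /
(sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m])))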
4.1.2 Alerting Rules
# alerting_rules.yml
groups:
- name: kubernetes-alerts
rules:
  - alert: CoreDNSDown
    expr: up{job="coredns"} == 0  # must match the scrape job name in 4.1.1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "CoreDNS instance down"
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.node }} is not ready"
  - alert: HighNodeCPU
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.instance }} CPU usage is above 80%"
4.2 Backup Strategy
4.2.1 Periodic etcd Backups
#!/bin/bash
# /opt/backup/etcd-backup.sh
BACKUP_DIR="/opt/backup/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
# Create the snapshot
etcdctl snapshot save ${BACKUP_DIR}/snapshot-${DATE}.db \
--cacert=/etc/kubernetes/pki/etcd/ca.pem \
--cert=/etc/kubernetes/pki/etcd/server.pem \
--key=/etc/kubernetes/pki/etcd/server-key.pem
# Delete backups older than 7 days
find ${BACKUP_DIR} -name "snapshot-*.db" -mtime +7 -delete
# Upload to object storage (optional)
aws s3 cp ${BACKUP_DIR}/snapshot-${DATE}.db s3://backup-bucket/etcd/
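A snapshot that cannot be restored is no backup at all, so verifying each one costs little and is worth adding to the script (etcdctl snapshot status still works but is deprecated since etcd 3.5 in favor of etcdutl):
# Verify the snapshot: prints hash, revision, key count and size
etcdctl snapshot status ${BACKUP_DIR}/snapshot-${DATE}.db --write-out=table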
4.2.2 Configuration File Backups
# Back up the Kubernetes configuration
tar -czf /opt/backup/k8s-config-$(date +%Y%m%d).tar.gz \
/etc/kubernetes/ \
/var/lib/kubelet/ \
/etc/cni/ \
/opt/cni/
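Both backups only earn their keep when they run unattended; a cron schedule is the usual glue. A sketch, assuming the script path from 4.2.1 (note that % must be escaped in crontab entries):
# /etc/cron.d/k8s-backup
# etcd snapshot daily at 02:00
0 2 * * * root /opt/backup/etcd-backup.sh
# config archive weekly, Sunday 03:00
0 3 * * 0 root tar -czf /opt/backup/k8s-config-$(date +\%Y\%m\%d).tar.gz /etc/kubernetes/ /etc/cni/ /opt/cni/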
5. Summary
This article covered CoreDNS deployment, cluster availability verification, and full-lifecycle node management for a Kubernetes cluster, including:
- CoreDNS architecture and production deployment
- The DNS resolution flow and performance tuning
- Cluster health checks and high-availability verification
- Node scaling and maintenance operations
- Monitoring, alerting, and backup strategy
Mastering these operational techniques is key to keeping a K8s cluster running reliably.