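Building an Enterprise-Grade DevOps Simulation Lab: 013 Unified Base Configuration for All Nodes and Batch Distribution of kubeconfig
Prerequisite: passwordless SSH login is already configured on all 9 nodes. Every operation below is executed remotely in batch from ControlNodeA (192.168.0.151); there is no need to log in to any other node by hand.
Building on that passwordless SSH setup, this article performs the base system initialization and a unified containerd v2.2.1 installation on the 8 nodes that are not yet configured, then sets up tiered kubeconfig permissions and distributes them in batch, so that every node meets the mandatory prerequisites for joining a Kubernetes cluster.
1. Lab Environment and Per-Node Configuration Tasks
The lab runs on the Proxmox VE virtualization platform, and every node runs Ubuntu 22.04 LTS. The node plan and the configuration each node requires are as follows: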
| Hostname | IP address | Role | Configuration required |
|---|---|---|---|
| ControlNodeA | 192.168.0.151 | Primary control-plane node (already configured) | Only runs the batch configuration scripts |
| ControlNodeB | 192.168.0.152 | Control-plane replica node | System initialization + containerd + admin kubeconfig |
| ControlNodeC | 192.168.0.153 | Control-plane replica node | System initialization + containerd + admin kubeconfig |
| WorkNodeA | 192.168.0.154 | Application worker node | System initialization + containerd |
| WorkNodeB | 192.168.0.155 | Application worker node | System initialization + containerd |
| DataMidNode | 192.168.0.156 | Middleware / data storage node | System initialization + containerd |
| DevOpsToolNode | 192.168.0.157 | Central operations jump host | System initialization + kubectl + ops kubeconfig |
| ObservabilityNode | 192.168.0.158 | Observability node | System initialization + containerd |
| DSDRNode | 192.168.0.159 | Disaster recovery / data persistence node | System initialization + containerd |

Core principle: the system parameters, kernel modules, and the containerd version and configuration must be identical on every node. This consistency is the foundation of a stable Kubernetes cluster.
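2. Unified Base System Configuration for All Nodes
Every node must complete all 5 of the following system configuration items; none may be skipped:
- Update the package sources and install the base dependencies
- Permanently disable swap (required by K8s)
- Load the overlay and br_netfilter kernel modules
- Enable iptables processing of bridged traffic and IP forwarding
- Set the correct hostname and add the matching hosts entries
2.1 Create the node inventory file
First, on ControlNodeA, create an inventory file that lists every node (IP address, hostname, role). The later batch operations all read it: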
```bash
cat > all_nodes.txt << 'EOF'
192.168.0.151 ControlNodeA control
192.168.0.152 ControlNodeB control
192.168.0.153 ControlNodeC control
192.168.0.154 WorkNodeA worker
192.168.0.155 WorkNodeB worker
192.168.0.156 DataMidNode worker
192.168.0.157 DevOpsToolNode ops
192.168.0.158 ObservabilityNode worker
192.168.0.159 DSDRNode worker
EOF
```
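Every later script trusts this file, so a quick sanity check before the first batch run can catch typos early. The following is a minimal sketch and not part of the original article: the function name `validate_inventory` is hypothetical, and the rule that a role must be `control`, `worker`, or `ops` is an assumption drawn from the inventory above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original article): sanity-check all_nodes.txt.
# Assumed line format: <IPv4> <hostname> <role>, with role in {control, worker, ops}.
validate_inventory() {
  local file="$1" ip hostname role extra lineno=0 errors=0
  while read -r ip hostname role extra; do
    lineno=$((lineno + 1))
    # Skip blank lines and comment lines, as the batch scripts do
    [[ -z "$ip" || "$ip" =~ ^# ]] && continue
    if [[ ! "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
      echo "line ${lineno}: bad IP '${ip}'" >&2
      errors=$((errors + 1))
    fi
    if [[ -z "$hostname" || -n "$extra" ]]; then
      echo "line ${lineno}: expected exactly 3 fields" >&2
      errors=$((errors + 1))
    fi
    case "$role" in
      control|worker|ops) ;;
      *) echo "line ${lineno}: unknown role '${role}'" >&2
         errors=$((errors + 1)) ;;
    esac
  done < "$file"
  # Succeed only when no problem was reported
  [ "$errors" -eq 0 ]
}
```

A possible call is `validate_inventory all_nodes.txt && echo "inventory looks consistent"`. The check is only syntactic: it does not confirm that the addresses are reachable.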
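2.2 Write the system initialization script
Create the init_node_system.sh script, which contains all of the system configuration steps. It is designed to be idempotent, so running it repeatedly does not cause errors: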
|
bash |
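A minimal sketch of what such a script could look like is given here, covering the five items from the list in section 2. It is an illustration under stated assumptions, not the author's script: the `ROOT` prefix (so the file edits can be rehearsed in a scratch directory), the function names, the base package list, and the file name `k8s.conf` are all hypothetical choices.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of init_node_system.sh; names and package list are assumptions.
set -euo pipefail
ROOT="${ROOT:-}"   # assumption: optional prefix for rehearsing the file edits

comment_out_swap() {
  # Comment only active swap entries, so a second run changes nothing
  sed -i -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/# /' "${ROOT}/etc/fstab"
}

write_module_list() {
  # Modules listed here are loaded again at every boot
  mkdir -p "${ROOT}/etc/modules-load.d"
  printf 'overlay\nbr_netfilter\n' > "${ROOT}/etc/modules-load.d/k8s.conf"
}

write_sysctl_conf() {
  mkdir -p "${ROOT}/etc/sysctl.d"
  cat > "${ROOT}/etc/sysctl.d/k8s.conf" << 'EOF'
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
}

add_hosts_entry() {
  # Append "<ip> <hostname>" only if that hostname is not listed yet
  local ip="$1" name="$2"
  grep -qw -- "$name" "${ROOT}/etc/hosts" || echo "${ip} ${name}" >> "${ROOT}/etc/hosts"
}

# Live steps need root and change the running system, so they run only
# when a hostname is passed as the first argument.
if [[ $# -ge 1 ]]; then
  apt-get update -y
  apt-get install -y curl ca-certificates   # assumed base packages
  swapoff -a
  comment_out_swap
  write_module_list
  modprobe overlay
  modprobe br_netfilter
  write_sysctl_conf
  sysctl --system > /dev/null
  hostnamectl set-hostname "$1"
fi
```

Each helper overwrites its target file or checks before appending, which is what makes a rerun harmless. Populating /etc/hosts for all nine nodes would be a loop over the inventory calling `add_hosts_entry`.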
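2.3 Run the system initialization in batch
After making the script executable, run it on every node that has not been configured yet:

```bash
chmod +x init_node_system.sh

while read -r ip hostname role; do
  # Skip blank lines and comment lines
  [[ -z "$ip" || "$ip" =~ ^# ]] && continue

  if [ "${hostname}" != "ControlNodeA" ]; then
    echo "====================================="
    echo "Initializing remotely: ${hostname} (${ip})"
    echo "====================================="
    # Copy the script to the target node
    scp init_node_system.sh "jack@${ip}:/home/jack/"
    # Run the script remotely, passing the hostname as its argument.
    # Important: -n keeps ssh from swallowing the input that `while read` is consuming.
    ssh -n "jack@${ip}" "sudo bash /home/jack/init_node_system.sh ${hostname}"
    echo "Node ${hostname} initialized."
    echo ""
  fi
done < all_nodes.txt
```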
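The reason for `ssh -n` in a `while read` loop can be shown without any remote host. In the illustration below, `cat` stands in for ssh, since both read standard input by default; the file and variable names are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Illustration only: `cat` stands in for ssh, because both read stdin by default.
nodes=$(mktemp)
printf 'node1\nnode2\nnode3\n' > "$nodes"

greedy=0
while read -r name; do
  greedy=$((greedy + 1))
  cat > /dev/null               # like plain ssh: drains the rest of the node list
done < "$nodes"

safe=0
while read -r name; do
  safe=$((safe + 1))
  cat < /dev/null > /dev/null   # like ssh -n: stdin comes from /dev/null instead
done < "$nodes"

echo "without redirect: ${greedy} iteration(s); with redirect: ${safe} iteration(s)"
rm -f "$nodes"
```

The first loop stops after one node because the inner command consumed the remaining lines; the second visits all three. A batch run that silently handles only the first node is the typical symptom of a missing `-n`.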

2.4 Batch-verify the system configuration
Create the verify_cluster.sh script and run it to check the system configuration of all 8 nodes in one pass:
```bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

echo "====================================="
echo "Batch verification of node system configuration"
echo "====================================="

if [ ! -f all_nodes.txt ]; then
  echo -e "${RED}Error: all_nodes.txt not found${NC}"
  exit 1
fi

total=0
success=0
failed=0

while read -r ip hostname role; do
  # Skip blank lines and comment lines
  [[ -z "$ip" || "$ip" =~ ^# ]] && continue

  if [ "${hostname}" != "ControlNodeA" ]; then
    total=$((total + 1))
    echo "-------------------------------------"
    echo -e "${YELLOW}Node: ${hostname} (${ip})${NC}"
    echo "-------------------------------------"

    # Test connectivity first
    if ping -c 1 -W 2 "${ip}" > /dev/null 2>&1; then
      # Connect as jack; privileged commands go through sudo.
      # No -n here: the heredoc is ssh's stdin, and -n would replace it with /dev/null.
      if ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "jack@${ip}" bash << 'EOF' 2>/dev/null; then
echo "Hostname: $(hostname)"
echo "Swap status: "
free -h | grep Swap
echo "Kernel modules: "
sudo lsmod | grep -E 'overlay|br_netfilter' 2>/dev/null || echo "  modules not loaded, or sudo unavailable"
echo "Network parameters: "
sudo sysctl net.bridge.bridge-nf-call-iptables net.ipv4.ip_forward 2>/dev/null || echo "  parameters unavailable, or sudo unavailable"
EOF
        success=$((success + 1))
        echo -e "${GREEN}✓ ${hostname} verified${NC}"
      else
        failed=$((failed + 1))
        echo -e "${RED}✗ ${hostname} SSH connection failed${NC}"
      fi
    else
      failed=$((failed + 1))
      echo -e "${RED}✗ ${hostname} does not answer ping${NC}"
    fi

    echo ""
  fi
done < all_nodes.txt

echo "====================================="
echo -e "Result: total ${total} | ${GREEN}passed ${success}${NC} | ${RED}failed ${failed}${NC}"
echo "====================================="

# Make the script executable and run it:
#   chmod +x verify_cluster.sh && ./verify_cluster.sh
```

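3. Unified Installation and Configuration of containerd v2.2.1 on All Nodes
Every node must run the same containerd version, with the following key configuration in place:
- Enable the SystemdCgroup driver (consistent with the cgroup manager used by K8s)
- Replace the pause image with the Alibaba Cloud mirror (avoids download timeouts inside mainland China)
- Enable the service at boot and verify its status
- Install the crictl tool for local container debugging
3.1 Write the containerd installation script
Create the install_containerd.sh script, which uses mirrors inside China to keep the installation fast: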
```bash
set -euo pipefail

CONTAINERD_VERSION="2.2.1"
PAUSE_IMAGE="registry.aliyuncs.com/google_containers/pause:3.9"
CRICTL_VERSION="v1.30.0"

echo "====================================="
echo "Installing containerd v${CONTAINERD_VERSION}"
echo "====================================="

# 1. Install containerd
# Note: apt installs whatever version the configured repositories provide;
# CONTAINERD_VERSION is only used in messages, so check the version printed in step 6.
echo "1. Installing containerd..."
sudo apt update -y
sudo apt install -y containerd

# 2. Generate the default configuration and adjust it
echo "2. Generating and tuning the configuration file..."
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml > /dev/null
# Enable the SystemdCgroup driver (required)
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
# Replace the pause image with the Alibaba Cloud mirror
sudo sed -i "s|sandbox = '.*'|sandbox = '${PAUSE_IMAGE}'|" /etc/containerd/config.toml

# 3. Restart the service and enable it at boot
echo "3. Starting the containerd service..."
sudo systemctl daemon-reload
sudo systemctl enable --now containerd
sudo systemctl restart containerd

# 4. Install the crictl tool
echo "4. Installing crictl ${CRICTL_VERSION}..."
sudo wget -q "https://ghproxy.net/https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-amd64.tar.gz" -O crictl.tar.gz
sudo tar zxvf crictl.tar.gz -C /usr/local/bin/
rm -f crictl.tar.gz

# 5. Point crictl at containerd
echo "5. Configuring the crictl connection..."
cat <<EOF | sudo tee /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
debug: false
EOF

# 6. Verify the installation
echo "====================================="
echo "containerd installed; verification:"
echo "====================================="
echo "containerd version: $(containerd --version | awk '{print $3}')"
echo "Service status: $(systemctl is-active containerd)"
echo "SystemdCgroup: $(grep SystemdCgroup /etc/containerd/config.toml | awk '{print $3}')"
# Match only the "sandbox = ..." key and accept either quote style around the value
echo "Pause image: $(grep -E '^[[:space:]]*sandbox = ' /etc/containerd/config.toml | awk -F"['\"]" '{print $2}')"
echo "crictl connection: $(crictl info > /dev/null && echo 'OK' || echo 'failed')"
echo "====================================="
echo "containerd installed and configured."
echo "====================================="
```
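Because `sed` exits successfully even when its pattern matches nothing, it can help to confirm that both edits really landed. The sketch below applies the same two substitutions to a small sample file and then checks the result. The function name `patch_containerd_config` is hypothetical, and the sample fragment is an assumed, shortened excerpt in the style of a containerd 2.x default configuration, not output captured from the lab.

```shell
#!/usr/bin/env bash
# Hypothetical check (not from the original article): apply the two sed edits
# and fail loudly if either one did not take effect.
patch_containerd_config() {
  local file="$1" pause_image="$2"
  sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' "$file"
  sed -i "s|sandbox = '.*'|sandbox = '${pause_image}'|" "$file"
  # sed returns 0 even without a match, so verify the outcome explicitly
  grep -q 'SystemdCgroup = true' "$file" &&
    grep -qF "sandbox = '${pause_image}'" "$file"
}

# Assumed sample in the style of a containerd 2.x default config (shortened)
sample=$(mktemp)
cat > "$sample" << 'EOF'
[plugins.'io.containerd.cri.v1.images'.pinned_images]
  sandbox = 'registry.k8s.io/pause:3.10'
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc.options]
  SystemdCgroup = false
EOF

if patch_containerd_config "$sample" "registry.aliyuncs.com/google_containers/pause:3.9"; then
  echo "both settings applied"
else
  echo "config layout differs from what the sed patterns expect" >&2
fi
```

A configuration in an older layout, where the key is written as `sandbox_image = "..."`, would make the function return non-zero, which is a sign that the installed containerd version is not the one the script was written for.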
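3.2 Batch-install and configure containerd
Before the batch installation, configure passwordless sudo by hand on each node. For every node that needs to be configured, run:

```bash
ssh jack@192.168.0.152

# On the remote node
echo 'jack ALL=(ALL) NOPASSWD: ALL' | sudo tee /etc/sudoers.d/jack-nopasswd
sudo chmod 440 /etc/sudoers.d/jack-nopasswd

# Verify
sudo -n true && echo "sudo configured" || echo "sudo configuration failed"

# Log out
exit
```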
Then run the batch installation:

```bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

# Path of the install script
INSTALL_SCRIPT="install_containerd.sh"
# Log directory (absolute path, so changing directories cannot break it)
LOG_DIR="${HOME}/install_log"

echo "====================================="
echo -e "${BLUE}Batch installation of containerd${NC}"
echo "====================================="

# Check that the install script exists
if [ ! -f "${INSTALL_SCRIPT}" ]; then
  echo -e "${RED}Error: ${INSTALL_SCRIPT} not found${NC}"
  exit 1
fi

# Check the node list
if [ ! -f all_nodes.txt ]; then
  echo -e "${RED}Error: all_nodes.txt not found${NC}"
  exit 1
fi

# Create the log directory, including parent directories
mkdir -p "${LOG_DIR}" || {
  echo -e "${RED}Error: cannot create log directory ${LOG_DIR}${NC}"
  exit 1
}

# Make sure the log directory is writable
if [ ! -w "${LOG_DIR}" ]; then
  echo -e "${RED}Error: log directory ${LOG_DIR} is not writable${NC}"
  exit 1
fi

# Counters
total=0
success=0
failed=0
declare -a failed_nodes

# Record the start time
start_time=$(date +%s)

while read -r ip hostname role; do
  # Skip blank lines and comments
  [[ -z "$ip" || "$ip" =~ ^# ]] && continue

  if [ "${hostname}" != "ControlNodeA" ]; then
    total=$((total + 1))
    # Build the log file name
    log_file="${LOG_DIR}/${hostname}_$(date +%Y%m%d_%H%M%S).log"

    # Create the log file up front
    touch "${log_file}" || {
      echo -e "${RED}✗ Cannot create log file: ${log_file}${NC}"
      failed=$((failed + 1))
      failed_nodes+=("${hostname} (log file creation failed)")
      continue
    }

    echo "====================================="
    echo -e "${YELLOW}[${total}] Installing containerd on: ${hostname} (${ip})${NC}"
    echo "====================================="

    # 1. Test connectivity
    if ! ping -c 1 -W 2 "${ip}" > /dev/null 2>&1; then
      echo -e "${RED}✗ ${hostname} does not answer ping${NC}" | tee -a "${log_file}"
      failed=$((failed + 1))
      failed_nodes+=("${hostname} (network unreachable)")
      continue
    fi

    # 2. Copy the install script
    echo -e "${BLUE}[1/2] Copying the install script...${NC}" | tee -a "${log_file}"

    # Make sure /home/jack exists on the remote host (quietly)
    ssh -n -o ConnectTimeout=5 "jack@${ip}" "mkdir -p /home/jack" >> "${log_file}" 2>&1

    if scp -o ConnectTimeout=5 "${INSTALL_SCRIPT}" "jack@${ip}:/home/jack/" >> "${log_file}" 2>&1; then
      echo -e "${GREEN}✓ Script copied${NC}" | tee -a "${log_file}"
    else
      echo -e "${RED}✗ Script copy failed${NC}" | tee -a "${log_file}"
      failed=$((failed + 1))
      failed_nodes+=("${hostname} (script copy failed)")
      continue
    fi

    # 3. Run the installation remotely, showing and logging the output with tee
    echo -e "${BLUE}[2/2] Running the install script...${NC}" | tee -a "${log_file}"
    echo -e "${BLUE}-------------------------------------${NC}" | tee -a "${log_file}"

    ssh -n -o ConnectTimeout=10 "jack@${ip}" "sudo bash /home/jack/install_containerd.sh" 2>&1 | tee -a "${log_file}"
    # A pipeline reports tee's status, so take ssh's own exit code from PIPESTATUS
    ssh_rc=${PIPESTATUS[0]}
    echo -e "${BLUE}-------------------------------------${NC}" | tee -a "${log_file}"

    if [ "${ssh_rc}" -eq 0 ]; then
      success=$((success + 1))
      echo -e "${GREEN}✓ ${hostname} containerd installed${NC}" | tee -a "${log_file}"
    else
      failed=$((failed + 1))
      failed_nodes+=("${hostname} (install failed, see log: ${log_file})")
      echo -e "${RED}✗ ${hostname} containerd installation failed${NC}" | tee -a "${log_file}"
      echo -e "${YELLOW}  Detailed log: ${log_file}${NC}"
    fi

    echo ""
  fi
done < all_nodes.txt

# Compute the elapsed time
end_time=$(date +%s)
duration=$((end_time - start_time))

# Print the summary
echo "====================================="
echo -e "${BLUE}Installation summary${NC}"
echo "====================================="
echo -e "Total nodes: ${total}"
echo -e "${GREEN}Succeeded: ${success}${NC}"
echo -e "${RED}Failed: ${failed}${NC}"
echo -e "Elapsed: ${duration} s"
echo ""

# List the failed nodes
if [ ${#failed_nodes[@]} -gt 0 ]; then
  echo -e "${RED}Failed nodes:${NC}"
  for node in "${failed_nodes[@]}"; do
    echo -e "  ${RED}✗${NC} ${node}"
  done
  echo ""
  echo -e "${YELLOW}Hint: see the detailed logs for more information${NC}"
  echo -e "Log directory: ${LOG_DIR}"
fi
echo "====================================="

# Make this script executable, then run it
```
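Any command piped through `tee` reports tee's exit status unless the script asks for the first command's status explicitly, which matters when a remote install must be counted as failed. A minimal illustration follows; `false` stands in for a failing remote command, and the variable names are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Illustration only: `false` stands in for a failing remote install.
log=$(mktemp)

false 2>&1 | tee -a "$log"
pipeline_rc=$?                 # status of tee, the last command in the pipeline

false 2>&1 | tee -a "$log"
first_rc=${PIPESTATUS[0]}      # status of the first command, read right away

echo "pipeline status: ${pipeline_rc}, first command status: ${first_rc}"
rm -f "$log"
```

`PIPESTATUS` is overwritten by the next command, so it has to be copied into a variable on the line immediately after the pipeline. `set -o pipefail` is an alternative when every pipeline in the script should behave this way.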

3.3 Batch-verify the containerd configuration

```bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

echo "====================================="
echo -e "${BLUE}Batch verification of containerd on all nodes${NC}"
echo "====================================="

total=0
success=0
failed=0

while read -r ip hostname role; do
  [[ -z "$ip" || "$ip" =~ ^# ]] && continue
  [ "${hostname}" == "ControlNodeA" ] && continue

  total=$((total + 1))

  echo "-------------------------------------"
  echo -e "${YELLOW}Node: ${hostname} (${ip})${NC}"
  echo "-------------------------------------"

  # Test connectivity
  if ! ping -c 1 -W 2 "${ip}" > /dev/null 2>&1; then
    echo -e "${RED}✗ Does not answer ping${NC}"
    failed=$((failed + 1))
    echo ""
    continue
  fi

  # Connect as jack and run the checks.
  # No -n here: the heredoc is ssh's stdin, and -n would replace it with /dev/null.
  ssh -o ConnectTimeout=5 "jack@${ip}" bash << 'EOF' 2>/dev/null
echo "containerd version: "
containerd --version 2>/dev/null || echo "  not installed"
echo "Service status: "
sudo systemctl is-active containerd 2>/dev/null || echo "  service not running"
sudo systemctl is-enabled containerd 2>/dev/null || echo "  not enabled at boot"
echo "Key settings: "
sudo grep -E 'SystemdCgroup|sandbox' /etc/containerd/config.toml 2>/dev/null || echo "  config file missing or not readable"
echo "crictl connection: "
if command -v crictl > /dev/null 2>&1; then
  sudo crictl info > /dev/null 2>&1 && echo "  connected" || echo "  connection failed"
else
  echo "  crictl not installed"
fi
EOF

  if [ $? -eq 0 ]; then
    echo -e "${GREEN}✓ ${hostname} verified${NC}"
    success=$((success + 1))
  else
    echo -e "${RED}✗ ${hostname} connection failed${NC}"
    failed=$((failed + 1))
  fi

  echo ""
done < all_nodes.txt

echo "====================================="
echo -e "${BLUE}Verification summary${NC}"
echo "====================================="
echo -e "Total: ${total} | ${GREEN}passed: ${success}${NC} | ${RED}failed: ${failed}${NC}"
echo "====================================="

# Make this script executable, then run it
```

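4. Tiered kubeconfig Permissions and Batch Distribution
With the base configuration complete on every node, the next task is to set up kubeconfig permissions and distribute the files. kubectl is installed only on the control-plane nodes and the operations jump host; worker nodes receive no cluster management tools.
4.1 Generate the per-role kubeconfig files on ControlNodeA
Step 1: Back up the original administrator configuration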
|
bash |
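Step 2: Generate the kubeconfig for the cluster operations role (cluster-ops)
Create a dedicated ServiceAccount for day-to-day operations, so that the super-administrator credentials are not used directly: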
```bash
echo "====================================="
echo "Creating the cluster-ops account and its kubeconfig"
echo "====================================="

# 1. Create the ServiceAccount
echo "1. Creating the ServiceAccount..."
kubectl create serviceaccount cluster-ops -n kube-system --dry-run=client -o yaml | kubectl apply -f -

# 2. Create the ClusterRoleBinding
# Note: binding to cluster-admin grants this account full cluster privileges
echo "2. Creating the ClusterRoleBinding..."
kubectl create clusterrolebinding cluster-ops-admin \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:cluster-ops \
  --dry-run=client -o yaml | kubectl apply -f -

# 3. Create the token Secret (written out with tee)
echo "3. Creating the token Secret..."
cat <<EOF | tee /tmp/cluster-ops-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-ops-token
  namespace: kube-system
  annotations:
    kubernetes.io/service-account.name: cluster-ops
type: kubernetes.io/service-account-token
EOF
kubectl apply -f /tmp/cluster-ops-secret.yaml
rm -f /tmp/cluster-ops-secret.yaml

# 4. Wait for the token to be populated
echo "4. Waiting for the token..."
for i in {1..10}; do
  OPS_TOKEN=$(kubectl get secret cluster-ops-token -n kube-system -o jsonpath='{.data.token}' 2>/dev/null | base64 --decode)
  if [ -n "$OPS_TOKEN" ]; then
    echo "✓ Token retrieved (attempt ${i})"
    break
  fi
  echo "  waiting... (${i}/10)"
  sleep 3
done

if [ -z "$OPS_TOKEN" ]; then
  echo "Error: could not retrieve the token; check the Secret"
  kubectl describe secret cluster-ops-token -n kube-system
  exit 1
fi

# 5. Get the CA certificate
echo "5. Getting the CA certificate..."
CA_CERT=$(kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')
if [ -z "$CA_CERT" ]; then
  echo "Error: could not retrieve the CA certificate"
  exit 1
fi
echo "✓ CA certificate retrieved"

# 6. Get the API server address
APISERVER=$(kubectl config view --raw -o jsonpath='{.clusters[0].cluster.server}')
echo "✓ API Server: ${APISERVER}"

# 7. Generate the kubeconfig file
echo "7. Generating the kubeconfig file..."
cat <<EOF | tee /home/jack/cluster-ops.kubeconfig > /dev/null
apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority-data: ${CA_CERT}
    server: ${APISERVER}
  name: k8s-cluster
contexts:
- context:
    cluster: k8s-cluster
    user: cluster-ops
  name: cluster-ops@k8s-cluster
current-context: cluster-ops@k8s-cluster
users:
- name: cluster-ops
  user:
    token: ${OPS_TOKEN}
EOF
# The file holds a bearer token, so restrict it to the owner
chmod 600 /home/jack/cluster-ops.kubeconfig
echo "✓ kubeconfig written: /home/jack/cluster-ops.kubeconfig"

# 8. Verify
echo ""
echo "====================================="
echo "Verifying the cluster-ops account"
echo "====================================="
echo "Cluster info:"
kubectl --kubeconfig=/home/jack/cluster-ops.kubeconfig cluster-info 2>&1 | head -3
echo ""
echo "Node list:"
kubectl --kubeconfig=/home/jack/cluster-ops.kubeconfig get nodes
echo ""
echo "====================================="
echo "Done."
echo "====================================="
echo "kubeconfig file: /home/jack/cluster-ops.kubeconfig"
echo ""
echo "Usage:"
echo "  kubectl --kubeconfig=/home/jack/cluster-ops.kubeconfig get nodes"
echo "  or"
echo "  export KUBECONFIG=/home/jack/cluster-ops.kubeconfig"
echo "====================================="
```
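Since the generated file is assembled from three shell variables and carries a bearer token, a quick offline check before distributing it can catch an empty field or loose file permissions. The following is a minimal sketch, not part of the original article; the function name `check_kubeconfig` is hypothetical, and the check inspects the text only, without contacting the API server.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original article): basic offline checks on a
# token-based kubeconfig before it is copied to other machines.
check_kubeconfig() {
  local file="$1" key mode
  for key in server certificate-authority-data token; do
    # Each key must be present with a non-empty value
    grep -Eq "^[[:space:]]*${key}:[[:space:]]*[^[:space:]]+" "$file" || {
      echo "missing or empty field: ${key}" >&2
      return 1
    }
  done
  mode=$(stat -c '%a' "$file")   # GNU stat, as shipped with Ubuntu
  [ "$mode" = "600" ] || {
    echo "mode is ${mode}, expected 600" >&2
    return 1
  }
}
```

A possible call is `check_kubeconfig /home/jack/cluster-ops.kubeconfig && echo "ready to distribute"`. A real connectivity test still needs `kubectl --kubeconfig=... get nodes`.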

Step 3: Prepare the administrator kubeconfig
The control-plane nodes need full cluster administration rights, so they use the original admin.conf directly:
|
bash |
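One way to produce the `./admin.kubeconfig` file that deploy_kubectl.sh looks for is sketched below. It is an assumption-laden illustration rather than the author's command: the function name `prepare_admin_kubeconfig` is hypothetical, and the source path in the usage note assumes the kubeadm default location `/etc/kubernetes/admin.conf`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: copy an admin kubeconfig to a working file that only
# the owner can read.
prepare_admin_kubeconfig() {
  local src="$1" dst="$2"
  [ -r "$src" ] || { echo "cannot read ${src}" >&2; return 1; }
  # install copies the file and sets the mode in a single step
  install -m 600 "$src" "$dst"
}
```

On ControlNodeA the source file is normally readable by root only, so the equivalent one-liner there would be `sudo install -m 600 -o jack -g jack /etc/kubernetes/admin.conf ./admin.kubeconfig`.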
4.2 Write the kubectl and kubeconfig batch distribution script
First, force-disable and remove swap on every node:
```bash
#!/bin/bash
# force_disable_swap_v2.sh

echo "====================================="
echo "Force-disabling swap on all nodes"
echo "====================================="

while read -r ip hostname role; do
  [[ -z "$ip" || "$ip" =~ ^# ]] && continue
  [ "${hostname}" == "ControlNodeA" ] && continue

  echo "-------------------------------------"
  echo "Processing: ${hostname} (${ip})"

  # The heredoc is ssh's stdin, so neither -n nor -t is used here
  ssh "jack@${ip}" bash << 'EOF'
echo "=== Before ==="
free | grep Swap
sudo swapon --show

echo ""
echo "=== Turning swap off ==="
sudo swapoff -a -v

echo ""
echo "=== Commenting out fstab entries ==="
sudo cp /etc/fstab /etc/fstab.bak
# Comment only active swap lines, so a rerun does not stack extra '#' characters
sudo sed -i -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/# /' /etc/fstab
echo "fstab swap lines:"
grep swap /etc/fstab

echo ""
echo "=== Deleting swap files ==="
sudo rm -f /swap.img /swapfile

echo ""
echo "=== After ==="
free | grep Swap
sudo swapon --show 2>&1 || echo "no swap devices"
EOF

  echo ""
  echo "✓ ${hostname} processed"
  echo ""
done < all_nodes.txt

echo "====================================="
echo "Done. Verification:"
echo "====================================="

# Verify every node
while read -r ip hostname role; do
  [[ -z "$ip" || "$ip" =~ ^# ]] && continue
  [ "${hostname}" == "ControlNodeA" ] && continue

  result=$(ssh -n "jack@${ip}" "free | grep Swap | awk '{print \$2}'")
  # String comparison, so an empty result from a failed connection is not an error
  if [ "${result:-unknown}" = "0" ]; then
    echo "✓ ${hostname}: swap fully disabled"
  else
    echo "✗ ${hostname}: swap still reports ${result:-unknown} KB"
  fi
done < all_nodes.txt

# To run this script:
#   chmod +x force_disable_swap_v2.sh
#   ./force_disable_swap_v2.sh
```
Create the deploy_kubectl.sh script, which distributes the matching configuration according to each node's role:

```bash
#!/bin/bash

# ============================================
# Color definitions
# ============================================
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

# ============================================
# Variables
# ============================================
KUBECTL_BIN="/usr/bin/kubectl"
ADMIN_KUBECONFIG="./admin.kubeconfig"
OPS_KUBECONFIG="./cluster-ops.kubeconfig"
ALL_NODES="./all_nodes.txt"
SSH_USER="jack"

echo "====================================="
echo -e "${BLUE}Deploying kubectl and kubeconfig${NC}"
echo "====================================="
echo ""

# Check the required files
for file in "${KUBECTL_BIN}" "${ADMIN_KUBECONFIG}" "${OPS_KUBECONFIG}" "${ALL_NODES}"; do
  if [ ! -f "${file}" ]; then
    echo -e "${RED}Error: file ${file} not found${NC}"
    exit 1
  fi
  echo -e "${GREEN}✓${NC} ${file}"
done
echo ""

while read -r ip hostname role; do
  [[ -z "$ip" || "$ip" =~ ^# ]] && continue
  [ "${hostname}" == "ControlNodeA" ] && continue   # already set up locally

  echo "====================================="
  echo -e "${YELLOW}Node: ${hostname} (${ip}) role: ${role}${NC}"
  echo "====================================="

  # ============================================
  # Worker nodes: install nothing
  # ============================================
  if [ "${role}" == "worker" ]; then
    echo -e "${YELLOW}Skipped (worker node, no cluster management tools)${NC}"
    echo ""
    continue
  fi

  # ============================================
  # control / ops nodes: install kubectl
  # ============================================
  echo -e "${BLUE}[1/3] Installing kubectl...${NC}"
  if ! scp "${KUBECTL_BIN}" "${SSH_USER}@${ip}:/home/${SSH_USER}/kubectl" 2>/dev/null; then
    echo -e "${RED}✗ kubectl copy failed${NC}"
    continue
  fi
  ssh -n "${SSH_USER}@${ip}" "sudo mv /home/${SSH_USER}/kubectl /usr/local/bin/kubectl && sudo chmod +x /usr/local/bin/kubectl"
  echo -e "${GREEN}✓ kubectl installed${NC}"

  # ============================================
  # Create the .kube directory
  # ============================================
  echo -e "${BLUE}[2/3] Distributing kubeconfig...${NC}"
  ssh -n "${SSH_USER}@${ip}" "mkdir -p /home/${SSH_USER}/.kube && chmod 700 /home/${SSH_USER}/.kube"

  # control nodes get admin.kubeconfig, ops nodes get cluster-ops.kubeconfig
  if [ "${role}" == "control" ]; then
    scp "${ADMIN_KUBECONFIG}" "${SSH_USER}@${ip}:/home/${SSH_USER}/.kube/config" 2>/dev/null
    echo -e "  → distributed ${GREEN}admin${NC} kubeconfig"
  elif [ "${role}" == "ops" ]; then
    scp "${OPS_KUBECONFIG}" "${SSH_USER}@${ip}:/home/${SSH_USER}/.kube/config" 2>/dev/null
    echo -e "  → distributed ${YELLOW}ops${NC} kubeconfig"
  fi

  # ============================================
  # Set permissions and shell completion
  # ============================================
  echo -e "${BLUE}[3/3] Configuring the environment...${NC}"
  # No -n here: the heredoc is ssh's stdin, and -n would replace it with /dev/null
  ssh "${SSH_USER}@${ip}" bash << 'EOF'
chmod 600 /home/jack/.kube/config
kubectl completion bash > /home/jack/.kube/completion.bash
# Match the line that is actually appended, so reruns do not add it again
if ! grep -q 'completion.bash' /home/jack/.bashrc; then
  echo 'source /home/jack/.kube/completion.bash' >> /home/jack/.bashrc
fi
echo "✓ environment configured"
EOF

  echo -e "${GREEN}✓ ${hostname} deployed${NC}"
  echo ""
done < "${ALL_NODES}"

echo "====================================="
echo -e "${GREEN}Deployment finished.${NC}"
echo "====================================="
echo ""
echo "Permission summary:"
echo "  Control-plane nodes (ControlNodeB, ControlNodeC) → admin.kubeconfig (cluster administrator)"
echo "  Operations jump host (DevOpsToolNode)            → cluster-ops.kubeconfig (operations)"
echo "  Worker nodes (the other 5)                       → kubectl not installed"
echo ""
echo "Verification command:"
echo "  ./verify_kubectl.sh"
echo "====================================="
```
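The role-to-kubeconfig rule is the heart of the distribution logic, and keeping it in one small function makes it easy to test and to extend with new roles. The sketch below is a hypothetical refactoring, not the author's code; the function name `kubeconfig_for_role` and its return codes are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical refactoring sketch: keep the role-to-kubeconfig rule in one place.
kubeconfig_for_role() {
  case "$1" in
    control) echo "./admin.kubeconfig" ;;
    ops)     echo "./cluster-ops.kubeconfig" ;;
    worker)  return 1 ;;   # workers get no kubectl and no kubeconfig
    *)       echo "unknown role: $1" >&2
             return 2 ;;
  esac
}

# Example: decide what each role from the inventory would receive
for role in control ops worker; do
  if cfg=$(kubeconfig_for_role "$role"); then
    echo "${role}: distribute ${cfg}"
  else
    echo "${role}: nothing to distribute"
  fi
done
```

An unknown role returns a distinct code (2), so a typo in the inventory can be reported as an error instead of being treated like a worker node.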
4.3 Run the batch distribution
|
bash |

Create verify_kubectl.sh (the verification script):

```bash
#!/bin/bash

echo "====================================="
echo "kubectl deployment verification"
echo "====================================="
printf "%-20s %-10s %-15s %-25s\n" "Hostname" "Role" "kubectl" "kubeconfig"
echo "-------------------------------------------------------------------------"

while read -r ip hostname role; do
  [[ -z "$ip" || "$ip" =~ ^# ]] && continue

  # Local node: ControlNodeA
  if [ "${hostname}" == "ControlNodeA" ]; then
    if command -v kubectl &>/dev/null; then
      ver=$(kubectl version --client -o json 2>/dev/null | grep gitVersion | awk -F'"' '{print $4}')
      [ -z "$ver" ] && ver="installed"
    else
      ver="not installed"
    fi
    printf "%-20s %-10s %-15s %-25s\n" "${hostname}" "control" "${ver}" "admin"
    continue
  fi

  # Worker nodes
  if [ "${role}" == "worker" ]; then
    printf "%-20s %-10s %-15s %-25s\n" "${hostname}" "worker" "-" "-"
    continue
  fi

  # control / ops nodes
  result=$(ssh -n -o ConnectTimeout=5 -o BatchMode=yes "jack@${ip}" '
    if command -v kubectl >/dev/null 2>&1; then
      ver=$(kubectl version --client -o json 2>/dev/null | grep gitVersion | awk -F"\"" "{print \$4}")
      [ -n "$ver" ] && echo "kubectl:$ver" || echo "kubectl:installed"
    else
      echo "kubectl:not installed"
    fi

    if [ -f /home/jack/.kube/config ]; then
      # "working" means the API answered; "present" means only the file exists
      kubectl get nodes >/dev/null 2>&1 && echo "config:working" || echo "config:present"
    else
      echo "config:none"
    fi
  ' 2>/dev/null)

  if [ -z "$result" ]; then
    printf "%-20s %-10s %-15s %-25s\n" "${hostname}" "${role}" "unreachable" "unreachable"
    continue
  fi

  kubectl_ver=$(echo "$result" | grep "kubectl:" | cut -d: -f2-)
  config_type=$(echo "$result" | grep "config:" | cut -d: -f2-)

  printf "%-20s %-10s %-15s %-25s\n" "${hostname}" "${role}" "${kubectl_ver:-unknown}" "${config_type:-unknown}"

done < all_nodes.txt

echo ""
echo "====================================="
echo "Verification finished."
echo "====================================="

# Make the script executable and run it:
#   chmod +x verify_kubectl.sh && ./verify_kubectl.sh
```

5. Final Acceptance Check of All Nodes
Run the following to perform the final status check on all 9 nodes in one pass and confirm that every setting meets the requirements:

```bash
echo "====================================="
echo "Final node status acceptance report"
echo "====================================="
echo ""
printf "%-20s %-15s %-10s %-12s %-12s %-10s\n" "Hostname" "IP address" "Role" "Swap" "containerd" "kubectl"
echo "--------------------------------------------------------------------------------"

while read -r ip hostname role; do
  [[ -z "$ip" || "$ip" =~ ^# ]] && continue

  if [ "${hostname}" == "ControlNodeA" ]; then
    swap_kb=$(free | grep Swap | awk '{print $2}')
    [ "$swap_kb" -eq 0 ] && swap_status="disabled" || swap_status="$swap_kb KB"
    containerd_status=$(systemctl is-active containerd 2>/dev/null || echo "not running")
    kubectl_status=$(command -v kubectl &>/dev/null && kubectl get nodes &>/dev/null && echo "OK" || echo "unavailable")
  else
    # One SSH call per check (simple and reliable)
    swap_kb=$(ssh -n -o ConnectTimeout=5 "jack@${ip}" "free | grep Swap | awk '{print \$2}'" 2>/dev/null)
    containerd_val=$(ssh -n -o ConnectTimeout=5 "jack@${ip}" "sudo systemctl is-active containerd 2>/dev/null" 2>/dev/null)
    # "unavailable" covers both a missing kubectl (worker nodes) and a failed API call
    kubectl_val=$(ssh -n -o ConnectTimeout=5 "jack@${ip}" "command -v kubectl >/dev/null 2>&1 && kubectl get nodes >/dev/null 2>&1 && echo 'OK' || echo 'unavailable'" 2>/dev/null)

    if [ -n "$swap_kb" ]; then
      [ "$swap_kb" -eq 0 ] && swap_status="disabled" || swap_status="$swap_kb KB"
    else
      swap_status="unreachable"
    fi

    containerd_status="${containerd_val:-unreachable}"
    kubectl_status="${kubectl_val:-unreachable}"
  fi

  printf "%-20s %-15s %-10s %-12s %-12s %-10s\n" \
    "${hostname}" "${ip}" "${role}" "${swap_status}" "${containerd_status}" "${kubectl_status}"

done < all_nodes.txt

echo ""
echo "================================================================================"
echo "Acceptance check finished."
echo "================================================================================"

# Make this script executable, then run it
```
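The swap check appears several times in this article as `free | grep Swap | awk '{print $2}'` followed by a numeric comparison. A small helper that reads `free` output on standard input and prints a status string keeps that logic in one place and copes with empty input from a failed connection. This is a sketch with a hypothetical function name, `swap_status`, not code from the original scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn `free` output (read from stdin) into a status string.
swap_status() {
  local total
  # Field 2 of the "Swap:" row is the total swap size in KiB
  total=$(awk '/^Swap:/ {print $2}')
  if [ -z "$total" ]; then
    echo "unknown"
  elif [ "$total" -eq 0 ]; then
    echo "disabled"
  else
    echo "${total} KB"
  fi
}
```

Locally it could be used as `free | swap_status`, and for a remote node as `ssh -n jack@192.168.0.154 free | swap_status`.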

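6. Quick Troubleshooting for Common Problems

| Symptom | Likely cause | Fix |
|---|---|---|
| Swap still shows capacity | /etc/fstab contains more than one swap entry | `sudo sed -i '/swap/d' /etc/fstab && sudo swapoff -a` |
| containerd service fails to start | Malformed configuration file | Regenerate the default configuration: `sudo containerd config default \| sudo tee /etc/containerd/config.toml` |
| crictl cannot connect | Socket file permissions | `sudo chmod 666 /run/containerd/containerd.sock` (or run crictl with sudo) |
| kubectl times out connecting to the cluster | Firewall blocks port 6443 | On ControlNodeA, run: `sudo ufw allow 6443/tcp` |
| kubeconfig permission error | File permissions are too open | `chmod 600 ~/.kube/config` |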
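Summary
Building on the existing passwordless SSH setup, this article used automation scripts to complete the system initialization, the unified containerd installation, and the batch kubeconfig distribution for 8 nodes in one pass. All 9 nodes now meet every prerequisite for joining a Kubernetes cluster, which prepares the ground for the next step: expanding to a highly available cluster with 3 control-plane nodes and 6 worker nodes.
This article is part of the "Building an Enterprise-Grade DevOps Simulation Lab" series. All content is based on a real hardware environment (32 cores / 64 threads, 128 GB of memory, 6 TB of disk) and aims to stay close to real enterprise deployment scenarios.
DevOps and SRE enthusiasts are welcome to leave comments, discuss, and learn from one another.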