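Building a K8s Cluster from Scratch + Prometheus Monitoring + Harbor Private Registry + DingTalk Alerts (RHEL 9 Hands-On)
This post records the full process of building a K8s cluster, a Prometheus monitoring stack, and a Harbor private image registry from scratch on RHEL 9. I hit quite a few pitfalls along the way, and all of them are written down here.
Environment
| Hostname | IP | Role | OS |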
|---|---|---|---|
| Ubuntu | 192.168.137.128 | Prometheus + Grafana + Alertmanager | Ubuntu 22.04 |
| k8s-master | 192.168.137.100 | K8s Master | RHEL 9.5 |
| k8s-node1 | 192.168.137.101 | K8s Node | RHEL 9.5 |
| k8s-node2 | 192.168.137.102 | K8s Node | RHEL 9.5 |
| k8s-node3 | 192.168.137.103 | K8s Node | RHEL 9.5 |
| k8s-devops | 192.168.137.104 | Harbor image registry | RHEL 9.5 |
Overall Architecture
Ubuntu monitoring host
  Prometheus(:9090) ──→ Grafana(:3000)
  Alertmanager(:9093) ──→ DingTalk group
│
│ scrape metrics
▼
K8s cluster (master + 3 nodes)
  nginx microservice (3 replicas)
  Flannel network
│
│ pull images
▼
Harbor private registry (devops)
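1. RHEL 9 Base Environment Setup
Without a registered subscription, RHEL 9.5 has no usable yum repositories, so that is the first problem to solve.
1.1 Configure the CentOS Stream 9 Aliyun mirror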
rm -f /etc/yum.repos.d/*.repo
cat > /etc/yum.repos.d/centos.repo << 'EOF'
[baseos]
name=CentOS Stream 9 - BaseOS
baseurl=https://mirrors.aliyun.com/centos-stream/9-stream/BaseOS/x86_64/os/
gpgcheck=0
enabled=1
[appstream]
name=CentOS Stream 9 - AppStream
baseurl=https://mirrors.aliyun.com/centos-stream/9-stream/AppStream/x86_64/os/
gpgcheck=0
enabled=1
EOF
yum clean all && yum makecache
1.2 Disable the firewall, SELinux, and swap
systemctl stop firewalld && systemctl disable firewalld
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
swapoff -a
sed -i '/swap/d' /etc/fstab
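The sed command above deletes every fstab line that contains the word swap. A more conservative variant, sketched here as an alternative and not as what the original setup used, comments those lines out and keeps a backup. The function name comment_out_swap is hypothetical, and GNU sed is assumed.

```shell
# comment_out_swap [fstab-path]
# Comments out uncommented fstab entries whose fields include "swap",
# keeping the untouched original next to it as <file>.bak.
comment_out_swap() {
  local fstab="${1:-/etc/fstab}"
  sed -i.bak -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$fstab"
}
```

Running comment_out_swap with no argument edits /etc/fstab; passing a path lets you try it on a copy first.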
1.3 Configure hosts
cat >> /etc/hosts << 'EOF'
192.168.137.100 k8s-master
192.168.137.101 k8s-node1
192.168.137.102 k8s-node2
192.168.137.103 k8s-node3
192.168.137.104 k8s-devops
EOF
1.4 Load kernel modules and set kernel parameters
cat > /etc/modules-load.d/k8s.conf << 'EOF'
overlay
br_netfilter
EOF
modprobe overlay && modprobe br_netfilter
cat > /etc/sysctl.d/k8s.conf << 'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sysctl --system
1.5 Install Docker
yum install -y yum-utils
yum-config-manager --add-repo https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
yum install -y docker-ce docker-ce-cli containerd.io
mkdir -p /etc/docker
cat > /etc/docker/daemon.json << 'EOF'
{
  "registry-mirrors": ["https://mirror.ccs.tencentyun.com"],
  "exec-opts": ["native.cgroupdriver=systemd"],
  "storage-driver": "overlay2"
}
EOF
systemctl daemon-reload && systemctl enable docker && systemctl restart docker

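📸 Screenshot: docker version output
All of the steps above must be run on every K8s node. I wrote a shell script that runs them in batch over SSH, which takes care of all 5 machines in one go.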
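The batch script itself is not shown in the post, so the following is only a minimal sketch of what such a loop could look like. It assumes key-based SSH login as root and the hostnames from the hosts file in 1.3. The names HOSTS, run_on_all, and DRY_RUN are hypothetical.

```shell
# Hosts from the environment table (assumed to resolve via /etc/hosts).
HOSTS="k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-devops"

# run_on_all <local-script>
# Feeds a local script to bash on each host over SSH.
# With DRY_RUN=1 it only prints the command it would run.
run_on_all() {
  local script="$1" host
  for host in $HOSTS; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ssh root@${host} bash -s < ${script}"
    else
      # BatchMode makes ssh fail instead of prompting for a password.
      ssh -o BatchMode=yes "root@${host}" bash -s < "$script" \
        || echo "FAILED on ${host}" >&2
    fi
  done
}
```

A dry run such as DRY_RUN=1 run_on_all base-setup.sh (the script name is a placeholder) lists the five SSH commands without touching any machine.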
2. Build the K8s Cluster
2.1 Install kubeadm (master + 3 nodes)
cat > /etc/yum.repos.d/kubernetes.repo << 'EOF'
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes-new/core/stable/v1.30/rpm/
gpgcheck=0
enabled=1
EOF
yum makecache
yum install -y kubelet kubeadm kubectl
2.2 Configure containerd
This step is critical: without it, kubeadm init fails with "required cgroups disabled".
containerd config default > /etc/containerd/config.toml
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sed -i 's|sandbox_image = "registry.k8s.io/pause:3.10.1"|sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.10"|' /etc/containerd/config.toml
systemctl restart containerd
systemctl enable kubelet
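Pull the pause image ahead of time and tag it under the original name, so that anything still asking for registry.k8s.io/pause:3.10.1 finds it locally: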
crictl pull registry.aliyuncs.com/google_containers/pause:3.10
ctr -n k8s.io images tag registry.aliyuncs.com/google_containers/pause:3.10 registry.k8s.io/pause:3.10.1
2.3 Initialize the master
kubeadm init \
--apiserver-advertise-address=192.168.137.100 \
--image-repository=registry.aliyuncs.com/google_containers \
--kubernetes-version=v1.30.14 \
--service-cidr=10.96.0.0/12 \
--pod-network-cidr=10.244.0.0/16
Configure kubectl:
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
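2.4 Install the Flannel network plugin
The --pod-network-cidr=10.244.0.0/16 passed to kubeadm init in 2.3 matches the default network in kube-flannel.yml, so the manifest can be applied unmodified.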
kubectl apply -f https://cdn.jsdelivr.net/gh/flannel-io/flannel@master/Documentation/kube-flannel.yml

📸 Screenshot: kubectl get nodes shows the master as Ready
2.5 Join the nodes to the cluster
On each node, first configure containerd and the pause image (same as 2.2), then run:
kubeadm join 192.168.137.100:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
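The <token> and <hash> placeholders come from the output of kubeadm init. As a small sanity check before pasting them onto three nodes, the sketch below validates their shape (kubeadm tokens look like 6 characters, a dot, then 16 characters; the hash is 64 hex digits) and prints the finished command. The helper name print_join_command is hypothetical, and the default endpoint is the master address from the environment table.

```shell
# print_join_command <token> <ca-cert-hash-without-sha256-prefix> [endpoint]
# Validates the format of both values and prints the join command.
print_join_command() {
  local token="$1" hash="$2" endpoint="${3:-192.168.137.100:6443}"
  if ! printf '%s' "$token" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'; then
    echo "invalid token format" >&2
    return 1
  fi
  if ! printf '%s' "$hash" | grep -Eq '^[0-9a-f]{64}$'; then
    echo "invalid CA cert hash" >&2
    return 1
  fi
  echo "kubeadm join ${endpoint} --token ${token} --discovery-token-ca-cert-hash sha256:${hash}"
}
```

If the original token has expired or the init output is gone, running kubeadm token create --print-join-command on the master prints a fresh, complete join command.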

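📸 Screenshot: kubectl get nodes shows all 4 nodes as Ready
3. Deploy the Microservice Application
The image used here is served from Harbor, so the pull only succeeds once section 4 is finished (Harbor running, image pushed, containerd configured on the nodes).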
kubectl create namespace demo-app
kubectl create deployment nginx-web --image=192.168.137.104/library/nginx:1.25 --replicas=3 -n demo-app
kubectl expose deployment nginx-web --type=NodePort --port=80 --target-port=80 -n demo-app
kubectl get pods -n demo-app -o wide
kubectl get svc -n demo-app

📸 Screenshot: all 3 Pods Running

📸 Screenshot: browser showing the Nginx welcome page
4. Set Up the Harbor Private Image Registry
4.1 Install Docker Compose
curl -L "https://ghfast.top/https://github.com/docker/compose/releases/download/v2.27.0/docker-compose-linux-x86_64" -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
4.2 Download and install Harbor
cd /opt
wget https://ghfast.top/https://github.com/goharbor/harbor/releases/download/v2.11.0/harbor-offline-installer-v2.11.0.tgz
tar xf harbor-offline-installer-v2.11.0.tgz
cd harbor
cp harbor.yml.tmpl harbor.yml
sed -i 's/hostname: reg.mydomain.com/hostname: 192.168.137.104/' harbor.yml
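# Harbor is served over plain HTTP here, so comment out the whole https block (its keys are indented two spaces in harbor.yml.tmpl)
sed -i 's/^https:/#https:/' harbor.yml
sed -i 's/^  port: 443/#  port: 443/' harbor.yml
sed -i 's/^  certificate:/#  certificate:/' harbor.yml
sed -i 's/^  private_key:/#  private_key:/' harbor.yml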
./install.sh

📸 Screenshot: install.sh finished, all containers Started
4.3 Push an image to Harbor
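# Make sure the source image exists locally before tagging it
docker pull nginx:1.25
# Harbor12345 is Harbor's default admin password; change it after the first login
docker login 192.168.137.104 -u admin -p Harbor12345
docker tag nginx:1.25 192.168.137.104/library/nginx:1.25
docker push 192.168.137.104/library/nginx:1.25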
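Harbor was installed with HTTPS commented out, and Docker talks HTTPS to registries by default, so the login and push above usually fail unless the Docker daemon on the pushing host lists Harbor as an insecure registry. The daemon.json from 1.5 does not include that key. The sketch below rewrites the file with the same settings plus insecure-registries. The function name write_daemon_json is hypothetical, and overwriting the whole file is an assumption that nothing else was added to it since 1.5.

```shell
# write_daemon_json [path] [registry]
# Writes the daemon.json from section 1.5 with Harbor added as an
# insecure (plain HTTP) registry.
write_daemon_json() {
  local path="${1:-/etc/docker/daemon.json}" registry="${2:-192.168.137.104}"
  cat > "$path" << EOF
{
  "registry-mirrors": ["https://mirror.ccs.tencentyun.com"],
  "exec-opts": ["native.cgroupdriver=systemd"],
  "storage-driver": "overlay2",
  "insecure-registries": ["${registry}"]
}
EOF
}
```

After writing the file, restart Docker with systemctl restart docker. On k8s-devops that restart also stops the Harbor containers, so bring them back with docker-compose up -d from /opt/harbor, as noted in the pitfalls table.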

📸 Screenshot: Harbor web UI showing the nginx image
4.4 Connect K8s to Harbor
Configure containerd on every K8s node to trust Harbor:
mkdir -p /etc/containerd/certs.d/192.168.137.104
cat > /etc/containerd/certs.d/192.168.137.104/hosts.toml << 'EOF'
server = "http://192.168.137.104"
[host."http://192.168.137.104"]
capabilities = ["pull", "resolve"]
skip_verify = true
EOF
systemctl restart containerd
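containerd only reads hosts.toml files when the registry config_path in /etc/containerd/config.toml points at the certs.d directory. In a file generated by containerd config default this key is often an empty string, although some versions already set it, so check with grep first. The sketch below fills it in only when it is empty. The function name enable_certs_d is hypothetical, and GNU sed is assumed.

```shell
# enable_certs_d [config.toml-path]
# Points an empty registry config_path at /etc/containerd/certs.d,
# keeps a .bak copy, then prints the resulting line(s) for review.
enable_certs_d() {
  local cfg="${1:-/etc/containerd/config.toml}"
  sed -i.bak -E \
    's|^([[:space:]]*)config_path = ""|\1config_path = "/etc/containerd/certs.d"|' \
    "$cfg"
  grep -n 'config_path' "$cfg"
}
```

Run it before the systemctl restart containerd step. If grep shows config_path already set to a non-empty value, the file is left unchanged.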
Update the Deployment:
kubectl set image deployment/nginx-web nginx=192.168.137.104/library/nginx:1.25 -n demo-app
5. Prometheus Monitoring Stack
5.1 Core configuration: prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'ubuntu-monitor'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'redhat-servers'
    static_configs:
      - targets:
          - '192.168.137.100:9100'
          - '192.168.137.101:9100'
          - '192.168.137.102:9100'
          - '192.168.137.103:9100'
          - '192.168.137.104:9100'
  - job_name: 'k8s-nodes'
    static_configs:
      - targets:
          - '192.168.137.100:10250'
          - '192.168.137.101:10250'
          - '192.168.137.102:10250'
          - '192.168.137.103:10250'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /etc/prometheus/k8s-token
  - job_name: 'k8s-apiserver'
    static_configs:
      - targets: ['192.168.137.100:6443']
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /etc/prometheus/k8s-token
5.2 Alert rules
groups:
  - name: host_alerts
    rules:
      - alert: HostDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
      - alert: HighCPU
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} CPU usage above 85%"
      - alert: HighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} memory usage above 90%"
      - alert: DiskFull
        expr: (1 - node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} disk usage above 85%"
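The HighMemory expression is (1 - MemAvailable / MemTotal) * 100, and node_exporter takes both values from /proc/meminfo. To sanity-check the 90% threshold against what a host reports, the same arithmetic can be done locally. This is only an illustrative sketch; the function name mem_used_percent is hypothetical.

```shell
# mem_used_percent [meminfo-path]
# Computes (1 - MemAvailable / MemTotal) * 100 from a meminfo-style
# file, mirroring the HighMemory alert expression.
mem_used_percent() {
  local meminfo="${1:-/proc/meminfo}"
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) { exit 1 }
      printf "%.1f\n", (1 - avail / total) * 100
    }
  ' "$meminfo"
}
```

For example, a host with MemTotal 1000 kB and MemAvailable 250 kB yields 75.0, which is below the alert threshold.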
5.3 Grafana setup
apt-get install -y musl
wget https://mirrors.tuna.tsinghua.edu.cn/grafana/apt/pool/main/g/grafana/grafana_11.1.0_amd64.deb
dpkg -i grafana_11.1.0_amd64.deb
systemctl enable grafana-server && systemctl start grafana-server
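- Open http://192.168.137.128:3000 in a browser and log in with admin/admin
- Connections → Data Sources → Add → Prometheus → set the URL to http://localhost:9090 → Save & Test
- Dashboards → Import → enter 1860 (the Node Exporter Full dashboard) → Load → Import

📸 Screenshot: Prometheus Targets page with every node UP

📸 Screenshot: Grafana dashboard showing K8s node metrics
5.4 DingTalk alerts
Alertmanager configuration: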
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'dingtalk'
receivers:
  - name: 'dingtalk'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/webhook1/send'
        send_resolved: true
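To check the Alertmanager → webhook → DingTalk path without waiting for a real outage, one option is to post a synthetic alert to Alertmanager's v2 API. The sketch below assumes Alertmanager listens on localhost:9093, as in the architecture overview. The alert name TestAlert, the instance label manual-test, and both function names are made up for illustration.

```shell
# build_test_alert [alertname] [instance]
# Prints a minimal Alertmanager v2 payload: a JSON array with one alert.
build_test_alert() {
  local name="${1:-TestAlert}" instance="${2:-manual-test}"
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"warning"},"annotations":{"summary":"manual test alert"}}]\n' \
    "$name" "$instance"
}

# send_test_alert [alertmanager-url] [alertname]
# Posts the payload to Alertmanager; requires curl.
send_test_alert() {
  local am_url="${1:-http://localhost:9093}"
  build_test_alert "${2:-TestAlert}" \
    | curl -sS -X POST -H 'Content-Type: application/json' \
        --data @- "${am_url}/api/v2/alerts"
}
```

With the route above, the message should reach the group roughly 30 seconds after send_test_alert runs, because group_wait is 30s.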

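📸 Screenshot: the DingTalk group receiving a HostDown alert
6. Pitfalls
| Problem | Cause | Fix |
|---|---|---|
| RHEL 9 has no yum repositories | Subscription not registered | Configure the CentOS Stream 9 Aliyun mirror |
| kubeadm init reports cgroups disabled | SystemdCgroup not configured in containerd | Set it to true in config.toml |
| pause image pull fails | registry.k8s.io is unreachable | Pull from Aliyun, then rename with ctr images tag |
| Docker Hub is blocked | Network restrictions in mainland China | Pull on Ubuntu through a registry mirror → docker save → scp → ctr import |
| Harbor services are down after a restart | Restarting Docker stops all containers | docker-compose up -d |
| DNS resolution fails | resolv.conf is misconfigured | Use Aliyun DNS 223.5.5.5 |
| Scripts written on Windows fail | \r\n line endings | Convert with dos2unix |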
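For the line-ending problem in the last row, the usual symptom is an error like "/bin/bash^M: bad interpreter". When dos2unix is not installed, GNU sed can do the same job. This is a sketch under that assumption; the function name strip_crlf is hypothetical.

```shell
# strip_crlf <file>
# Removes the trailing carriage return from every line (CRLF -> LF),
# keeping the original as <file>.bak. Relies on GNU sed's \r escape.
strip_crlf() {
  local file="$1"
  sed -i.bak 's/\r$//' "$file"
}
```

Calling strip_crlf on a script copied from Windows converts it in place, and the .bak copy can be deleted once the script runs.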
7. Command Cheat Sheet
kubectl get nodes
kubectl get pods -n <ns> -o wide
kubectl get svc -n <ns>
kubectl create namespace <name>
kubectl create deployment <name> --image=<img> --replicas=<n> -n <ns>
kubectl expose deployment <name> --type=NodePort --port=80 -n <ns>
kubectl set image deployment/<name> <container>=<new-image> -n <ns>
kubectl delete pods --all -n <ns>
kubectl logs <pod> -n <ns>
kubectl describe pod <pod> -n <ns>
docker pull / tag / push / save / load
docker login <harbor-ip> -u admin -p Harbor12345
crictl pull <image>
ctr -n k8s.io images import <file.tar>
ctr -n k8s.io images tag <old> <new>
systemctl start/stop/restart/status/enable <service>
promtool check config /etc/prometheus/prometheus.yml