1 Concepts

1.1 What is Prometheus

Prometheus is an open-source monitoring system that combines data collection, time-series storage, aggregation and analysis, and alerting, designed for distributed systems and in particular for cloud-native environments.

1.2 Features

  1. Data collection: metrics are collected over HTTP (primarily in pull mode);
  2. Storage: ships with a built-in TSDB (time-series database) that appends samples in time order; external time-series databases such as InfluxDB and OpenTSDB can also be integrated;
  3. Querying: powerful query capabilities via PromQL for filtering, aggregating, and analyzing time-series data;
  4. Visualization: integrates with the third-party tool Grafana to display monitoring data as friendly charts;
  5. Extensibility: adapts to different kinds of monitoring targets through components such as Exporters and the Pushgateway.

1.3 Core components

  1. Exporter: converts metrics of applications/systems that do not natively speak the Prometheus format into a format Prometheus understands, exposing them on a /metrics endpoint for scraping (e.g. Node Exporter, MySQL Exporter);
  2. Pushgateway: for short-lived jobs that cannot be scraped continuously by Prometheus; the job pushes its metrics to the Pushgateway, and Prometheus then scrapes them from there;
  3. TSDB (time-series database): Prometheus's built-in storage component, designed to efficiently store metric data generated in time order;
  4. Rules (alerting rules): monitoring conditions defined in PromQL (e.g. cpu_usage > 90%), used to decide whether a metric should fire an alert;
  5. Alertmanager: the alert notification module; it receives alerts fired by Prometheus and, after deduplication, grouping, and routing, notifies operators (e.g. via email or Slack);
  6. Grafana: a third-party visualization tool that can use Prometheus as a data source to build custom monitoring dashboards.
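
For the Pushgateway specifically, a short-lived job pushes its metrics over HTTP and Prometheus later scrapes the Pushgateway itself; a minimal sketch (the Pushgateway address and the backup_duration_seconds metric are illustrative):

# push one metric under job "db_backup"; the Pushgateway re-exposes it on its own /metrics
cat <<EOF | curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/db_backup
# TYPE backup_duration_seconds gauge
backup_duration_seconds 42.7
EOF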

2 The complete Prometheus workflow

 	Host / Service
 		▲
        │
        ├── collect metrics
        │
 Exporter (Node Exporter, nginx-exporter)
        ▲
        │
        │  The Exporter exposes a /metrics endpoint with service metrics for Prometheus to scrape
        │
   Prometheus
        │
        ├── TSDB storage
        │
        ├── PromQL queries
        │        │
        │        ▼
        │      Grafana
        │
        └── Alert Rules
                 │
                 ▼
          	Alertmanager
                 │
                 ├── alerts
         ┌───────┼────────┐
         ▼       ▼        ▼
       Email    Slack   DingTalk


2.1 Exporter: the service exposes metrics

The application or system must expose a metrics endpoint.

Component                  Purpose
Node Exporter              Host CPU, memory, and disk metrics
kube-state-metrics         Kubernetes resource state
Application's own metrics  e.g. Spring Boot exposing its own endpoint
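
Whichever exporter is used, the /metrics endpoint returns plain-text samples in the Prometheus exposition format; a representative excerpt from Node Exporter (the values are illustrative):

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 312098.41
node_cpu_seconds_total{cpu="0",mode="system"} 1432.77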

2.2 Key modules of the Prometheus configuration file

prometheus.yml is Prometheus's core configuration file. It is made up of the global configuration, scrape configuration, alerting rules, Alertmanager settings, and remote storage modules, which together control data collection, alert evaluation, and data storage and forwarding.

global				# global configuration
rule_files			# alerting rule files
scrape_configs		# scrape targets
alerting			# Alertmanager addresses
remote_write		# remote write
remote_read			# remote read
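
Putting the modules together, a minimal working prometheus.yml might look like the sketch below (the Alertmanager address and the scrape target are illustrative):

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - rules.yml

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager:9093"]

scrape_configs:
- job_name: node
  static_configs:
  - targets: ["192.168.10.13:9100"]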

2.2.1 global: global configuration

🔹 Purpose: defines Prometheus's default runtime parameters

Common parameters

Parameter            Purpose
scrape_interval      Default interval between scrapes
scrape_timeout       Scrape timeout
evaluation_interval  Interval at which rules are evaluated

Example

global:
  scrape_interval: 15s
  evaluation_interval: 15s

Workflow

Prometheus
     │
     │ every 15s
     ▼
scrape metrics from the monitored targets

📌 Notes

  • scrape_interval can be overridden per job
  • evaluation_interval controls how often alerting/recording rules are evaluated
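
For example, a single job can override the global interval (a minimal sketch; the job name and target are illustrative):

global:
  scrape_interval: 15s        # default for every job

scrape_configs:
- job_name: fast-service
  scrape_interval: 5s         # this job is scraped every 5s instead
  static_configs:
  - targets: ["192.168.10.13:8080"]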

2.2.2 rule_files: alerting rule configuration

🔹 Purpose: loads alerting rules and recording rules

groups
 ├─ name			# rule group name
 ├─ rules			# the rules
 │   ├─ alert		# alert name
 │   ├─ expr		# trigger condition (the core: a PromQL expression)
 │   ├─ for			# how long the condition must hold before the alert fires
 │   ├─ labels		# alert labels
 │   └─ annotations	# content of the alert notification

Example

# prometheus.yml
rule_files:
  - rules.yml

# rules.yml
groups:
- name: cpu-alerts
  rules:
  - alert: KubeProxyCpuUsageAbove80Percent
    expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 80%"

2.2.3 scrape_configs: scrape targets

🔹 Purpose: defines the targets Prometheus will monitor

scrape_configs
   ├── job_name					# name of the scrape job
   ├── metrics_path				# path of the metrics endpoint
   ├── static_configs			# static targets
   ├── kubernetes_sd_configs	# Kubernetes service discovery
   └── relabel_configs			# label rewriting

1️⃣ kubernetes_sd_configs (Kubernetes service discovery)

role                      Monitored object
node                      Node
pod                       Pod
service                   Service
endpoints (recommended)   Endpoint

Example

kubernetes_sd_configs:
- role: endpoints

📌 The difference between role: service and role: endpoints

Assume:

Service
 nginx-service(10.96.10.20:80)
      				│
			 	┌───┴────┐
Pod1(10.244.1.3:80)     Pod2(10.244.2.5:80)

With role: service, Prometheus sees only 1 target and cannot see the individual backing pods

Prometheus target

10.96.10.20:80

With role: endpoints, Prometheus sees 2 targets

Prometheus targets

10.244.1.3:80
10.244.2.5:80

2️⃣ relabel_configs (label rewriting ⭐)

🔹 Purpose: filter or modify scrape targets

Common uses:

  • Filtering pods
  • Modifying labels
  • Generating targets

📌 Common actions

keep / drop      → filter targets
replace          → modify / add labels
labelmap         → map Kubernetes labels in bulk

📌 Example 1:

# Meaning: keep only targets whose service name is nginx-exporter

relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
  action: keep
  regex: nginx-exporter

- source_labels: [__meta_kubernetes_namespace]
  action: replace
  target_label: k8s_namespace

- action: labelmap
  regex: __meta_kubernetes_service_label_(.+)

📌 Example 2:

labelmap simply maps the object's Kubernetes labels onto the Prometheus target's labels
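
As an illustration (the label name and value are hypothetical), a target discovered with the service-discovery metadata label below ends up with a plain target label after the labelmap rule in Example 1:

# before relabeling (service-discovery metadata)
__meta_kubernetes_service_label_app="nginx"

# after  action: labelmap  with regex __meta_kubernetes_service_label_(.+)
app="nginx"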


2.3 How Prometheus discovers monitoring targets

Prometheus needs to know what to monitor. Under the hood, Prometheus does not monitor a service directly; it monitors the service's Exporter and collects metrics through it.

Configuration file:

prometheus.yml

2.3.1 Prometheus deployed on a traditional host monitoring nginx
scrape_configs:
  - job_name: nginx
    metrics_path: "/metrics"
    
    static_configs:
# Target the address exposed by the nginx-exporter running on the nginx server
    - targets:
      - 192.168.10.13:9913
      labels:
        service: nginx

2.3.2 Prometheus deployed inside Kubernetes monitoring nginx
scrape_configs:
- job_name: "nginx"
  kubernetes_sd_configs:
  - role: endpoints				# resource type to discover

  relabel_configs:
# Keep only resources whose service name is nginx-exporter and whose endpoint port name is metrics
  - source_labels:
    - __meta_kubernetes_service_name
    action: keep
    regex: nginx-exporter
  - source_labels:
    - __meta_kubernetes_endpoint_port_name
    action: keep
    regex: metrics
-----------------------------------------------------------------------------------------
# Below is the Service manifest of the nginx-exporter deployed in Kubernetes; this is what Prometheus actually discovers
apiVersion: v1
kind: Service
metadata:
  name: nginx-exporter	# the service name matched above (service_name = nginx-exporter)
  namespace: default
  labels:
    app: nginx-exporter
spec:
  selector:
    app: nginx
  ports:
  - name: metrics		# the endpoint port name matched above (endpoint_port_name = metrics)
    port: 9113
    targetPort: 9113

2.3.3 Prometheus deployed on a traditional host monitoring nginx running inside Kubernetes
scrape_configs:
- job_name: "kubernetes-nginx-service"

  kubernetes_sd_configs:
  - role: endpoints								# resource type to discover
    api_server: https://192.168.10.68:6443		# locate the API server first

# Authentication
  bearer_token_file: /data/token
  tls_config:
    ca_file: /data/ca.crt

  relabel_configs:
# Then locate the nginx-exporter addresses
  - source_labels: [__meta_kubernetes_service_name]
    action: keep
    regex: nginx

  - source_labels: [__meta_kubernetes_endpoint_port_name]
    action: keep
    regex: metrics

Summary: Prometheus periodically scrapes the metrics from these addresses

  • For services in a traditional environment, Prometheus reaches the exporter via IP:port
  • For services inside Kubernetes, Prometheus reaches the exporter via the Service/Endpoints

2.4 TSDB: the time-series database

Prometheus stores its data in a local time-series database.

Feature            Description
TSDB               Time-series database
Data format        timestamp + metric + labels
Local storage      the /data directory
Auto compaction    saves disk space

Data structure:

metric_name{label=value} timestamp value

Example:

node_cpu_seconds_total{instance="node01"} 1710000000 0.23

2.5 PromQL queries

📌 Basic PromQL syntax and format

Syntax: function(metric_name{label_filters}[time_range])

Commonly used functions for custom metrics (business-oriented monitoring)

Function               Main use                                                        Example
rate()                 Growth rate of business requests (QPS, order volume, ...)      rate(order_total[5m] offset 1h)  (5-minute rate, anchored 1 hour in the past)
increase()             Growth over a time window (total orders, successful payments)  increase(payment_success_total[1h])
sum()                  Aggregate a metric across instances or services                sum(rate(http_requests_total[5m])) by (service)
avg()                  Average value (average response time / duration)               avg(http_request_duration_seconds)
max()                  Business peaks (maximum concurrency, maximum latency)          max(active_user_count)
histogram_quantile()   Latency percentiles (P90 / P95 / P99)                          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
predict_linear()       Predict business trends or capacity alerts                     predict_linear(order_total[1h], 3600)

📌 by makes the aggregation be computed separately for each value of the given label
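
For example, assuming an http_requests_total metric that carries a service label, sum() without by collapses everything into a single series, while by (service) keeps one series per service:

# one total across all services
sum(rate(http_requests_total[5m]))

# one series per value of the "service" label
sum(rate(http_requests_total[5m])) by (service)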


2.6 Alertmanager alerting

2.6.1 Sending alerts

Prometheus does not send notifications itself; it forwards fired alerts to Alertmanager.

prometheus.yml configuration

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

2.6.2 Handling alerts

alertmanager.yml
-----------------------------------------------------------------------------------------
route:
  receiver: email

receivers:
- name: email
  email_configs:
  - to: ops@example.com

2.7 Grafana dashboards

Open http://192.168.10.68:30254 in a browser and log in to Grafana.

2.7.1 Configure Grafana in the UI

Choose Add data source

[Name]: set to Prometheus
[Type]: select Prometheus
[URL]: set to http://prometheus.prometheus.svc:9090

Click [Save & Test] to finish the setup

2.7.2 Import monitoring dashboards

Search the official dashboard catalog: https://grafana.com/dashboards?dataSource=prometheus&search=kubernetes

You can either import a JSON file or enter a dashboard ID.

Click the + icon on the left and choose [Import]
Click [Upload .json File] to import the JSON template
Under [Prometheus], select the Prometheus data source
Click [Import]


3 Deploying Prometheus

3.1 Image preparation

# Image format: <Harbor IP>/<project>/<image>:<tag>

# node-exporter
192.168.10.100/myharbor/prom/node-exporter:v0.16.0

# kube-state-metrics
swr.cn-north-4.myhuaweicloud.com/ddn-k8s/quay.io/coreos/kube-state-metrics:v1.9.5

# Prometheus
192.168.10.100/myharbor/prometheus:v2.21.0

# Grafana
192.168.10.100/myharbor/grafana:v5.3.4

# Alertmanager
192.168.10.100/myharbor/alertmanager:v0.24.0

3.2 Create the RBAC resources

# Create the working directory
mkdir /opt/prometheus && cd /opt/prometheus/

vim prometheus-rbac.yaml
-----------------------------------------------------------------------------------------

---
# 1. Create the Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: prometheus
  labels:
    name: prometheus
    app: prometheus
---
# 2. Create the ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-sa
  namespace: prometheus
  labels:
    app: prometheus
    component: rbac
---
# 3. Create the ClusterRole (fine-grained permission control)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-clusterrole
  labels:
    app: prometheus
    component: rbac
rules:
# Core monitoring permissions - read-only
- apiGroups: [""]
  resources:
    - nodes
    - nodes/proxy
    - nodes/metrics
    - services
    - endpoints
    - pods
    - namespaces
    - configmaps
    - replicationcontrollers
    - persistentvolumes
    - persistentvolumeclaims
  verbs: ["get", "list", "watch"]

# Extended API group permissions
- apiGroups: ["extensions", "apps", "networking.k8s.io"]
  resources:
    - deployments
    - daemonsets
    - replicasets
    - statefulsets
    - ingresses
  verbs: ["get", "list", "watch"]

# Batch workload permissions
- apiGroups: ["batch"]
  resources:
    - jobs
    - cronjobs
  verbs: ["get", "list", "watch"]

# Autoscaling
- apiGroups: ["autoscaling"]
  resources:
    - horizontalpodautoscalers
  verbs: ["get", "list", "watch"]

# Storage
- apiGroups: ["storage.k8s.io"]
  resources:
    - storageclasses
    - volumeattachments
  verbs: ["get", "list", "watch"]

# Metrics API (if metrics-server is installed)
- apiGroups: ["metrics.k8s.io"]
  resources:
    - pods
    - nodes
  verbs: ["get", "list", "watch"]

# Custom resource monitoring
- apiGroups: ["apiextensions.k8s.io"]
  resources:
    - customresourcedefinitions
  verbs: ["get", "list", "watch"]

# Prometheus Operator resources (if used)
- apiGroups: ["monitoring.coreos.com"]
  resources:
    - prometheuses
    - servicemonitors
    - prometheusrules
    - alertmanagers
  verbs: ["get", "list", "watch"]

# Event permissions
- apiGroups: [""]
  resources:
    - events
  verbs: ["get", "list", "watch", "create", "update", "patch"]

# Allow Prometheus to perform health checks
- apiGroups: [""]
  resources:
    - componentstatuses
  verbs: ["get", "list"]

# Allow access to non-resource API endpoints
- nonResourceURLs:
    - /metrics
    - /healthz
    - /ready
    - /live
  verbs: ["get"]
---
# 4. Create the ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-clusterrolebinding
  labels:
    app: prometheus
    component: rbac
subjects:
- kind: ServiceAccount
  name: prometheus-sa
  namespace: prometheus
roleRef:
  kind: ClusterRole
  name: prometheus-clusterrole
  apiGroup: rbac.authorization.k8s.io
-----------------------------------------------------------------------------------------
kubectl apply -f prometheus-rbac.yaml
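
Optionally, verify that the ServiceAccount received the expected permissions (both commands should print "yes"):

kubectl auth can-i list pods --as=system:serviceaccount:prometheus:prometheus-sa
kubectl auth can-i watch endpoints --as=system:serviceaccount:prometheus:prometheus-sa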

3.3 Create the secret for pulling images from the Harbor registry

# Create the secret used to connect to the Harbor registry
kubectl create secret docker-registry harbor-secret \
  --docker-server=192.168.10.100 \
  --docker-username=harbor-secret \
  --docker-password=Harbor12345 \
  --docker-email=1194743653@qq.com \
  -n prometheus

# Attach the secret to the default ServiceAccount in the prometheus namespace
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "harbor-secret"}]}' \
  -n prometheus
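
Note that the Prometheus Deployment in section 3.7 runs under the prometheus-sa ServiceAccount rather than default; if image pulls fail for those pods, the same patch can be applied to that ServiceAccount as well:

kubectl patch serviceaccount prometheus-sa \
  -p '{"imagePullSecrets": [{"name": "harbor-secret"}]}' \
  -n prometheus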

3.4 Deploy node-exporter

# Deploy node-exporter
vim node-export.yaml
-----------------------------------------------------------------------------------------

---
apiVersion: apps/v1
kind: DaemonSet				# ensures every node in the cluster runs an identical copy of this pod
metadata:
  name: node-exporter
  namespace: prometheus
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
     name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: 192.168.10.100/myharbor/prom/node-exporter:v0.16.0
        ports:
        - containerPort: 9100
        resources:
          requests:
            cpu: 0.15		# this container needs at least 0.15 CPU cores to run
        securityContext:
          privileged: true	# run in privileged mode
        args:
        - --path.procfs
        - /host/proc
        - --path.sysfs
        - /host/sys
        - --collector.filesystem.ignored-mount-points
        - '^/(sys|proc|dev|host|etc)($|/)'
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /rootfs
      tolerations:
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
-----------------------------------------------------------------------------------------

kubectl apply -f node-export.yaml

curl -Ls http://192.168.10.80:9100/metrics | grep node_cpu_seconds

curl -Ls http://192.168.10.80:9100/metrics | grep node_load

3.5 Deploy the kube-state-metrics component

# Create the ServiceAccount and grant it permissions
vim kube-state-metrics-rbac.yaml 
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources: ["daemonsets", "deployments", "replicasets"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/quay.io/coreos/kube-state-metrics:v1.9.5
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app: kube-state-metrics
spec:
  ports:
  - name: kube-state-metrics
    port: 8080
    protocol: TCP
  selector:
    app: kube-state-metrics
-----------------------------------------------------------------------------------------

kubectl apply -f kube-state-metrics-rbac.yaml
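
A quick sanity check that kube-state-metrics is up and serving data (port-forwarding locally; kube_pod_status_phase is one of its standard metrics):

kubectl -n kube-system get pods -l app=kube-state-metrics
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | grep kube_pod_status_phase | head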

3.6 Create the ConfigMaps holding the configuration

# Create the Prometheus configuration file
vim prometheus-configmap.yaml
-----------------------------------------------------------------------------------------
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-configmap
  namespace: prometheus
data:
  prometheus.yml: |
    rule_files:      # alerting rule files; Prometheus evaluates these rules and pushes fired alerts to Alertmanager
     - "rules.yml"
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'
      kubernetes_sd_configs:
      - role:  node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name 
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-schedule'       # name of the scrape job
      scrape_interval: 5s                   # scrape interval; if unset, the global value is inherited
      static_configs:                       # static target configuration: scrape fixed targets
      - targets: ['192.168.10.66:10251']    # the targets to scrape from
    - job_name: 'kubernetes-controller-manager'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.10.66:10252']
    - job_name: 'kubernetes-kube-proxy'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.10.66:10249','192.168.10.67:10249','192.168.10.68:10249']
    - job_name: 'kubernetes-etcd'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.crt
        cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.crt
        key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.key
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.10.66:2379']
  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: kube-proxy的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert:  kube-proxy的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: scheduler的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert:  scheduler的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: controller-manager的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert:  controller-manager的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: apiserver的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert:  apiserver的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: etcd的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert:  etcd的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: kube-state-metrics的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过80%"
          value: "{{ $value }}%"
          threshold: "80%"      
      - alert: kube-state-metrics的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过90%"
          value: "{{ $value }}%"
          threshold: "90%"      
      - alert: coredns的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过80%"
          value: "{{ $value }}%"
          threshold: "80%"      
      - alert: coredns的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过90%"
          value: "{{ $value }}%"
          threshold: "90%"      
      - alert: HttpRequestsAvg
        expr: sum(rate(rest_client_requests_total{job=~"kubernetes-kube-proxy|kubernetes-kubelet|kubernetes-schedule|kubernetes-control-manager|kubernetes-apiservers"}[1m]))  > 1000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): TPS超过1000"
          value: "{{ $value }}"
          threshold: "1000"   
=========================================================================================

# Create the Alertmanager configuration file

vim alertmanager-configmap.yaml
-----------------------------------------------------------------------------------------
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager-configmap
  namespace: prometheus
data:
  alertmanager.yml: |-
    global:                    # sender (SMTP) settings
      resolve_timeout: 1m
      smtp_smarthost: 'smtp.qq.com:465'   # SMTP server; use port 465 (SSL) or 587
      smtp_from: 'your_email@qq.com'
      smtp_auth_username: 'your_email@qq.com'
      smtp_auth_password: 'your_auth_code'    # authorization code, generated under Settings -> Account in the QQ mail web UI
      smtp_require_tls: false
    route:                        # alert routing / distribution policy
      group_by: [alertname]         # label used to group alerts
      group_wait: 10s               # wait 10s after the first alert so alerts of the same group are sent together
      group_interval: 10s           # interval between successive notifications for the same group
      repeat_interval: 10m          # how often a still-firing alert is re-sent (default 1h); keeps duplicate mail down
      receiver: default-receiver    # who receives the alerts
    receivers:                    # recipient settings
    - name: 'default-receiver'
      email_configs:
      - to: 'your_email@qq.com'    # recipient email address
        send_resolved: true
=========================================================================================

kubectl apply -f prometheus-configmap.yaml
kubectl apply -f alertmanager-configmap.yaml

3.7 Deploy Prometheus + Alertmanager

# Prometheus will be scheduled onto node01; create the data storage directory on node01 first
mkdir /data && chmod 777 /data
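
The Deployment below also mounts a Secret named etcd-certs for scraping etcd over TLS; a minimal sketch of creating it, assuming a kubeadm cluster with the default certificate paths (adjust the paths for your environment):

kubectl -n prometheus create secret generic etcd-certs \
  --from-file=/etc/kubernetes/pki/etcd/ca.crt \
  --from-file=/etc/kubernetes/pki/etcd/server.crt \
  --from-file=/etc/kubernetes/pki/etcd/server.key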

# Deploy Prometheus (with Alertmanager in the same pod) via a Deployment
vim prometheus-alertmanager-server.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: prometheus
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      nodeName: node01
      serviceAccountName: prometheus-sa
      containers:
      - name: prometheus
        image: 192.168.10.100/myharbor/prometheus:v2.21.0
        imagePullPolicy: IfNotPresent
        securityContext:  # add a security context
          runAsUser: 0  # run as the root user
          runAsGroup: 0
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        - "--web.enable-lifecycle"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus
          name: prometheus-configmap
        - mountPath: /prometheus/
          name: prometheus-storage-volume
        - name: k8s-certs
          mountPath: /var/run/secrets/kubernetes.io/k8s-certs/etcd/
        - name: localtime
          mountPath: /etc/localtime
      - name: alertmanager
        image: 192.168.10.100/myharbor/alertmanager:v0.24.0
        imagePullPolicy: IfNotPresent
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--log.level=debug"
        - "--storage.path=/alertmanager/data"
        ports:
        - containerPort: 9093
          protocol: TCP
          name: alertmanager
        volumeMounts:
        - name: alertmanager-configmap
          mountPath: /etc/alertmanager
        - name: alertmanager-storage
          mountPath: /alertmanager
        - name: localtime
          mountPath: /etc/localtime
      volumes:
        - name: prometheus-configmap
          configMap:
            name: prometheus-configmap
        - name: prometheus-storage-volume
          hostPath:
           path: /data
           type: Directory
        - name: k8s-certs
          secret:
           secretName: etcd-certs
        - name: alertmanager-configmap
          configMap:
            name: alertmanager-configmap
        - name: alertmanager-storage
          hostPath:
           path: /data/alertmanager
           type: DirectoryOrCreate
        - name: localtime
          hostPath:
           path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: prometheus
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
    - port: 9090
      targetPort: 9090
      protocol: TCP
      nodePort: 31000
  selector:
    app: prometheus
    component: server
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: prometheus
  labels:
    name: prometheus
    kubernetes.io/cluster-service: 'true'
spec:
  ports:
  - name: alertmanager
    nodePort: 30066
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    app: prometheus
  sessionAffinity: None
  type: NodePort
=========================================================================================

kubectl apply -f prometheus-alertmanager-server.yaml

# Access Prometheus in a browser:    http://192.168.10.68:31000
# Access Alertmanager in a browser:  http://192.168.10.80:30066
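
Because Prometheus is started with --web.enable-lifecycle, configuration changes can be applied with a hot reload instead of restarting the pod (note that an updated ConfigMap can take a short while to propagate into the container):

kubectl apply -f prometheus-configmap.yaml

# ask Prometheus to re-read its configuration
curl -X POST http://192.168.10.68:31000/-/reload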

3.8 Deploy Grafana

vim grafana.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monitoring-grafana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      task: monitoring
      k8s-app: grafana
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: 192.168.10.100/myharbor/grafana:v5.3.4
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ca-certificates
          readOnly: true
        - mountPath: /var/lib/grafana
          name: grafana-storage
        env:
        - name: INFLUXDB_HOST
          value: monitoring-influxdb
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
          # The following env variables are required to make Grafana accessible via
          # the kubernetes api-server proxy. On production clusters, we recommend
          # removing these env variables, setup auth for grafana, and expose the grafana
          # service using a LoadBalancer or a public IP.
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          # If you're only using the API Server proxy, set this value instead:
          # value: /api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
          value: /
      volumes:
      - name: ca-certificates
        hostPath:
          path: /etc/ssl/certs
      - name: grafana-storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    # For use as a Cluster add-on (https://github.com/kubernetes/kubernetes/tree/master/cluster/addons)
    # If you are NOT using this as an addon, you should comment out this line.
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: monitoring-grafana
  name: monitoring-grafana
  namespace: kube-system
spec:
  # In a production setup, we recommend accessing Grafana through an external Loadbalancer
  # or through a public IP.
  # type: LoadBalancer
  # You could also use NodePort to expose the service at a randomly-generated port
  # type: NodePort
  ports:
  - port: 80
    targetPort: 3000
  selector:
    k8s-app: grafana
  type: NodePort
-----------------------------------------------------------------------------------------

kubectl apply -f grafana.yaml

kubectl get svc -n kube-system | grep grafana

3.9 Grafana UI configuration

See section 2.7.
