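Preface

💡 Pain points: A microservice misbehaves and nobody can find the root cause. Metrics, logs, and traces live in separate systems, so one investigation means paging through several dashboards. New services go live without a standard observability convention.

🎯 Solution: Collect telemetry uniformly with OpenTelemetry, store it in Prometheus, Loki, and Tempo, and use Grafana for visualization and alerting to build a one-stop observability platform.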

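Architecture overview (data flows from the bottom layer to the top):

Visualization layer: Grafana (dashboards and alerts)
Storage layer: Prometheus (time-series metrics), Loki (log aggregation), Tempo (distributed tracing)
Collection layer: OpenTelemetry SDK (automatic/manual instrumentation), OTLP Exporter
Data sources: application services (metrics/logs/traces), K8s cluster (resources/events), infrastructure (hosts/network)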


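1. OpenTelemetry Core Concepts

1.1 The three signals

| Signal | Purpose | Typical data types | Storage backend |
|---|---|---|---|
| Metrics | Aggregated numeric values (QPS/latency/CPU) | Counter/Gauge/Histogram | Prometheus |
| Logs | Discrete event records | Error/Warn/Info | Loki/Elasticsearch |
| Traces | End-to-end request tracing | Span/Trace | Tempo/Jaeger |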

1.2 The OpenTelemetry data model

# OpenTelemetry core concepts
class Signal:
    """
    - Metrics: Counter, UpDownCounter, Histogram, ObservableGauge
    - Logs: LogRecord (Body, Attributes, Severity)
    - Traces: Tracer -> Span -> SpanContext (TraceID, SpanID)
    """
    pass
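The SpanContext (TraceID, SpanID) listed above is what travels between services. As a minimal sketch, assuming propagation uses the W3C Trace Context `traceparent` header (the helper names `new_traceparent` and `parse_traceparent` are hypothetical, not part of the OpenTelemetry API), the layout looks roughly like this:

```python
import re
import secrets

# W3C Trace Context "traceparent" layout: version-traceid-spanid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Build a traceparent value with a random 128-bit trace ID and 64-bit span ID."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent value into its SpanContext fields."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"invalid traceparent: {header!r}")
    trace_id, span_id, flags = match.groups()
    # The lowest bit of the flags byte is the "sampled" flag.
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": (int(flags, 16) & 1) == 1,
    }
```

In a real service the SDK's propagators handle this header; the sketch only illustrates why a trace ID is 32 hex characters and a span ID is 16.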

2. Instrumenting a Python Application with OpenTelemetry

2.1 Installing dependencies

pip install opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-flask \
  opentelemetry-instrumentation-requests \
  prometheus-client \
  python-logstash

2.2 Basic initialization

# otel_init.py
import logging

from opentelemetry import trace, metrics
from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor

# Prometheus and Loki do not listen for OTLP gRPC on port 4317, so all three
# signals go to the OpenTelemetry Collector (section 7.3), which fans them out.
# The Service name below is an assumption; adjust it to your cluster.
OTLP_ENDPOINT = "http://otel-collector.monitoring:4317"

# Service resource attributes
resource = Resource.create({
    SERVICE_NAME: "user-service",
    "service.version": "1.0.0",
    "deployment.environment": "production",
    "host.name": "prod-user-svc-01",
})

# Trace setup
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=OTLP_ENDPOINT, insecure=True))
)
trace.set_tracer_provider(trace_provider)

# Metrics setup
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint=OTLP_ENDPOINT, insecure=True),
    export_interval_millis=10000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

# Logs setup
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint=OTLP_ENDPOINT, insecure=True))
)
set_logger_provider(logger_provider)

# Bridge Python logging into the OpenTelemetry log pipeline
logging.basicConfig(level=logging.INFO)
logging.getLogger().addHandler(
    LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
)

2.3 Manual instrumentation example

# instrument.py
from opentelemetry import trace, metrics
from opentelemetry.trace import Status, StatusCode
import time
import random

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Define metrics
request_counter = meter.create_counter(
    name="http_requests_total",
    description="Total HTTP requests",
    unit="1",
)

latency_histogram = meter.create_histogram(
    name="http_request_duration_seconds",
    description="HTTP request latency",
    unit="s",
)

active_users_gauge = meter.create_up_down_counter(
    name="active_users_count",
    description="Number of active users",
)

# Instrument a business span
@tracer.start_as_current_span("process_user_request")
def process_user_request(user_id: str, action: str):
    span = trace.get_current_span()
    span.set_attribute("user.id", user_id)
    span.set_attribute("user.action", action)
    span.set_attribute("request.timestamp", int(time.time()))

    try:
        # Business logic; `db` stands for the application's data-access layer (not defined here)
        start = time.perf_counter()
        result = db.query_user(user_id)
        duration = time.perf_counter() - start

        # Record metrics
        request_counter.add(1, {"action": action, "status": "success"})
        latency_histogram.record(duration, {"endpoint": "/users/{id}", "method": "GET"})

        span.set_status(Status(StatusCode.OK))
        return result

    except Exception as e:
        request_counter.add(1, {"action": action, "status": "error"})
        span.record_exception(e)
        span.set_status(Status(StatusCode.ERROR, str(e)))
        raise

2.4 Flask auto-instrumentation

# app.py
from flask import Flask, jsonify, request
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# Auto-instrumentation
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

# Request hook that tags the current span with the request ID
@app.before_request
def before_request():
    span = trace.get_current_span()
    span.set_attribute("http.request.id", request.headers.get("X-Request-ID", ""))
@app.route("/health")
def health():
    return jsonify({"status": "healthy"})

@app.route("/users/<user_id>")
def get_user(user_id):
    with trace.get_tracer(__name__).start_as_current_span("get_user") as span:
        span.set_attribute("user.id", user_id)
        # ... business logic
        return jsonify({"user_id": user_id, "name": "Test User"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

3. Deploying Observability Components on K8s

3.1 Deploying Prometheus + Grafana

# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: 50GB
    replicas: 2
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2
        memory: 8Gi

    # Select every PrometheusRule, not only those labeled with this Helm release
    ruleSelector: {}
    ruleSelectorNilUsesHelmValues: false

    # Automatic service discovery via ServiceMonitor
    serviceMonitorSelector:
      matchLabels:
        release: prometheus

    # Storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: "ssd-storageclass"
          resources:
            requests:
              storage: 100Gi

alertmanager:
  config:
    global:
      smtp_smarthost: "smtp.example.com:587"
      smtp_from: "alert@example.com"
    route:
      group_by: ["alertname", "cluster"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: "team-notifications"
    receivers:
      - name: "team-notifications"
        email_configs:
          - to: "oncall@example.com"

# Install
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring \
  --create-namespace \
  -f prometheus-values.yaml

3.2 Deploying Loki

# loki-values.yaml
loki:
  auth_enabled: false

  commonConfig:
    replication_factor: 3

  storage:
    type: s3
    s3:
      endpoint: "http://minio.monitoring:9000"
      bucketnames: "loki-chunks"
      insecure: true

  limits_config:
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    ingestion_rate_mb: 50
    ingestion_burst_size_mb: 100

  schema_config:
    configs:
      - from: "2024-01-01"
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: loki_index_
          period: 24h

  querier:
    max_concurrent: 16
    query_timeout: 3m

  query_scheduler:
    max_outstanding_requests_per_tenant: 256

# Single-replica development setup (with one replica, set replication_factor above to 1)
singleBinary:
  replicas: 1
  resources:
    requests:
      cpu: 100m
      memory: 256Mi

# Install
helm install loki grafana/loki \
  -n monitoring \
  --create-namespace \
  -f loki-values.yaml

3.3 Deploying Tempo for distributed tracing

# tempo-values.yaml
tempo:
  replicas: 3

  storage:
    trace:
      backend: s3
      s3:
        endpoint: "http://minio.monitoring:9000"
        bucket: tempo-traces
        insecure: true

  configuration:
    server:
      http_listen_port: 3100
      grpc_listen_port: 9095

    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318

    ingester:
      max_block_duration: 5m

    compactor:
      compaction:
        block_retention: 48h

    storage:
      trace:
        backend: s3
        s3:
          endpoint: "http://minio.monitoring:9000"
          bucket: tempo-traces

# Install
helm install tempo grafana/tempo \
  -n monitoring \
  --create-namespace \
  -f tempo-values.yaml

3.4 Configuring multiple data sources in Grafana

# grafana-datasources.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1

    datasources:
      - name: Prometheus
        uid: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus.monitoring:9090
        isDefault: true
        jsonData:
          timeInterval: 15s
          httpMethod: POST

      - name: Loki
        uid: Loki
        type: loki
        access: proxy
        url: http://loki.monitoring:3100
        jsonData:
          derivedFields:
            - name: TraceID
              matcherRegex: "trace_id=(\\w+)"
              url: "$${__value.raw}"
              datasourceUid: Tempo

      - name: Tempo
        uid: Tempo
        type: tempo
        access: proxy
        url: http://tempo.monitoring:3100
        jsonData:
          serviceMap:
            datasourceUid: Prometheus
          nodeGraph:
            enabled: true
          search:
            hide: false
          lokiSearch:
            datasourceUid: Loki
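The Loki derived field above only works if log lines actually contain a `trace_id=<id>` token. The following is a minimal sketch, assuming a plain-text log format; `TraceIdFilter` and `format_line` are hypothetical helpers, and the callable that supplies the trace ID stands in for whatever the tracing SDK exposes:

```python
import logging
import re

# Same pattern as the matcherRegex of the Loki derived field.
TRACE_ID_RE = re.compile(r"trace_id=(\w+)")


class TraceIdFilter(logging.Filter):
    """Attach a trace_id attribute to every log record (hypothetical helper)."""

    def __init__(self, get_trace_id):
        super().__init__()
        self._get_trace_id = get_trace_id  # callable returning the current trace ID

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self._get_trace_id()
        return True


def format_line(message: str, trace_id: str) -> str:
    """Render one log line in a format the derived-field regex can match."""
    formatter = logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    record = logging.LogRecord("demo", logging.INFO, "demo.py", 0, message, None, None)
    TraceIdFilter(lambda: trace_id).filter(record)
    return formatter.format(record)
```

With lines in this shape, Grafana can extract the ID from a log row and open the matching trace in Tempo.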

4. Grafana Dashboards in Practice

4.1 Infrastructure dashboard

{
  "title": "Kubernetes cluster overview",
  "panels": [
    {
      "title": "CPU usage",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
      "targets": [
        {
          "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (namespace) / sum(kube_pod_container_resource_limits{resource=\"cpu\"}) by (namespace)",
          "legendFormat": "{{namespace}}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "max": 1,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 0.7, "color": "yellow"},
              {"value": 0.85, "color": "red"}
            ]
          }
        }
      }
    },
    {
      "title": "Memory usage",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
      "targets": [
        {
          "expr": "sum(container_memory_working_set_bytes{container!=\"\"}) by (namespace) / sum(container_spec_memory_limit_bytes{container!=\"\"} > 0) by (namespace)",
          "legendFormat": "{{namespace}}"
        }
      ]
    },
    {
      "title": "Pod count",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 8},
      "targets": [
        {
          "expr": "count(kube_pod_info) by (namespace)"
        }
      ]
    },
    {
      "title": "Pod restarts (24h)",
      "type": "table",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
      "targets": [
        {
          "expr": "sum(increase(kube_pod_container_status_restarts_total[24h])) by (namespace, pod)",
          "format": "table"
        }
      ]
    }
  ]
}

4.2 Application performance dashboard

# prometheus_app_metrics.py
# Key PromQL queries for application performance
#
# RED method:
# - Rate: request rate
# - Errors: error rate
# - Duration: latency distribution

# Request rate (QPS)
sum(rate(http_requests_total{service="user-service"}[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

# P50/P90/P99 latency
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# In-flight requests (a gauge, so no rate())
sum(http_requests_in_flight) by (service)

# Database connection pool utilization
db_pool_connections_used / db_pool_connections_max
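The latency quantiles in this block are estimates: `histogram_quantile` interpolates linearly inside the bucket that contains the target rank. A rough Python sketch of that idea for classic cumulative buckets follows; it is a simplification for illustration, not Prometheus's actual implementation, and the function name is reused here only for readability:

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

The practical consequence is that quantile accuracy depends on bucket boundaries: a P99 that lands in a wide bucket can be off by a large margin, so buckets should be dense around the latency thresholds you alert on.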

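4.3 Tracing dashboard

# Tempo query examples

# Find slow traces (TraceQL)
{ resource.service.name = "user-service" && duration > 1s }

# Find error traces (TraceQL)
{ resource.service.name = "order-service" && status = error }

# Error spans per service (PromQL; assumes Tempo's metrics-generator with span metrics is enabled)
sum(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])) by (service)

# Service dependencies (PromQL; assumes the metrics-generator service-graphs processor is enabled)
sum(rate(traces_service_graph_request_total[5m])) by (client, server)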

5. Alert Rule Design

5.1 Prometheus alert rules

# prometheus-alerts.yaml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: |
          (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)
          / sum(container_spec_cpu_quota/container_spec_cpu_period) by (namespace, pod)) > 0.85
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU usage on Pod {{ $labels.namespace }}/{{ $labels.pod }}"
          description: "CPU usage {{ $value | humanizePercentage }} exceeds 85%"

      - alert: PodMemoryUsageHigh
        expr: |
          (sum(container_memory_working_set_bytes{container!=""}) by (namespace, pod)
          / sum(container_spec_memory_limit_bytes > 0) by (namespace, pod)) > 0.9
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Memory usage of Pod {{ $labels.namespace }}/{{ $labels.pod }} exceeds 90%"

      - alert: PodRestartingTooMuch
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted {{ $value }} times in the past hour"

      - alert: PVCAvailableSpaceLow
        expr: |
          (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} has less than 15% free space"

  - name: application
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, namespace)
            / sum(rate(http_requests_total[5m])) by (service, namespace)
          ) > 0.05
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Service {{ $labels.service }} error rate {{ $value | humanizePercentage }} exceeds 5%"
          runbook_url: "https://wiki.example.com/runbook/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, namespace)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Service {{ $labels.service }} P99 latency {{ $value }}s exceeds 2s"

      - alert: ServiceDown
        expr: |
          up{job="user-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"

  - name: business
    interval: 60s
    rules:
      - alert: RevenueAnomaly
        expr: |
          abs(
            rate(payment_amount_total[5m]) -
            avg_over_time(rate(payment_amount_total[5m])[7d:5m])
          ) > 3 * stddev_over_time(rate(payment_amount_total[5m])[7d:5m])
        for: 10m
        labels:
          severity: critical
          team: business
        annotations:
          summary: "Payment amount anomaly: deviates from the 7-day mean by more than 3σ"
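The RevenueAnomaly rule is a three-sigma check: the current rate is compared with the mean and standard deviation of the same rate over the past 7 days. A minimal Python sketch of that comparison, assuming the historical samples are already available as a list (`is_anomalous` is a hypothetical helper, and population standard deviation is used to mirror `stddev_over_time`):

```python
from statistics import fmean, pstdev


def is_anomalous(current: float, history: list, sigmas: float = 3.0) -> bool:
    """Return True when `current` deviates from the historical mean by more than `sigmas` standard deviations."""
    mean = fmean(history)
    spread = pstdev(history)  # population standard deviation
    return abs(current - mean) > sigmas * spread
```

For example, with a history of `[100, 102, 98, 101, 99]` the mean is 100 and the standard deviation is about 1.41, so 103 stays inside the band while 110 falls outside it. Metrics with strong daily or weekly seasonality tend to produce wide bands with this method, which is worth keeping in mind before paging on it.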

5.2 Alert routing configuration

# alertmanager-config.yaml
route:
  group_by: ["alertname", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"

  routes:
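    # Critical alerts -> page immediately; listed first with continue so the team routes below still apply
    - match:
        severity: critical
      receiver: "pagerduty"
      continue: true

    # Infrastructure alerts -> platform team
    - match:
        team: platform
      receiver: "platform-team"
      group_wait: 30s
      repeat_interval: 1h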

    # Business alerts -> business team
    - match:
        team: business
      receiver: "business-team"
      routes:
        - match:
            alertname: "RevenueAnomaly"
          receiver: "finance-slack"
          group_wait: 0s

    # Test-environment alerts -> record only
    - match:
        environment: test
      receiver: "null"
      group_wait: 10m

receivers:
  - name: "default"
    email_configs:
      - to: "oncall@example.com"
        send_resolved: true

  - name: "platform-team"
    slack_configs:
      - channel: "#platform-alerts"
        api_url: "https://hooks.slack.com/services/xxx"
        title: "{{ .CommonLabels.alertname }}"
        text: |
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          Labels: {{ .Labels }}
          {{ end }}
        send_resolved: true

  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "PAGERDUTY_INTEGRATION_KEY"
        severity: critical
        component: "monitoring"
        class: "availability"

  - name: "business-team"
    email_configs:
      - to: "business@example.com"

  - name: "null"
    # Empty receiver: alerts routed here are dropped
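Route order and the `continue` flag decide which receivers an alert reaches, which is easy to get wrong. The sketch below models the route tree in plain Python, assuming exact-match label matchers only and assuming the critical route is listed before the team routes; `Route` and `resolve` are hypothetical names, not Alertmanager APIs:

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)
    continue_: bool = False          # mirrors `continue: true`
    routes: list = field(default_factory=list)


def resolve(route: Route, labels: dict) -> list:
    """Return the receivers an alert reaches, walking the tree depth-first."""
    if any(labels.get(key) != value for key, value in route.match.items()):
        return []
    receivers = []
    for child in route.routes:
        matched = resolve(child, labels)
        receivers.extend(matched)
        if matched and not child.continue_:
            break  # the first matching child wins unless it sets continue
    # If no child matched, the alert is handled by this node's own receiver.
    return receivers or [route.receiver]


ROOT = Route("default", routes=[
    Route("pagerduty", {"severity": "critical"}, continue_=True),
    Route("platform-team", {"team": "platform"}),
    Route("business-team", {"team": "business"}, routes=[
        Route("finance-slack", {"alertname": "RevenueAnomaly"}),
    ]),
    Route("null", {"environment": "test"}),
])
```

With this ordering a critical platform alert resolves to both `pagerduty` and `platform-team`. If the platform route came first without `continue`, the same alert would stop there and never page.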

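6. Log Aggregation and Querying

6.1 Loki log queries

# Basic queries

# All logs of one service
{service="user-service"}

# Label filter plus line filter
{service="user-service", namespace="production"} |= "ERROR"

# Multiple conditions
{service="order-service"} |= "timeout" | json | duration > 5

# Regex match (backticks avoid double-escaping the backslashes)
{service="api-gateway"} |~ `status=(4\d{2}|5\d{2})`

# Aggregations
# Errors per minute
sum(
  count_over_time(
    {service="payment-service"} | json | status="error" [1m]
  )
) by (service)

# P99 of the latency field in logs (unwrap turns the extracted label into a sample value)
quantile_over_time(0.99,
  {service="db-query"} | json | unwrap latency [5m]
) by (query_type)

# Trace correlation
# From a TraceID to the matching logs
{service="user-service"} | json | trace_id="4bf92f3577b34da6"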

6.2 Log sampling strategy

# Promtail configuration: log sampling
scrape_configs:
  - job_name: high-volume-service
    static_configs:
      - targets:
          - localhost
        labels:
          service: kafka-consumer
          __path__: /var/log/app/*.log

    pipeline_stages:
      - json:
          expressions:
            level: level
            trace_id: trace_id

      - labels:
          level:
          trace_id:

      # Sampling: keep 10% of info logs; error/warn logs are kept in full
      - match:
          selector: '{service="kafka-consumer", level="info"}'
          stages:
            - sampling:
                rate: 0.1

      - match:
          selector: '{service="kafka-consumer", level=~"error|warn"}'
          stages:
            - sampling:
                rate: 1.0
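The same level-based policy can be expressed in a few lines of Python, which is handy for reasoning about how much volume it removes. This is only a sketch: it samples deterministically by hashing the log line (Promtail samples randomly), and `SAMPLE_RATES` and `keep_log` are hypothetical names:

```python
import hashlib

# Fraction of lines kept per level; unknown levels are kept in full.
SAMPLE_RATES = {"error": 1.0, "warn": 1.0, "info": 0.1}


def keep_log(level: str, line: str) -> bool:
    """Decide whether to ship a log line, based on its level and a stable hash."""
    rate = SAMPLE_RATES.get(level.lower(), 1.0)
    if rate >= 1.0:
        return True
    digest = hashlib.sha256(line.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < rate
```

Hash-based sampling has one property random sampling lacks: the same line always gets the same decision, so replicas of a service drop the same messages.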

7. Production Best Practices

7.1 Defining SLO/SLI/SLA

# slo-config.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: user-service-availability
  namespace: monitoring
spec:
  service: user-service
  slos:
    - name: requests-availability
      objective: 99.9   # percent; Sloth evaluates it over a 30d window by default
      description: "99.9% Availability"
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{service="user-service",status=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total{service="user-service"}[{{.window}}]))
      alerting:
        name: UserServiceAvailability
        labels:
          service: user-service
        annotations:
          summary: "User Service availability SLO breach"
        # Page on fast error-budget burn, open a ticket on slow burn
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning
---
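# Error budget policy
# 99.9% over 30 days allows at most about 43 minutes of downtime
# Page when the burn rate exceeds 14.4 over 1 hour, i.e. 2% of the 30-day error budget is consumed within that hour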
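The error-budget numbers above follow from simple arithmetic. A small worked sketch in Python, with hypothetical helper names, assuming the budget is measured in minutes of full unavailability:

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total allowed downtime for the window, in minutes."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(burn_rate: float, hours: float, window_days: int) -> float:
    """Fraction of the window's error budget used when burning at `burn_rate` for `hours`."""
    return burn_rate * hours / (window_days * 24)
```

For a 99.9% objective over 30 days, `error_budget_minutes(0.999, 30)` is about 43.2 minutes. A burn rate of 14.4 sustained for one hour gives `budget_consumed(14.4, 1, 30)`, about 0.02, meaning 2% of the monthly budget is gone in that hour; at that pace the whole budget lasts roughly two days.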

7.2 Reducing telemetry collection cost

# OpenTelemetry Collector: tail_sampling processor
processors:
  # Tail sampling: keep all errors and slow requests
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 50
    policies:
      # Keep every trace that contains an error
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Keep slow requests (>1s)
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}

      # Keep 1% of normal requests
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 1}

      # Keep specific business operations
      - name: business-operations-policy
        type: string_attribute
        string_attribute:
          key: operation.name
          values: [checkout, payment, register]
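Tail sampling decides after the whole trace has been buffered, and a trace is kept if any policy matches. The following Python sketch mirrors the four policies above under simplifying assumptions (all spans of a trace are already collected, and the probabilistic policy hashes the trace ID); `Span`, `KEEP_OPERATIONS`, and `keep_trace` are hypothetical names, not Collector APIs:

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float
    is_error: bool = False
    attributes: dict = field(default_factory=dict)


KEEP_OPERATIONS = {"checkout", "payment", "register"}


def keep_trace(trace_id: str, spans: list, baseline_pct: float = 1.0,
               slow_ms: float = 1000.0) -> bool:
    """Return True if any sampling policy wants to keep the trace."""
    # Policy 1: any span with an error status.
    if any(span.is_error for span in spans):
        return True
    # Policy 2: end-to-end trace duration above the latency threshold.
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration > slow_ms:
        return True
    # Policy 3: business-critical operations.
    if any(s.attributes.get("operation.name") in KEEP_OPERATIONS for s in spans):
        return True
    # Policy 4: probabilistic baseline; hash the trace ID into [0, 100).
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000 / 100.0
    return bucket < baseline_pct
```

Because the decision needs every span of a trace, all spans of one trace must reach the same Collector instance; that is the main operational cost of tail sampling compared with head sampling in the SDK.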

7.3 Deploying the OpenTelemetry Collector

# otel-collector-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  namespace: monitoring
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-pods'
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true

    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024

      memory_limiter:
        check_interval: 1s
        limit_mib: 512
        spike_limit_mib: 128

      k8sattributes:
        passthrough: false
        auth_type: "serviceAccount"
        extract:
          metadata:
            - k8s.namespace.name
            - k8s.deployment.name
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.node.name

      resourcedetection:
        detectors:
          - env
          - system

    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
        resource_to_telemetry_conversion:
          enabled: true

      otlp/tempo:
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true

      # Loki 3.x ingests OTLP over HTTP under /otlp
      otlphttp/loki:
        endpoint: http://loki.monitoring:3100/otlp

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, resourcedetection, batch]
          exporters: [otlp/tempo]

        metrics:
          receivers: [prometheus, otlp]
          processors: [memory_limiter, resourcedetection, batch]
          exporters: [prometheus]

        logs:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, resourcedetection, batch]
          exporters: [otlphttp/loki]
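A common mistake in Collector configs is a pipeline that references a component that was never declared, or declared under a different name. As a rough pre-deployment check, assuming the config has already been loaded into a Python dict (for example with a YAML parser), something like the hypothetical `undefined_components` helper below can catch it early:

```python
def undefined_components(config: dict) -> list:
    """List pipeline entries that are not declared in the matching top-level section."""
    problems = []
    for name, pipeline in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            declared = config.get(kind, {})
            for component in pipeline.get(kind, []):
                if component not in declared:
                    problems.append(f"{name}: {kind[:-1]} '{component}' is not declared")
    return problems


# Tiny illustrative config; the component names are examples only.
EXAMPLE = {
    "receivers": {"otlp": {}},
    "processors": {"memory_limiter": {}, "batch": {}},
    "exporters": {"otlp/tempo": {}},
    "service": {"pipelines": {
        "traces": {"receivers": ["otlp"],
                   "processors": ["memory_limiter", "batch"],
                   "exporters": ["otlp/tempo"]},
        "logs": {"receivers": ["otlp"],
                 "processors": ["batch"],
                 "exporters": ["otlphttp/loki"]},
    }},
}
```

This only checks names, not whether a backend accepts the signal; the Collector itself validates the full config at startup.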

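8. Checklist Summary

□ Infrastructure layer
  □ Prometheus + Grafana deployed (persistent storage, HA configuration)
  □ Loki log aggregation (S3/MinIO storage backend)
  □ Tempo distributed tracing (OTLP ingestion, 48h retention)
  □ Alertmanager alert routing (multi-level notification)

□ Application instrumentation layer
  □ OpenTelemetry SDK initialization (one shared Resource definition)
  □ Auto-instrumentation for Flask/FastAPI/Gradio
  □ Manual span instrumentation on business-critical paths
  □ The three signals (Metrics/Logs/Traces) correlated with each other

□ Dashboard design
  □ Infrastructure layer: CPU/memory/disk/network
  □ Application performance layer: QPS/error rate/P99 latency
  □ Business layer: conversion rate/order volume/DAU

□ Alerting
  □ Infrastructure: CPU/memory/PVC/SLN
  □ Application: 5xx error rate/P99 latency/service unavailable
  □ Business: GMV anomalies/sharp drop in registrations
  □ SLO error budget monitoring

□ Cost optimization
  □ Tail sampling (keep errors + slow requests)
  □ Log sampling (Info 10%, Error 100%)
  □ Data retention policy (Metrics 15d, Logs 30d, Traces 48h)
  □ OTLP compression (gzip on the wire)

□ Operations
  □ On-call rotation + runbook documentation
  □ Alert inhibition (to avoid alert storms)
  □ Change windows and a canary release process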

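Summary

In one sentence: observability = Metrics (Prometheus) + Logs (Loki) + Traces (Tempo), with OpenTelemetry as the unified collection standard and Grafana as the single visualization layer, covering the full chain from infrastructure to application to business.

Core technology stack:

| Component | Role | Key configuration |
|---|---|---|
| OpenTelemetry SDK | Unified instrumentation standard | Auto-instrumentation + manual spans |
| Prometheus | Time-series metric storage | remoteWrite + alert rules |
| Loki | Log aggregation | Schema v12 + S3 backend |
| Tempo | Distributed tracing | OTLP ingestion + S3 storage |
| Grafana | Visualization dashboards | Multi-tenancy + SSO |
| Alertmanager | Alert routing | Tiered notification + inhibition rules |

Recommended next steps:

  • CI/CD automation pipelines (GitHub Actions + ArgoCD + containerized deployment)
  • Advanced K8s operations (eBPF + Gateway API + multi-cluster management)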