Ascend NPU Monitoring and Observability: Making AI Infrastructure "Visible" (Complete Edition)

L、218

Published 2026-05-25 12:40:59

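I. Monitoring System Design: A Three-Layer Architecture

In a production Ascend NPU deployment, monitoring has to cover the full stack, from the hardware at the bottom, through the inference service, to the business metrics at the top.

1. Architecture Overview

2. Core Metric Definitions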

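Layer	Metric	Unit	Alert when (suggested)	Meaning
Hardware	`npu_compute_util`	%	> 80% for 5 min	Whether compute is saturated
	`npu_memory_used`	MB	> 90% of total	Risk of a memory leak or OOM
	`npu_temperature`	°C	> 75°C	Risk of thermal throttling
	`npu_power`	W	> TDP	Power limit reached
Inference	`inference_latency_p99`	ms	> SLA threshold	Upper bound on user-perceived latency
	`inference_qps`	req/s	-	System throughput
	`inference_error_rate`	%	> 0.1%	Stability
Business	`recommendation_ctr`	%	-	Business effectiveness
	`embedding_cache_hit`	%	< 95%	Cache efficiency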
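
The hardware-layer rows of the table can be encoded as data, so that the same limits drive ad-hoc checks as well as dashboards. The sketch below is only an illustration: `Threshold`, `HARDWARE_THRESHOLDS` and `check` are hypothetical names, `npu_memory_total` is an assumed companion metric that the table does not list, and the 300 W TDP is a placeholder rather than a datasheet value. It evaluates a single sample, so the "for 5 min" condition on utilization is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str                                  # metric name from the table
    breached: Callable[[Dict[str, float]], bool]  # True when the limit is exceeded
    meaning: str


ASSUMED_TDP_W = 300.0  # placeholder; look up the real TDP of your card

HARDWARE_THRESHOLDS: List[Threshold] = [
    Threshold("npu_compute_util",
              lambda s: s["npu_compute_util"] > 80,
              "compute saturated"),
    Threshold("npu_memory_used",
              lambda s: s["npu_memory_used"] / s["npu_memory_total"] > 0.90,
              "device memory nearly exhausted"),
    Threshold("npu_temperature",
              lambda s: s["npu_temperature"] > 75,
              "thermal throttling risk"),
    Threshold("npu_power",
              lambda s: s["npu_power"] > ASSUMED_TDP_W,
              "power limit reached"),
]


def check(sample: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose threshold is breached in one sample."""
    return [t.metric for t in HARDWARE_THRESHOLDS if t.breached(sample)]


sample = {
    "npu_compute_util": 85.0,
    "npu_memory_used": 30000.0,   # MB
    "npu_memory_total": 32768.0,  # MB (assumed companion metric)
    "npu_temperature": 70.0,
    "npu_power": 250.0,
}
print(check(sample))  # ['npu_compute_util', 'npu_memory_used']
```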

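II. Implementation: A Complete Metrics Collector

Basic NPUCollector and InferenceMetricsCollector classes are a starting point, but production use calls for something more robust, for example asynchronous collection, retries on error, and multi-device aggregation.

1. Enhanced NPU Collector (`npumonitor.py`)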

import json
import subprocess
import time
from dataclasses import dataclass, asdict
from typing import Dict, List, Optional

import torch
import torch_npu  # noqa: F401  (Ascend adapter; importing it registers torch.npu)

@dataclass
class NPUMetrics:
    """Metrics for a single NPU at one point in time."""
    timestamp: float
    device_id: int
    compute_util: float      # AI Core utilization, %
    hbm_util: float          # HBM bandwidth utilization, %
    memory_used_mb: float
    memory_total_mb: float
    temperature_c: float
    power_w: float
    ecc_errors: int = 0      # ECC error count
    status: str = "Normal"   # health status

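class AscendMonitor:
    """
    Metrics collector for Ascend NPUs.

    Features:
      1. Two collection modes: npu-smi first, PyTorch API as a fallback
      2. npu-smi runs in a subprocess with a 5 s timeout, so a hung call
         cannot block collection indefinitely
      3. Parses npu-smi output automatically
      4. Renders metrics in the Prometheus text exposition format
    """
    
    def __init__(self, device_ids: Optional[List[int]] = None):
        if device_ids is None:
            device_ids = list(range(torch.npu.device_count()))
        self.device_ids = device_ids
        # Not used by this class. 9090 is also the Prometheus server's default
        # port, so pick another one if both run on the same host.
        self.exporter_port = 9090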
        
    def collect_all(self) -> List[Dict]:
        """Collect metrics from every device and return them as a list of dicts."""
        metrics_list = []
        
        for dev_id in self.device_ids:
            try:
                metric = self._collect_device(dev_id)
                metrics_list.append(metric)
            except Exception as e:
                print(f"[Error] Device {dev_id} collection failed: {e}")
                metrics_list.append({
                    "device_id": dev_id,
                    "timestamp": time.time(),
                    "status": "Error",
                    "error": str(e)
                })
        
        return metrics_list
    
    def _collect_device(self, device_id: int) -> Dict:
        """Collect metrics for a single card."""
        # Preferred path: npu-smi.
        # NOTE: the "-o json" flag and the JSON field names below are
        # assumptions; verify them against the npu-smi output of your
        # driver version before relying on them.
        try:
            cmd = ["npu-smi", "info", "-t", "all", "-i", str(device_id), "-o", "json"]
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=5)
            
            if result.returncode == 0:
                data = json.loads(result.stdout)
                info = data.get("devices", [{}])[0].get("device_info", {})
                
                return {
                    "device_id": device_id,
                    "timestamp": time.time(),
                    "compute_util": float(info.get("ai_core_util", 0)),
                    "hbm_util": float(info.get("hbm_bandwidth_util", 0)),
                    # Assumes npu-smi reports memory in KB.
                    "memory_used_mb": float(info.get("memory_used", 0)) / 1024,
                    "memory_total_mb": float(info.get("memory_total", 0)) / 1024,
                    "temperature_c": float(info.get("temperature", 0)),
                    "power_w": float(info.get("power_consumption", 0)),
                    "ecc_errors": int(info.get("ecc_error_count", 0)),
                    "status": "Normal"
                }
            print(f"[Warning] npu-smi exited with code {result.returncode} for device {device_id}")
        except Exception as e:
            print(f"[Warning] npu-smi failed for device {device_id}: {e}")
        
        # Fallback: PyTorch API (only part of the information is available)
        try:
            torch.npu.set_device(device_id)
            props = torch.npu.get_device_properties(device_id)
            # Counts only memory allocated by this process through PyTorch,
            # not device-wide usage.
            mem_alloc = torch.npu.memory_allocated() / 1024**2
            
            return {
                "device_id": device_id,
                "timestamp": time.time(),
                "compute_util": 0.0,  # not exposed by PyTorch
                "hbm_util": 0.0,
                "memory_used_mb": mem_alloc,
                "memory_total_mb": props.total_memory / 1024**2,
                "temperature_c": 0.0,
                "power_w": 0.0,
                "ecc_errors": 0,
                "status": "Unknown"
            }
        except Exception as e:
            raise RuntimeError(f"Failed to collect metrics for device {device_id}: {e}") from e

    def prometheus_metrics(self) -> str:
        """Render all device metrics in the Prometheus text format."""
        prefix = "ascend_npu_"
        # Exported metric suffix -> key in the collected dict.
        fields = {
            "compute_util": "compute_util",
            "hbm_util": "hbm_util",
            "memory_used": "memory_used_mb",
            "memory_total": "memory_total_mb",
            "temperature": "temperature_c",
            "power": "power_w",
            "ecc_errors": "ecc_errors",
        }
        lines = []
        
        for m in self.collect_all():
            dev = m["device_id"]
            # 1 only when collection succeeded and the card reports Normal.
            healthy = 1 if m.get("status") == "Normal" else 0
            lines.append(f'{prefix}status{{device="{dev}",state="normal"}} {healthy}')
            for name, key in fields.items():
                # Devices whose collection failed carry no metric values.
                if key in m:
                    lines.append(f'{prefix}{name}{{device="{dev}"}} {m[key]}')
            
        return "\n".join(lines) + "\n"

# Usage example: print one scrape's worth of metrics to stdout
if __name__ == "__main__":
    monitor = AscendMonitor()
    print(monitor.prometheus_metrics())
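
The listing above only builds the exposition text; something still has to serve it on `/metrics` for Prometheus to scrape. Below is a minimal sketch using only the standard library. `build_response`, `make_handler`, `render` and `serve` are hypothetical names, the stub `render` stands in for `AscendMonitor().prometheus_metrics`, and port 9101 is an arbitrary choice made to stay clear of the Prometheus server's default 9090.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable, Tuple


def build_response(path: str, render: Callable[[], str]) -> Tuple[int, str, bytes]:
    """Map a request path to (status code, content type, body)."""
    if path != "/metrics":
        return 404, "text/plain; charset=utf-8", b"not found\n"
    # The text format requires the last line to end with a newline.
    body = (render().rstrip("\n") + "\n").encode("utf-8")
    return 200, "text/plain; version=0.0.4; charset=utf-8", body


def make_handler(render: Callable[[], str]):
    """Build a request handler class that serves render() on /metrics."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, content_type, body = build_response(self.path, render)
            self.send_response(status)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep per-scrape access logs out of stderr

    return MetricsHandler


def render() -> str:
    # Stub; in practice return AscendMonitor().prometheus_metrics()
    return 'ascend_npu_temperature{device="0"} 55.0'


def serve(port: int = 9101) -> None:
    """Block forever, answering scrapes on the given port."""
    ThreadingHTTPServer(("0.0.0.0", port), make_handler(render)).serve_forever()
```

Calling `serve()` starts the endpoint. Metrics are collected on each scrape, so the Prometheus scrape interval is what sets the sampling rate.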

2. Inference Performance Monitor (`inference_monitor.py`)

import time
from collections import deque
from typing import Deque, Tuple

import numpy as np

class InferenceMonitor:
    """
    Real-time inference performance monitor.
    
    Keeps samples from the last `window_size` seconds (60 by default) in a
    bounded deque, so memory use stays capped.
    """
    
    def __init__(self, window_size: int = 60, max_samples: int = 10000):
        self.window_size = window_size
        # One (timestamp, latency_ms, success) tuple per request. maxlen caps
        # memory; above max_samples / window_size req/s the oldest samples are
        # evicted early and QPS is under-reported, so size it for peak load.
        self.samples: Deque[Tuple[float, float, bool]] = deque(maxlen=max_samples)
        self.start_time = time.time()
        
    def _evict_expired(self, now: float) -> None:
        """Drop samples that have fallen out of the window."""
        cutoff = now - self.window_size
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()
    
    def record(self, latency_ms: float, success: bool = True):
        """Record the outcome of one inference request."""
        now = time.time()
        self.samples.append((now, latency_ms, success))
        self._evict_expired(now)
    
    def get_stats(self) -> dict:
        """Return statistics over the current window."""
        # Evict here too, so stats do not go stale when traffic stops.
        self._evict_expired(time.time())
        if not self.samples:
            return {"status": "no_data"}
        
        lats = np.array([s[1] for s in self.samples])
        total_req = len(self.samples)
        errors = sum(1 for s in self.samples if not s[2])
        
        return {
            # Averaged over the whole window, so it reads low until the
            # monitor has been running for window_size seconds.
            "qps": round(total_req / self.window_size, 2),
            "latency_p50": round(float(np.percentile(lats, 50)), 2),
            "latency_p90": round(float(np.percentile(lats, 90)), 2),
            "latency_p95": round(float(np.percentile(lats, 95)), 2),
            "latency_p99": round(float(np.percentile(lats, 99)), 2),
            "latency_mean": round(float(lats.mean()), 2),
            "latency_std": round(float(lats.std()), 2),
            "error_rate": round(errors / total_req * 100, 2),
            "uptime_sec": round(time.time() - self.start_time, 0)
        }
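
The Grafana panel and the alert rule later in this article query `inference_latency_bucket`, which is a Prometheus histogram series. `InferenceMonitor` computes percentiles in-process and does not produce that series, so something has to count requests into cumulative buckets. The sketch below shows the idea with the standard library only; `LatencyHistogram` and its bucket bounds are hypothetical, and values are kept in milliseconds to match the 50 ms threshold used in this article. In a real service the `Histogram` class from the `prometheus_client` package is the usual choice.

```python
import bisect
from typing import List, Sequence


class LatencyHistogram:
    """Cumulative latency histogram rendered in the Prometheus text layout."""

    def __init__(self, name: str = "inference_latency",
                 bounds_ms: Sequence[float] = (5, 10, 20, 50, 100, 200)):
        self.name = name
        self.bounds = sorted(bounds_ms)              # bucket upper bounds ("le")
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is the +Inf bucket
        self.total_ms = 0.0

    def observe(self, latency_ms: float) -> None:
        # bisect_left returns the index of the first bound >= latency,
        # i.e. the smallest bucket whose "le" still covers the value.
        self.counts[bisect.bisect_left(self.bounds, latency_ms)] += 1
        self.total_ms += latency_ms

    def render(self) -> str:
        lines: List[str] = []
        running = 0
        # zip stops at the last finite bound; +Inf is handled separately.
        for bound, count in zip(self.bounds, self.counts):
            running += count
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {running}')
        running += self.counts[-1]
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {running}')
        lines.append(f"{self.name}_sum {self.total_ms}")
        lines.append(f"{self.name}_count {running}")
        return "\n".join(lines)


hist = LatencyHistogram()
for ms in (3.0, 12.0, 12.5, 48.0, 300.0):
    hist.observe(ms)
print(hist.render())
```

Bucket counts are cumulative: each `le` line counts every request at or below that bound, which is the layout `histogram_quantile` expects.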

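III. Grafana Dashboard Configuration

The JSON below defines an Ascend NPU monitoring dashboard. The outer `dashboard` wrapper is the shape the Grafana HTTP API expects; when importing through the UI you may need to paste only the inner object. The queries assume a Prometheus scrape job named `ascend`, and the memory gauge assumes a card with 4 GB (4096 MB) of device memory, so adjust `max` and the thresholds to your hardware.

1. Grafana Dashboard JSON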

{
  "dashboard": {
    "title": "Ascend NPU Inference Monitoring Dashboard",
    "tags": ["ascend", "npu", "ai"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "NPU Real-Time Load",
        "type": "timeseries",
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "ascend_npu_compute_util{job=\"ascend\"}",
            "legendFormat": "{{device}} AI Core Util",
            "color": "#E0B400"
          },
          {
            "expr": "ascend_npu_hbm_util{job=\"ascend\"}",
            "legendFormat": "{{device}} HBM Bandwidth",
            "color": "#FF9830"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "max": 100,
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 80},
                {"color": "red", "value": 95}
              ]
            }
          }
        }
      },
      {
        "id": 2,
        "title": "Device Memory Usage",
        "type": "gauge",
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "ascend_npu_memory_used{job=\"ascend\"}",
            "legendFormat": "{{device}} Used",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "mbytes",
            "decimals": 1,
            "max": 4096,
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 3000},
                {"color": "red", "value": 3800}
              ]
            }
          }
        }
      },
      {
        "id": 3,
        "title": "Inference Latency (P99)",
        "type": "stat",
        "gridPos": {"x": 0, "y": 8, "w": 8, "h": 6},
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum by (le) (rate(inference_latency_bucket[5m])))",
            "legendFormat": "P99 Latency",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "ms",
            "color": {"mode": "thresholds"},
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "orange", "value": 20},
                {"color": "red", "value": 50}
              ]
            }
          }
        }
      },
      {
        "id": 4,
        "title": "Temperature and Power Trends",
        "type": "timeseries",
        "gridPos": {"x": 8, "y": 8, "w": 16, "h": 6},
        "targets": [
          {
            "expr": "ascend_npu_temperature{job=\"ascend\"}",
            "legendFormat": "{{device}} Temp (°C)",
            "color": "#F2495C"
          },
          {
            "expr": "ascend_npu_power{job=\"ascend\"}",
            "legendFormat": "{{device}} Power (W)",
            "color": "#5794F2"
          }
        ]
      },
      {
        "id": 5,
        "title": "System Health Status",
        "type": "table",
        "gridPos": {"x": 0, "y": 14, "w": 24, "h": 6},
        "targets": [
          {
            "expr": "ascend_npu_status{state=\"normal\"}",
            "format": "table",
            "instant": true
          }
        ],
        "transformations": [
          {
            "id": "organize",
            "options": {
              "excludeByName": {},
              "indexByName": {}
            }
          }
        ]
      }
    ],
    "refresh": "5s",
    "time": {
      "from": "now-1h",
      "to": "now"
    }
  }
}
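
A dashboard file can be sanity-checked offline before it is imported. JSON does not allow comments, so a stray `//` annotation is enough to make an import fail, and Grafana lays panels out on a 24-column grid. The sketch below checks those two things plus duplicate panel ids; `validate_dashboard` is a hypothetical helper, not part of Grafana.

```python
import json
from typing import Any, Dict, List


def validate_dashboard(text: str) -> List[str]:
    """Return a list of problems found in a dashboard JSON document."""
    try:
        doc: Dict[str, Any] = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} at line {exc.lineno}"]

    # Accept both the bare dashboard model and the {"dashboard": ...} wrapper.
    dash = doc.get("dashboard", doc)
    problems: List[str] = []
    seen_ids = set()
    for panel in dash.get("panels", []):
        pid = panel.get("id")
        if pid in seen_ids:
            problems.append(f"duplicate panel id {pid}")
        seen_ids.add(pid)
        pos = panel.get("gridPos", {})
        # A panel must fit inside the 24-column grid.
        if pos.get("x", 0) + pos.get("w", 0) > 24:
            problems.append(f"panel {pid} overflows the 24-column grid")
    return problems


example = '{"dashboard": {"panels": [{"id": 1, "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}}]}}'
print(validate_dashboard(example))  # []
```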

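IV. Alerting Rules (Prometheus and Alertmanager)

When a metric goes abnormal, someone must be notified right away. The following are example Prometheus alerting rules; Alertmanager then routes the resulting alerts to the notification channels.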

groups:
  - name: ascend_npu_alerts
    rules:
      # 1. NPU temperature too high
      - alert: NpuTemperatureHigh
        expr: ascend_npu_temperature > 75
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NPU {{ $labels.device }} temperature too high"
          description: "Device {{ $labels.device }} reached {{ $value }}°C, above the 75°C threshold"

      # 2. Device memory usage too high
      - alert: NpuMemoryUsageHigh
        expr: ascend_npu_memory_used / ascend_npu_memory_total > 0.9
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "NPU {{ $labels.device }} device memory nearly exhausted"
          description: "Device memory usage is {{ $value | humanizePercentage }}; there is a risk of OOM"

      # 3. Inference latency above target
      - alert: InferenceLatencyP99High
        expr: histogram_quantile(0.99, sum by (le) (rate(inference_latency_bucket[5m]))) > 50
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Inference P99 latency above 50 ms"
          description: "Current P99 latency is {{ $value }} ms; the SLA requires < 50 ms"

      # 4. ECC error count increased
      - alert: NpuEccError
        expr: increase(ascend_npu_ecc_errors[1h]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "NPU {{ $labels.device }} reported ECC errors"
          description: "{{ $value }} new ECC errors in the past hour; this may indicate a hardware fault"
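
The latency rule depends on how `histogram_quantile` turns bucket counts into a number, which is worth seeing once. The function below is a simplified sketch of that estimate (linear interpolation inside the bucket that contains the target rank), written from the documented behavior rather than copied from Prometheus. It assumes at least one finite bucket plus the `+Inf` bucket, and the example bucket values are made up.

```python
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The bucket list must include an upper bound of float("inf").
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper = buckets[idx][0]
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    below = buckets[idx - 1][1] if idx > 0 else 0.0
    in_bucket = buckets[idx][1] - below
    # Assume observations are spread evenly between lower and upper.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Made-up cumulative counts for bounds 5, 10, 20, 50, 100, 200 ms and +Inf.
buckets = [(5, 1), (10, 1), (20, 3), (50, 4), (100, 4), (200, 4), (float("inf"), 5)]
print(histogram_quantile(0.5, buckets))   # 17.5
print(histogram_quantile(0.99, buckets))  # 200
```

Two consequences follow. The estimate is only as precise as the bucket layout, so 50 ms should be one of the bucket bounds if the rule compares against 50. And once the P99 lands in the `+Inf` bucket the result is capped at the highest finite bound, so that bound has to sit above the alert threshold for the rule to fire.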

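V. Deployment and Best Practices

1. Deployment Options

Kubernetes:
- Deploy node-exporter and ascend-monitor as DaemonSets.
- Let Prometheus Operator discover the services automatically.
- Manage versions with Helm charts.
Bare metal:
- Run prometheus and grafana under systemd.
- Configure a cron job that calls npu-smi and writes the output to a log file.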
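
For the bare-metal cron option, one approach is a small script that appends one JSON line per device on each run. The sketch below is an assumption about how that could look: `append_snapshot` and `fake_collect` are hypothetical names, and `fake_collect` stands in for `AscendMonitor().collect_all()`.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, List


def append_snapshot(log_path: Path, collect: Callable[[], List[Dict]]) -> int:
    """Append one JSON line per device to log_path; return the lines written."""
    records = collect()
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)


def fake_collect() -> List[Dict]:
    # Stub standing in for AscendMonitor().collect_all()
    return [{"device_id": 0, "timestamp": time.time(), "temperature_c": 55.0}]
```

A script would call `append_snapshot(Path(...), fake_collect)` with a log location of your choosing. Cron's finest schedule is once per minute, so this route cannot reach a 15 s sampling interval by itself; a long-running exporter scraped by Prometheus can.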

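2. Best Practices

Sampling interval: collect NPU metrics every 15 s and inference metrics every 1 s.
Retention: keep short-term data (7 days) in Prometheus and long-term data (1 year) in Thanos backed by object storage such as S3.
Alert tiers:
- Critical (phone/SMS): temperature > 80°C, ECC errors, service unavailable.
- Warning (DingTalk/WeCom): temperature > 75°C, device memory > 90%, latency above SLA.
- Info (log only): normal fluctuation, version updates.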
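
The tiers above can be written down as a single function, which makes the precedence explicit (critical conditions are checked first). This is a sketch only: `classify` and the sample keys are hypothetical names. In a Prometheus setup the same decision is normally expressed through the `severity` label on each rule and routed by Alertmanager.

```python
from typing import Dict


def classify(sample: Dict[str, float]) -> str:
    """Map one metrics sample to an alert tier: critical, warning or info."""
    mem_ratio = sample["memory_used_mb"] / sample["memory_total_mb"]
    if (sample["temperature_c"] > 80
            or sample["ecc_errors"] > 0
            or not sample["service_up"]):
        return "critical"   # phone / SMS
    if (sample["temperature_c"] > 75
            or mem_ratio > 0.90
            or sample["latency_p99_ms"] > sample["sla_ms"]):
        return "warning"    # DingTalk / WeCom
    return "info"           # log only


healthy = {
    "temperature_c": 60.0, "ecc_errors": 0, "service_up": True,
    "memory_used_mb": 1000.0, "memory_total_mb": 32768.0,
    "latency_p99_ms": 20.0, "sla_ms": 50.0,
}
print(classify(healthy))                          # info
print(classify(dict(healthy, temperature_c=78)))  # warning
print(classify(dict(healthy, temperature_c=82)))  # critical
```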
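Automated remediation:
- Temperature too high -> automatically trigger a frequency-reduction script.
- OOM detected -> automatically restart the inference service.
- Deadlock detected -> automatically kill the process and recover.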
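
Automatic actions like these can loop, for example a service that is restarted, runs out of memory again, and is restarted again. One common guard, which is an addition here and not something the list above specifies, is a per-alert cooldown. The sketch below shows the idea; `Remediator` is a hypothetical class, `InferenceOom` is a hypothetical alert name, and the actions are placeholders for real scripts.

```python
import time
from typing import Callable, Dict


class Remediator:
    """Run at most one remediation per alert name within a cooldown window."""

    def __init__(self, actions: Dict[str, Callable[[], None]],
                 cooldown_s: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.actions = actions
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self._last_run: Dict[str, float] = {}

    def handle(self, alert_name: str) -> bool:
        """Return True if an action ran, False if unknown or still cooling down."""
        action = self.actions.get(alert_name)
        if action is None:
            return False
        now = self.clock()
        last = self._last_run.get(alert_name)
        if last is not None and now - last < self.cooldown_s:
            return False
        self._last_run[alert_name] = now
        action()
        return True


log = []
remediator = Remediator({
    "NpuTemperatureHigh": lambda: log.append("run throttle script"),   # placeholder
    "InferenceOom": lambda: log.append("restart inference service"),   # placeholder
})
print(remediator.handle("NpuTemperatureHigh"))  # True
print(remediator.handle("NpuTemperatureHigh"))  # False (inside the cooldown)
```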