OTLP Protocol Support: Native OpenTelemetry Integration

zhanghongbin01

Published: 2026-04-11 11:33:18

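A closer look at the OpenTelemetry integration in AI Observability Agent

Introduction to OpenTelemetry

What is OpenTelemetry?

OpenTelemetry (OTel) is an open-source observability framework that provides a unified set of standards and tools for generating, collecting, and exporting telemetry data (metrics, logs, and traces).

Core value:

Standardized: a unified telemetry data format and collection standard
Extensible: a rich plugin ecosystem
Vendor-neutral: works with many storage backends
Cross-language: SDKs for many programming languages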

OpenTelemetry Protocol (OTLP)

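OTLP is OpenTelemetry's standard data transfer protocol. It defines how telemetry data is encoded and moved between components.

Protocol version: v1.0+
Transports:

gRPC: binary Protobuf over HTTP/2, default port 4317
HTTP: Protobuf or JSON payloads over HTTP, default port 4318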

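OTLP in Detail

1. gRPC Receiver

AI Observability Agent receives OTLP metrics over gRPC:

Configuration:

otlp:
  enabled: true
  grpc_endpoint: 0.0.0.0:4317  # gRPC listen address

Implementation:

Uses the tonic crate to implement the gRPC service
Implements the Export method of opentelemetry.proto.collector.metrics.v1.MetricsService
Handles requests asynchronously to support high concurrency

Performance characteristics:

Binary protocol with efficient transfer
Multiplexes concurrent requests over one HTTP/2 connection (the Export RPC itself is unary)
Suited to high-throughput workloads
Supports TLS encryption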

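2. HTTP Receiver

The Agent also accepts OTLP metrics over HTTP:

Configuration:

otlp:
  enabled: true
  http_endpoint: 0.0.0.0:4318  # HTTP listen address

Implementation:

Exposes the /v1/metrics endpoint
Accepts JSON-encoded payloads
Handles HTTP requests asynchronously

When to use it:

Networks where gRPC is restricted
Simple integrations
Debugging and testing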
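
To make the HTTP path concrete, here is a minimal sketch of an OTLP/HTTP JSON endpoint using only Python's standard library. It is not the Agent's implementation (which the article describes as built on the tonic crate for gRPC); `count_data_points`, `MetricsHandler`, and `serve` are hypothetical names, and the sketch assumes the request body uses the lowerCamelCase field names of the OTLP/JSON encoding.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# OTLP/JSON keys under which a metric can carry its data points.
DATA_KINDS = ("gauge", "sum", "histogram", "exponentialHistogram", "summary")


def count_data_points(payload: dict) -> int:
    """Count the data points in an OTLP/JSON metrics export request."""
    total = 0
    for rm in payload.get("resourceMetrics", []):
        for sm in rm.get("scopeMetrics", []):
            for metric in sm.get("metrics", []):
                for kind in DATA_KINDS:
                    total += len(metric.get(kind, {}).get("dataPoints", []))
    return total


class MetricsHandler(BaseHTTPRequestHandler):
    """Hypothetical handler for POST /v1/metrics with a JSON body."""

    def do_POST(self):
        if self.path != "/v1/metrics":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except ValueError:  # covers malformed JSON and bad encodings
            self.send_error(400, "invalid JSON")
            return
        print(f"received {count_data_points(payload)} data points")
        body = b"{}"  # an empty JSON response object means full success
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "0.0.0.0", port: int = 4318) -> None:
    """Run the sketch receiver until interrupted."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the listener on port 4318; a real receiver would hand the parsed payload to the conversion stage instead of printing a count.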

3. Receive Flow

┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│  OTLP sender        │────→│  OTLP receiver      │────→│  Metric converter   │
│  (Claude Code)      │     │  (gRPC/HTTP)        │     │  (OTLP→Prom)        │
└─────────────────────┘     └─────────────────────┘     └─────────────────────┘
                                                                   │
                                                                   ↓
┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│  Remote storage     │←────│  Data processing    │←────│  Metric processing  │
│  (Prometheus)       │     │  (cost/quality)     │     │  (cache/batching)   │
└─────────────────────┘     └─────────────────────┘     └─────────────────────┘

Steps:

Receive OTLP metric data
Decode the Protocol Buffers payload
Convert to Prometheus format
Process and enrich the metrics
Push to remote storage

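Metric Type Conversion

OpenTelemetry defines several metric types. The Agent converts each of them to Prometheus format automatically.

1. Gauge

OTLP definition:

A value that can move up or down freely
Examples: temperature, memory usage

Conversion rules:

Maps directly to a Prometheus Gauge
All labels are preserved
The timestamp is preserved

Example: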

# OTLP
metric_name{label="value"} 123

# Prometheus
metric_name{label="value"} 123
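
The listing above shows both sides in Prometheus text notation for readability; on the wire an OTLP gauge is structured data. As a rough sketch of the mapping, the code below renders one gauge data point as a Prometheus text line. It assumes the OTLP/JSON field names (`asInt`, `asDouble`, `timeUnixNano`, `attributes`); the function names are hypothetical and this is not the Agent's converter.

```python
def attrs_to_labels(attributes: list) -> dict:
    """Flatten OTLP/JSON attributes into a plain dict of string labels."""
    labels = {}
    for attr in attributes:
        value = attr["value"]
        # An AnyValue holds exactly one typed field, e.g. {"stringValue": "x"}.
        labels[attr["key"]] = str(next(iter(value.values())))
    return labels


def gauge_point_to_line(name: str, point: dict) -> str:
    """Render one OTLP/JSON gauge data point as a Prometheus text line.

    Escaping of quotes and backslashes in label values is omitted for brevity.
    """
    labels = attrs_to_labels(point.get("attributes", []))
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    # 64-bit integers arrive as JSON strings, doubles as JSON numbers.
    value = int(point["asInt"]) if "asInt" in point else point["asDouble"]
    # OTLP timestamps are nanoseconds; Prometheus text format uses milliseconds.
    ts_ms = int(point["timeUnixNano"]) // 1_000_000
    selector = f"{name}{{{label_str}}}" if label_str else name
    return f"{selector} {value} {ts_ms}"
```

The instant in time is kept, although its unit changes from nanoseconds to milliseconds.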

2. Counter

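OTLP definition:

A value that only increases (a monotonic Sum in the OTLP data model)
Examples: request count, error count

Conversion rules:

Maps to a Prometheus Counter
Resets are detected through start_time_unix_nano
All labels are preserved

Example: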

# OTLP
http_requests_total{method="GET"} 100

# Prometheus
http_requests_total{method="GET"} 100
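
Reset handling can be illustrated with a small tracker. In a cumulative OTLP Sum, every data point carries the start time of its accumulation period, so a later start time means the sender began counting again. The sketch below is one plausible way to use that field; `ResetDetector` is a hypothetical helper, not the Agent's code.

```python
class ResetDetector:
    """Track cumulative counter series and flag resets (hypothetical helper)."""

    def __init__(self):
        # series key -> (start_time_unix_nano, last_value)
        self._seen = {}

    def observe(self, key: str, start_time_unix_nano: int, value: float) -> bool:
        """Return True if this point begins a new counting period."""
        previous = self._seen.get(key)
        self._seen[key] = (start_time_unix_nano, value)
        if previous is None:
            return False  # first sighting, nothing to compare against
        prev_start, prev_value = previous
        # A newer start time means the sender restarted its accumulation;
        # a drop in value is treated as a reset as a fallback.
        return start_time_unix_nano > prev_start or value < prev_value
```

A converter could use the return value to emit a fresh series start so that `rate()` in Prometheus does not see a negative jump.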

3. Histogram

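OTLP definition:

The distribution of observed values
Carries bucket counts, a sum, and a sample count

Conversion rules:

Produces three Prometheus series families:
- _bucket: cumulative count per bucket
- _sum: sum of all observed values
- _count: number of observations
Bucket boundaries are preserved, and a le="+Inf" bucket is added

Example: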

# OTLP
execution_time_bucket{le="0.1"} 5
execution_time_bucket{le="0.5"} 15
execution_time_bucket{le="1.0"} 25
execution_time_sum 10.5
execution_time_count 25

# Prometheus
execution_time_bucket{le="0.1"} 5
execution_time_bucket{le="0.5"} 15
execution_time_bucket{le="1.0"} 25
execution_time_bucket{le="+Inf"} 25
execution_time_sum 10.5
execution_time_count 25
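
One detail the listing hides: an OTLP histogram data point stores a count per bucket (`bucket_counts`) alongside the upper bounds (`explicit_bounds`), while Prometheus expects cumulative counts. The sketch below shows that accumulation step under those assumptions; `histogram_to_prometheus` is a hypothetical name.

```python
import math


def histogram_to_prometheus(name, explicit_bounds, bucket_counts, total_sum):
    """Convert one OTLP histogram data point into Prometheus sample tuples.

    OTLP stores per-bucket counts, with one more count than bounds (the last
    bucket is the overflow). Prometheus expects cumulative counts per `le`.
    """
    if len(bucket_counts) != len(explicit_bounds) + 1:
        raise ValueError("bucket_counts must have one more entry than explicit_bounds")
    samples = []
    running = 0
    for bound, count in zip(list(explicit_bounds) + [math.inf], bucket_counts):
        running += count
        le = "+Inf" if math.isinf(bound) else str(bound)
        samples.append((f"{name}_bucket", {"le": le}, running))
    samples.append((f"{name}_sum", {}, total_sum))
    samples.append((f"{name}_count", {}, running))
    return samples
```

With bounds `[0.1, 0.5, 1.0]`, per-bucket counts `[5, 10, 10, 0]`, and sum `10.5`, this yields the cumulative values 5, 15, 25, 25 shown in the Prometheus listing above.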

4. Summary

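OTLP definition:

The quantiles of observed values
Carries quantile values, a sum, and a sample count

Conversion rules:

Produces three Prometheus series families:
- _sum: sum of all observed values
- _count: number of observations
- {quantile="..."}: the value at each quantile

Example: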

# OTLP
response_time_sum 100.5
response_time_count 10
response_time{quantile="0.5"} 8.5
response_time{quantile="0.9"} 15.2
response_time{quantile="0.99"} 25.7

# Prometheus
response_time_sum 100.5
response_time_count 10
response_time{quantile="0.5"} 8.5
response_time{quantile="0.9"} 15.2
response_time{quantile="0.99"} 25.7
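
For summaries the mapping is nearly one to one, since the quantiles are already computed by the sender. A minimal sketch, assuming the data point has been decoded into `(quantile, value)` pairs plus sum and count (`summary_to_prometheus` is a hypothetical name):

```python
def summary_to_prometheus(name, quantile_values, total_sum, count):
    """Convert an OTLP summary data point into Prometheus sample tuples.

    quantile_values is a list of (quantile, value) pairs, mirroring the
    quantile_values field of an OTLP SummaryDataPoint.
    """
    samples = [(f"{name}_sum", {}, total_sum), (f"{name}_count", {}, count)]
    for quantile, value in sorted(quantile_values):
        # Each quantile becomes its own series, distinguished by a label.
        samples.append((name, {"quantile": str(quantile)}, value))
    return samples
```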

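5. Label Handling

Label conversion rules:

Original labels are kept
Extra labels are injected (for example source, environment)
Conflicts are resolved in favor of local labels
Label allowlists and denylists are supported

Label normalization:

Label names are made to conform to the Prometheus naming rules
Invalid characters are removed
Special characters are converted to underscores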
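
The rules above can be sketched in a few lines. This is an illustration under assumptions, not the Agent's code: `sanitize_label_name` and `build_labels` are hypothetical names, and filtering is applied to the sanitized label name.

```python
import re

_INVALID = re.compile(r"[^a-zA-Z0-9_]")


def sanitize_label_name(name: str) -> str:
    """Make a label name match Prometheus' [a-zA-Z_][a-zA-Z0-9_]* pattern."""
    cleaned = _INVALID.sub("_", name)
    # Label names must not start with a digit.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def build_labels(incoming, local, allow=None, deny=()):
    """Sanitize incoming labels, filter them, then overlay local labels."""
    result = {}
    for key, value in incoming.items():
        key = sanitize_label_name(key)
        if key in deny or (allow is not None and key not in allow):
            continue
        result[key] = value
    # Locally configured labels win on conflict, as described above.
    result.update({sanitize_label_name(k): v for k, v in local.items()})
    return result
```

For example, `build_labels({"service.name": "x", "source": "app"}, {"source": "claude_code"})` keeps `service_name` and replaces `source` with the local value.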

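Claude Code Integration Example

1. Configure Claude Code

Claude Code supports OpenTelemetry natively; you only need to set environment variables:

# Enable telemetry and point the OTLP exporter at the Agent
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Optional: set the service name
export OTEL_SERVICE_NAME=claude_code

# Start Claude Code
claude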

2. Configure the Agent

# config/agent_config.yaml
agent:
  log_level: info
  listen_address: 0.0.0.0:9090

otlp:
  enabled: true
  grpc_endpoint: 0.0.0.0:4317
  http_endpoint: 0.0.0.0:4318
  prefix: ai            # metric name prefix
  labels:
    source: claude_code
    environment: production

remote_write:
  endpoint: http://prometheus:9090/api/v1/write

3. Verify the Integration

After starting the Agent, check its log:

INFO otlp receiver grpc_endpoint=0.0.0.0:4317 starting
INFO otlp receiver http_endpoint=0.0.0.0:4318 starting
INFO otlp metrics received from=127.0.0.1 count=15

Query the Claude Code metrics in Prometheus:

# Session count
sum(ai_claude_code_session_count_total)

# Token usage
sum(ai_claude_code_token_usage_tokens_total)

# Cost
sum(ai_claude_code_cost_usage_USD_total)

# Lines of code
sum(ai_claude_code_lines_of_code_count_total)
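
The metric names in these queries are consistent with the common OTLP-to-Prometheus naming convention applied on top of the `ai` prefix. The sketch below reproduces that derivation. It assumes Claude Code emits dotted names such as `claude_code.token.usage` with unit `tokens`, and that the Agent follows the convention; `prometheus_name` is a hypothetical helper, and units containing special characters are not handled.

```python
import re


def prometheus_name(otlp_name, unit="", prefix="", monotonic_counter=False):
    """Derive a Prometheus-style metric name from an OTLP name and unit."""
    # Dots and other invalid characters become underscores.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otlp_name)
    if prefix:
        name = f"{prefix}_{name}"
    # Append the unit unless the name already ends with it.
    if unit and unit != "1" and not name.endswith(f"_{unit}"):
        name = f"{name}_{unit}"
    # Monotonic counters conventionally get a _total suffix.
    if monotonic_counter and not name.endswith("_total"):
        name = f"{name}_total"
    return name
```

If a query returns nothing, comparing the expected name from a helper like this against the names actually stored in Prometheus is a quick first check.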

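4. Common Issues

Problem: Claude Code cannot connect to the Agent
Fix:

Check that the Agent is running
Check network connectivity
Confirm that the OTLP endpoint is configured correctly
Look for errors in the Agent log

Problem: metrics have no prefix
Fix:

Set the prefix field in the otlp configuration

Problem: labels are wrong
Fix:

Set the labels field in the otlp configuration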

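Custom Application Integration

1. Using the OpenTelemetry SDK

Python example:

from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

# Describe the service that emits the metrics
resource = Resource.create({
    "service.name": "my-ai-app",
    "environment": "production"
})

# Export over gRPC to the Agent; insecure=True disables TLS for local use
exporter = OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True)

# The SDK hands metrics to the exporter through a periodic reader
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=10_000)

# Create the meter provider
meter_provider = MeterProvider(resource=resource, metric_readers=[reader])

# Register it as the global meter provider
metrics.set_meter_provider(meter_provider)

# Get a meter
meter = metrics.get_meter("my-ai-app")

# Create an instrument
token_counter = meter.create_counter(
    name="ai_tokens_total",
    description="Total AI tokens used",
    unit="1"
)

# Record a measurement
token_counter.add(
    100,
    {"model": "gpt-4o", "type": "input"}
)

# Flush pending metrics before the process exits
meter_provider.shutdown()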

Node.js example:

const { MeterProvider, PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');

// Create the exporter
const exporter = new OTLPMetricExporter({
  url: 'http://localhost:4317'
});

// Create the meter provider; the reader exports on a fixed interval
// (older 1.x SDKs used meterProvider.addMetricReader(reader) instead)
const meterProvider = new MeterProvider({
  readers: [
    new PeriodicExportingMetricReader({ exporter, exportIntervalMillis: 10000 })
  ]
});

// Get a meter
const meter = meterProvider.getMeter('my-ai-app');

// Create a counter
const tokenCounter = meter.createCounter('ai_tokens_total', {
  description: 'Total AI tokens used',
  unit: '1'
});

// Record a measurement
tokenCounter.add(100, {
  model: 'gpt-4o',
  type: 'input'
});

2. Using the HTTP Protocol

Where gRPC is not available, use OTLP over HTTP instead:

Python example:

from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

# Use the HTTP exporter (it sends Protobuf-encoded payloads over HTTP)
exporter = OTLPMetricExporter(
    endpoint="http://localhost:4318/v1/metrics"
)

curl example (for testing):

# Send a simple OTLP metric (field names follow the OTLP/JSON encoding)
NOW_NS="$(date +%s)000000000"
curl -X POST http://localhost:4318/v1/metrics \
  -H "Content-Type: application/json" \
  -d '{
    "resourceMetrics": [{
      "resource": {
        "attributes": [{
          "key": "service.name",
          "value": {"stringValue": "test-app"}
        }]
      },
      "scopeMetrics": [{
        "scope": {
          "name": "test-scope"
        },
        "metrics": [{
          "name": "test_counter",
          "unit": "1",
          "sum": {
            "aggregationTemporality": 2,
            "isMonotonic": true,
            "dataPoints": [{
              "asInt": "100",
              "timeUnixNano": "'"$NOW_NS"'",
              "attributes": [{
                "key": "label",
                "value": {"stringValue": "value"}
              }]
            }]
          }
        }]
      }]
    }]
  }'
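
The same kind of test request can be built from Python with the standard library alone. This sketch uses the lowerCamelCase field names defined by the OTLP/JSON encoding; `build_sum_payload` and `send_metrics` are hypothetical helpers, and the URL assumes the Agent's HTTP receiver is listening on localhost:4318 as configured earlier.

```python
import json
import time
import urllib.request


def build_sum_payload(service, metric, value, attributes=None):
    """Build an OTLP/JSON request body holding one cumulative counter point."""
    def kv(key, val):
        return {"key": key, "value": {"stringValue": str(val)}}

    now_ns = str(time.time_ns())  # 64-bit integers are sent as JSON strings
    return {
        "resourceMetrics": [{
            "resource": {"attributes": [kv("service.name", service)]},
            "scopeMetrics": [{
                "scope": {"name": "test-scope"},
                "metrics": [{
                    "name": metric,
                    "unit": "1",
                    "sum": {
                        "aggregationTemporality": 2,  # CUMULATIVE
                        "isMonotonic": True,
                        "dataPoints": [{
                            "asInt": str(value),
                            "timeUnixNano": now_ns,
                            "attributes": [kv(k, v) for k, v in (attributes or {}).items()],
                        }],
                    },
                }],
            }],
        }]
    }


def send_metrics(payload, url="http://localhost:4318/v1/metrics"):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A typical call is `send_metrics(build_sum_payload("test-app", "test_counter", 100, {"label": "value"}))`, which should return 200 when the receiver accepts the data.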

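3. Best Practices

Naming:

Use snake_case metric names
Include the unit in the name (for example _seconds, _bytes, _count)
Keep names short and descriptive

Labels:

Avoid high-cardinality labels (for example user IDs, request IDs)
Use meaningful label values
Keep labels consistent across metrics

Data volume:

Choose a sensible collection interval
Avoid emitting too many time series
Use aggregation to reduce data volume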
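
High-cardinality labels are easy to spot before they reach storage. As a small sketch (the helper name and the default limit of 100 are arbitrary assumptions), the function below counts distinct values per label across a set of series:

```python
from collections import defaultdict


def high_cardinality_labels(series, limit=100):
    """Return label names whose number of distinct values exceeds limit.

    series is an iterable of label dicts, one per time series.
    """
    values = defaultdict(set)
    for labels in series:
        for key, value in labels.items():
            values[key].add(value)
    return {key: len(vals) for key, vals in values.items() if len(vals) > limit}
```

A label such as `request_id` shows up immediately, because each series contributes a new value, while `model` stays at a handful of values.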

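Performance Tuning

1. gRPC Tuning

Connection reuse:

Enable gRPC connection pooling
Set reasonable connection timeouts

Batching:

Enable metric batching on the client
Reduce the number of network round trips

Compression:

Enable gRPC compression
Shrink the payload on the wire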
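
OpenTelemetry SDKs and the Collector already batch for you, so the following is only a sketch of the idea: buffer items and ship them in one call once the batch is full or has waited too long. `MetricBatcher` and its parameters are hypothetical, and the age check runs only when an item is added (there is no background timer).

```python
import time


class MetricBatcher:
    """Buffer items and flush when the batch is full or too old (illustrative)."""

    def __init__(self, send, max_size=512, max_age_seconds=5.0, clock=time.monotonic):
        self._send = send            # callable that ships one list of items
        self._max_size = max_size
        self._max_age = max_age_seconds
        self._clock = clock
        self._items = []
        self._first_added = None

    def add(self, item):
        """Add one item, flushing if a size or age threshold is reached."""
        if not self._items:
            self._first_added = self._clock()
        self._items.append(item)
        too_old = self._clock() - self._first_added >= self._max_age
        if len(self._items) >= self._max_size or too_old:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single batch."""
        if self._items:
            batch, self._items = self._items, []
            self._send(batch)
```

Seven items with `max_size=3` result in two sends of three items each, plus one final send of the remaining item when `flush()` is called at shutdown.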

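2. Receiver-side Optimization

Concurrency:

Tune the gRPC server's worker thread pool size
Tune the HTTP server configuration

Memory management:

Choose reasonable buffer sizes
Release memory promptly

Processing speed:

Optimize the metric conversion logic
Avoid unnecessary computation

3. Monitoring the OTLP Receiver

Exposed metrics:

otlp_receiver_requests_total: total requests received
otlp_receiver_metrics_total: total metrics received
otlp_receiver_errors_total: total receive errors
otlp_receiver_request_duration_seconds: request processing latency

Alert rules: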

groups:
- name: otlp_receiver
  rules:
  - alert: OTLPReceiverErrors
    expr: rate(otlp_receiver_errors_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: OTLP receiver errors
      description: The OTLP receiver has reported errors over the past 5 minutes

  - alert: OTLPReceiverHighLatency
    expr: histogram_quantile(0.95, rate(otlp_receiver_request_duration_seconds_bucket[5m])) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: OTLP receiver latency is high
      description: OTLP receiver 95th-percentile latency is above 500ms
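
The latency alert relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for illustration only (it is not Prometheus' code and skips edge cases such as negative bounds); it assumes the lowest bucket starts at 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets via linear interpolation.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    bound and ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Overflow or empty bucket: fall back to the bound below it.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

With cumulative buckets `[(0.1, 5), (0.5, 15), (1.0, 25), (math.inf, 25)]`, the 0.95 quantile falls in the 0.5 to 1.0 bucket and comes out near 0.94 seconds, which is above the 0.5 second threshold in the rule. The real rule applies the same idea to per-second bucket rates rather than raw counts.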

Integration with Other Systems

1. OpenTelemetry Collector

Architecture:

┌─────────────┐     ┌───────────────────┐     ┌────────────────────┐
│  AI app     │────→│  OpenTelemetry    │────→│  AI Observability  │
│  (OTLP)     │     │  Collector        │     │  Agent             │
└─────────────┘     └───────────────────┘     └────────────────────┘

Example configuration:

# otel-collector.yaml
exporters:
  otlp:
    endpoint: "prom-agent:4317"
    tls:
      insecure: true

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

2. Prometheus

Architecture:

┌─────────────┐     ┌────────────────────┐     ┌──────────────┐
│  AI app     │────→│  AI Observability  │────→│  Prometheus  │
│  (OTLP)     │     │  Agent             │     │              │
└─────────────┘     └────────────────────┘     └──────────────┘

Prometheus setup:

# The remote-write receiver is enabled with a command-line flag, not in prometheus.yml
prometheus --config.file=prometheus.yml --web.enable-remote-write-receiver

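Troubleshooting

1. Connection Problems

Symptoms:

The OTLP sender cannot connect to the Agent
Connection errors appear in the log

Steps:

Check that the Agent is running
Check network connectivity
Confirm that the port is open
Check firewall settings
Review the Agent log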
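
A quick way to confirm that a port is reachable is to attempt a TCP connection. The helper below is a minimal standard-library sketch (`port_is_open` is a hypothetical name). It only proves that something is listening, not that the listener speaks OTLP.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

For the setup in this article, `port_is_open("localhost", 4317)` checks the gRPC receiver and `port_is_open("localhost", 4318)` the HTTP receiver.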

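2. Data Loss

Symptoms:

Metrics were sent but do not appear in Prometheus
The Agent log contains errors

Steps:

Check the OTLP sender configuration
Read the error messages in the Agent log
Check the Remote Write configuration
Verify that Prometheus is receiving data

3. Performance Problems

Symptoms:

High OTLP receive latency
High Agent memory usage

Steps:

Check system resource usage
Adjust the gRPC/HTTP server configuration
Raise the Agent's memory limit
Tune batching on the sender side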

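Summary

The OTLP support in AI Observability Agent gives AI tool monitoring a standardized, efficient path:

Native integration: receives OpenTelemetry data directly
Multi-protocol: supports both gRPC and HTTP
Automatic conversion: turns OTLP metrics into Prometheus format
High performance: handles high-throughput workloads
Easy to integrate: works out of the box with AI tools such as Claude Code

Through OTLP, the Agent slots into an existing OpenTelemetry ecosystem and provides a unified data pipeline for AI observability.

Next Steps

AI collectors - monitoring Claude Code, OpenAI, and LiteLLM
Cost tracking - AI API cost calculation and budget management
Grafana visualization - ready-made dashboards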
