Kubernetes and Machine Learning Workloads
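🔥 Core Concepts
Kubernetes has become a well-suited platform for running machine learning workloads. It offers the following advantages:
- Resource management: allocates compute resources such as CPUs and GPUs efficiently
- Elastic scaling: adjusts resources automatically as the workload changes
- Fault tolerance: handles node failures and failed tasks
- Standardized deployment: containers keep environments consistent
- Ecosystem: a rich set of tools and plugins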
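🚀 Types of Machine Learning Workloads
1. Training workloads
- Batch training: long-running training over large datasets
- Distributed training: parallel training across multiple nodes to shorten training time
- Hyperparameter tuning: automated search for the best model hyperparameters
2. Inference workloads
- Online inference: low-latency, real-time prediction
- Batch inference: large-scale offline prediction
- Model serving: exposes the model through an API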
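🔧 GPU Resource Management
1. GPU device plugin
# Install the NVIDIA GPU device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
# Verify that the nodes advertise GPU resources (look for nvidia.com/gpu)
kubectl get nodes -o jsonpath='{.items[*].status.capacity}'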
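2. Requesting GPU resources
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never  # one-shot check; do not restart the container after it exits
  containers:
  - name: gpu-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      # For extended resources such as GPUs, requests (if set) must equal limits
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
    # tf.test.is_gpu_available() is deprecated; list the visible GPUs instead
    command: ["python", "-c", "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"]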
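The Pod manifest above can also be generated programmatically, which helps when many jobs differ only in GPU count or image. The following is a minimal sketch using only the standard library; the function name `gpu_pod_manifest` and the container naming scheme are assumptions made for illustration, not part of any Kubernetes client API.

```python
import json
from typing import List


def gpu_pod_manifest(name: str, image: str, gpus: int, command: List[str]) -> dict:
    """Build a Pod manifest (as a plain dict) that requests `gpus` NVIDIA GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be a positive integer")
    gpu_resources = {"nvidia.com/gpu": gpus}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": f"{name}-container",  # hypothetical naming scheme
                "image": image,
                "command": command,
                # GPUs cannot be overcommitted, so requests mirror limits.
                "resources": {
                    "limits": dict(gpu_resources),
                    "requests": dict(gpu_resources),
                },
            }],
        },
    }


manifest = gpu_pod_manifest(
    "gpu-pod", "tensorflow/tensorflow:latest-gpu", 1, ["nvidia-smi"]
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be piped into `kubectl apply -f -`.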
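3. GPU scheduling strategies
| Strategy | Description | Use case |
|---|---|---|
| Exclusive | Each GPU is used by only one Pod | High-performance training |
| Shared | Multiple Pods share one GPU | Lightweight inference |
| MIG | NVIDIA Multi-Instance GPU: a physical GPU is partitioned into multiple virtual GPUs | Running multiple tasks concurrently |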
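📦 Distributed Training
1. TensorFlow distributed training
apiVersion: batch/v1  # Job belongs to the batch API group, not apps/v1
kind: Job
metadata:
  name: tf-distributed-training
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: tf-worker
        image: tensorflow/tensorflow:latest-gpu
        command:
        - python
        - /app/train.py
        env:
        # Each worker needs its own TF_CONFIG with a distinct task index; this value is for worker 0.
        # The tf-worker-N host names must be resolvable inside the cluster (for example through Services).
        - name: TF_CONFIG
          value: '{"cluster":{"worker":["tf-worker-0:2222","tf-worker-1:2222"]},"task":{"type":"worker","index":0}}'
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: training-data
          mountPath: /app
      volumes:
      - name: training-data
        configMap:
          name: training-config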
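The `TF_CONFIG` value in the manifest above is specific to one worker: the cluster section is shared, but the task index differs per worker. A small sketch of how such values could be generated follows; the helper name `build_tf_config` is hypothetical, and the host names are the ones used in the manifest.

```python
import json
from typing import List


def build_tf_config(worker_hosts: List[str], index: int) -> str:
    """Return the TF_CONFIG JSON string for the worker at position `index`."""
    if not 0 <= index < len(worker_hosts):
        raise IndexError(f"worker index {index} out of range")
    return json.dumps({
        "cluster": {"worker": worker_hosts},          # identical for every worker
        "task": {"type": "worker", "index": index},   # unique per worker
    })


hosts = ["tf-worker-0:2222", "tf-worker-1:2222"]
for i in range(len(hosts)):
    print(build_tf_config(hosts, i))
```

Each printed string would go into the `TF_CONFIG` environment variable of the corresponding worker Pod.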
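2. PyTorch distributed training
apiVersion: batch/v1  # Job belongs to the batch API group, not apps/v1
kind: Job
metadata:
  name: pytorch-distributed-training
spec:
  parallelism: 2
  completions: 2
  # Indexed mode gives each Pod an integer JOB_COMPLETION_INDEX (0 or 1),
  # which is usable as the node rank; the Pod name is not a number.
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      # Assumption: a headless Service named pytorch-training selects these Pods,
      # so that Pod 0 is reachable as pytorch-distributed-training-0.pytorch-training
      subdomain: pytorch-training
      containers:
      - name: pytorch-worker
        image: pytorch/pytorch:latest
        command:
        - bash
        - -c
        - |
          torchrun --nproc_per_node=1 --nnodes=2 \
            --node_rank=${JOB_COMPLETION_INDEX} \
            --master_addr=pytorch-distributed-training-0.pytorch-training \
            --master_port=29500 /app/train.py
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: training-data
          mountPath: /app
      volumes:
      - name: training-data
        configMap:
          name: training-config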
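The launcher in the Job above hands the rendezvous settings to `/app/train.py` through environment variables; `torchrun`, for instance, sets `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` for every process it starts. The sketch below shows one way a training script might read them. The `DistEnv` class and `read_dist_env` helper are hypothetical names, and the single-process defaults are an assumption for running the script locally.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    rank: int          # global rank of this process
    local_rank: int    # rank within this node, typically the GPU index
    world_size: int    # total number of processes
    master_addr: str
    master_port: int


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read the launcher-provided settings, falling back to single-process defaults."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", "29500")),
    )


dist_env = read_dist_env()
print(dist_env)
# A training script would typically continue with
# torch.distributed.init_process_group(backend="nccl"),
# which reads the same variables when using the default env:// init method.
```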
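3. Training with Kubeflow
# Install kfctl, the installer CLI used by older Kubeflow releases
# (newer releases are installed from the kubeflow/manifests repository with kustomize)
curl -s https://raw.githubusercontent.com/kubeflow/kfctl/master/kfctl.sh | bash
# Deploy Kubeflow
kfctl apply -f kfctl_k8s_istio.yaml
# Access the Kubeflow dashboard
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80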
🌐 Model Serving
1. TensorFlow Serving
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tf-serving
  template:
    metadata:
      labels:
        app: tf-serving
    spec:
      containers:
      - name: tf-serving
        image: tensorflow/serving:latest
        ports:
        - containerPort: 8500  # gRPC
        - containerPort: 8501  # REST
        args:
        - --model_name=my_model
        - --model_base_path=/models/my_model
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: tf-serving
spec:
  selector:
    app: tf-serving
  ports:
  - port: 8500
    targetPort: 8500
    name: grpc
  - port: 8501
    targetPort: 8501
    name: rest
  type: LoadBalancer
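Once the Service is up, clients reach the model through TensorFlow Serving's REST API on port 8501: a POST to `/v1/models/<name>:predict` with an `instances` array returns a `predictions` array. The following is a minimal client sketch using only the standard library; the helper names and the example input shape are assumptions, and the real input shape depends on the model.

```python
import json
import urllib.request
from typing import Any, List


def predict_request(base_url: str, model_name: str, instances: List[Any]) -> urllib.request.Request:
    """Build the POST request for the TensorFlow Serving REST predict endpoint."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def predict(base_url: str, model_name: str, instances: List[Any], timeout: float = 5.0) -> List[Any]:
    """Send the request and return the `predictions` field of the response."""
    req = predict_request(base_url, model_name, instances)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["predictions"]
```

From inside the cluster, a call might look like `predict("http://tf-serving:8501", "my_model", [[1.0, 2.0, 3.0]])`.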
2. PyTorch Serving (TorchServe)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: torch-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: torch-serving
  template:
    metadata:
      labels:
        app: torch-serving
    spec:
      containers:
      - name: torch-serving
        image: pytorch/torchserve:latest
        ports:
        - containerPort: 8080  # inference API
        - containerPort: 8081  # management API
        - containerPort: 8082  # metrics API
        volumeMounts:
        - name: model-volume
          mountPath: /model-store
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: torch-serving
spec:
  selector:
    app: torch-serving
  ports:
  - port: 8080
    targetPort: 8080
    name: inference
  - port: 8081
    targetPort: 8081
    name: management
  type: LoadBalancer
3. Seldon Core
# Install Seldon Core
helm repo add seldon-charts https://storage.googleapis.com/seldon-charts
helm repo update
helm install seldon-core seldon-charts/seldon-core-operator --namespace seldon-system --create-namespace
# Deploy the model service
kubectl apply -f seldon-deployment.yaml
# Contents of seldon-deployment.yaml:
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: seldon-model
spec:
  name: model
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - name: classifier
          image: seldonio/mock_classifier:1.0
    graph:
      name: classifier
      type: MODEL
    name: default
    replicas: 2
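A SeldonDeployment served through the Seldon Core v1 REST protocol is typically reached at `/seldon/<namespace>/<deployment-name>/api/v1.0/predictions` behind the cluster ingress, with the input wrapped as `{"data": {"ndarray": ...}}`. The sketch below builds such a request. The ingress address, the `default` namespace, and the input values are assumptions for illustration; only the deployment name `seldon-model` comes from the manifest above.

```python
import json
import urllib.request
from typing import List


def seldon_predict_request(ingress_url: str, namespace: str, deployment: str,
                           rows: List[List[float]]) -> urllib.request.Request:
    """Build a prediction request for a SeldonDeployment (Seldon v1 REST protocol)."""
    url = (f"{ingress_url.rstrip('/')}/seldon/{namespace}/{deployment}"
           "/api/v1.0/predictions")
    body = json.dumps({"data": {"ndarray": rows}}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


# Hypothetical ingress address and namespace; adjust to the actual cluster.
req = seldon_predict_request("http://localhost:8080", "default", "seldon-model",
                             [[1.0, 2.0, 5.0]])
print(req.full_url)
```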
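🔄 Machine Learning Pipelines
1. Kubeflow Pipelines
# Install Kubeflow Pipelines (kustomize directories are applied with -k, not -f)
# PIPELINE_VERSION is an example value; pick the release you need
export PIPELINE_VERSION=1.8.5
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=$PIPELINE_VERSION"
# Access the Kubeflow Pipelines UI
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80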
2. Creating a pipeline
# Uses the KFP v1 SDK; dsl.ContainerOp is not available in the v2 SDK
from kfp import compiler, dsl

@dsl.pipeline(
    name='MNIST Training Pipeline',
    description='A pipeline to train MNIST model'
)
def mnist_pipeline():
    # Data preparation
    data_prep = dsl.ContainerOp(
        name='Data Preparation',
        image='tensorflow/tensorflow:latest',
        command=['python', '/app/data_prep.py']
    )
    # Model training; ContainerOp has no resource_requests argument,
    # so the GPU is requested with set_gpu_limit()
    training = dsl.ContainerOp(
        name='Model Training',
        image='tensorflow/tensorflow:latest-gpu',
        command=['python', '/app/train.py']
    ).set_gpu_limit(1).after(data_prep)
    # Model evaluation
    evaluation = dsl.ContainerOp(
        name='Model Evaluation',
        image='tensorflow/tensorflow:latest',
        command=['python', '/app/evaluate.py']
    ).after(training)
    # Model deployment
    deployment = dsl.ContainerOp(
        name='Model Deployment',
        image='gcr.io/kubeflow-images-public/kubectl:v1.14.0',
        command=['kubectl', 'apply', '-f', '/app/deployment.yaml']
    ).after(evaluation)

if __name__ == '__main__':
    compiler.Compiler().compile(mnist_pipeline, 'mnist-pipeline.yaml')
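The `.after()` calls in the pipeline above declare a dependency graph, and the pipeline engine runs each step only once its predecessors have finished. The following sketch illustrates that ordering with the standard library's `graphlib` (Python 3.9+). It is an illustration of the idea, not how Kubeflow Pipelines is implemented, and the step names are shortened versions of the ones above.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it must run after, mirroring the .after() calls.
dependencies = {
    "data_prep": set(),
    "training": {"data_prep"},
    "evaluation": {"training"},
    "deployment": {"evaluation"},
}

# static_order() yields every step after all of its predecessors.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['data_prep', 'training', 'evaluation', 'deployment']
```

Steps without a dependency between them (for example, two independent evaluation jobs after training) would have no fixed relative order and could run in parallel.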
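📈 Monitoring and Observability
1. Resource monitoring
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ml-workloads
  namespace: monitoring
spec:
  # Matches Services labeled app=ml-workload in this namespace;
  # add a namespaceSelector to match Services in other namespaces
  selector:
    matchLabels:
      app: ml-workload
  endpoints:
  - port: metrics
    interval: 15s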
2. Model performance monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-model-alerts
  namespace: monitoring
spec:
  groups:
  - name: ml-model
    rules:
    - alert: ModelAccuracyDrop
      expr: model_accuracy{model="my-model"} < 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Model accuracy drop"
        description: "Model {{ $labels.model }} accuracy dropped below 80%"
    - alert: ModelLatencyHigh
      expr: model_inference_latency_seconds{model="my-model"} > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Model latency high"
        description: "Model {{ $labels.model }} inference latency is above 500ms"
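The alert rules above only fire if the model service actually exports `model_accuracy` and `model_inference_latency_seconds` with a `model` label. The sketch below shows what such a `/metrics` endpoint could look like using only the standard library and the Prometheus text exposition format. The helper names, the port, and the hard-coded values are assumptions for illustration; a real service would more likely use a Prometheus client library and report measured values.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(model: str, accuracy: float, latency_seconds: float) -> str:
    """Render the two gauges in the Prometheus text exposition format."""
    labels = f'{{model="{model}"}}'
    lines = [
        "# TYPE model_accuracy gauge",
        f"model_accuracy{labels} {accuracy}",
        "# TYPE model_inference_latency_seconds gauge",
        f"model_inference_latency_seconds{labels} {latency_seconds}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Placeholder values; a real service would report measured numbers.
        body = render_metrics("my-model", 0.93, 0.12).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve /metrics until interrupted (hypothetical port)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

For the ServiceMonitor to scrape it, the Service in front of this process needs a port named `metrics` and the `app: ml-workload` label.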
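🔧 Best Practices
1. Data management
- Data partitioning: split large datasets into partitions so they can be processed in parallel
- Data caching: cache training data on persistent volumes to cut data loading time
- Data versioning: version datasets with Git LFS or DVC
2. Model management
- Model versioning: track model versions with MLflow or Weights & Biases
- Model registry: set up a model registry to manage the model lifecycle
- Model monitoring: monitor model performance and drift
3. Resource optimization
- GPU selection: choose a GPU that fits the model size and training requirements
- Batch size: tune the batch size to make full use of GPU memory
- Mixed-precision training: use FP16 to speed up training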
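The GPU selection and precision choices above come down to simple arithmetic on parameter storage: FP32 uses 4 bytes per parameter and FP16 uses 2. The sketch below estimates the memory needed for the weights alone; the function name and the one-billion-parameter example are assumptions, and gradients, optimizer state, and activations add to the total in practice.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory for the model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


params = 1_000_000_000  # hypothetical 1B-parameter model
for dtype in ("fp32", "fp16"):
    print(f"{dtype}: {weight_memory_gib(params, dtype):.2f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is part of why FP16 leaves more GPU memory available for larger batches.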
🚨 Troubleshooting
1. GPU issues
# Check GPU device status on the nodes
kubectl describe node | grep nvidia.com/gpu
# Check the GPU allocation of a Pod
kubectl get pod gpu-pod -o jsonpath='{.spec.containers[*].resources.limits}'
# Check GPU usage inside the Pod
kubectl exec -it gpu-pod -- nvidia-smi
2. Training failures
# View the training Pod logs
kubectl logs tf-distributed-training-xxx
# Check the training Job status
kubectl get job tf-distributed-training
# View events
kubectl describe job tf-distributed-training
3. Model serving issues
# Check the status of the model serving Pods
kubectl get pods -l app=tf-serving
# View the service logs
kubectl logs -l app=tf-serving
# Test the model service
curl -X POST http://tf-serving.default.svc.cluster.local:8501/v1/models/my_model:predict -d '{"instances": [[1.0, 2.0, 3.0]]}'
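Summary
Kubernetes provides a capable platform for machine learning workloads and supports the full lifecycle, from data preparation and model training to model deployment. With sensible configuration and management of Kubernetes resources, you can build efficient, scalable machine learning systems:
- GPU resource management: make full use of GPUs to accelerate training and inference
- Distributed training: speed up training of large models
- Model serving: provide high-performance model inference services
- Machine learning pipelines: automate machine learning workflows
- Monitoring and observability: monitor model performance and resource usage in real time
In production, choose tools and configurations that match your actual requirements so that machine learning workloads run efficiently.
💡 Tip: Machine learning workloads are usually resource-intensive. Use Kubernetes resource quotas and limits to isolate workloads from one another and share resources fairly.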
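As one way to apply the tip above, a ResourceQuota can cap the number of GPUs a namespace may request. For extended resources such as `nvidia.com/gpu`, quota keys use the `requests.` prefix. The sketch below generates such a manifest; the function name, the quota name, and the `team-a` namespace are hypothetical.

```python
import json


def gpu_quota_manifest(namespace: str, max_gpus: int, name: str = "gpu-quota") -> dict:
    """Build a ResourceQuota that limits the total GPUs requested in a namespace."""
    if max_gpus < 0:
        raise ValueError("max_gpus must not be negative")
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        # Extended resources are limited through the requests.<resource> key.
        "spec": {"hard": {"requests.nvidia.com/gpu": str(max_gpus)}},
    }


quota = gpu_quota_manifest("team-a", 4)  # hypothetical namespace and limit
print(json.dumps(quota, indent=2))
```

The printed JSON can be piped into `kubectl apply -f -` in the same way as a YAML manifest.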