Kubernetes and Machine Learning Workloads: A Complete Guide from Distributed Training to Model Serving
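🔥 Kicking Things Off
Fellow engineers, today we're talking about Kubernetes and machine learning workloads. Machine learning is not just a model running on a single machine: in the cloud-native era, Kubernetes has become one of the leading platforms for running ML workloads. In this post, susu walks you through ML workloads on Kubernetes end to end, from distributed training to model serving and from GPU scheduling to resource management.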
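📋 Core Content
1. Advantages of Running Machine Learning on Kubernetes
- Elastic scaling: resources adjust automatically to the workload
- Resource isolation: workloads are kept apart from one another
- Standardized deployment: containers keep environments consistent
- High availability: multiple replicas and failover are supported
- Ecosystem integration: works with CI/CD, monitoring, and other tooling
2. Distributed Training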
2.1 TensorFlow on Kubernetes
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-distributed-training
spec:
  tfReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.8.0-gpu
              command:
                - python
                - /app/train.py
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.8.0-gpu
              command:
                - python
                - /app/train.py
              resources:
                limits:
                  nvidia.com/gpu: 1
    PS:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.8.0
              command:
                - python
                - /app/train.py
              resources:
                limits:
                  cpu: 2
                  memory: 4Gi
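All three replica types above run the same /app/train.py, so the script has to work out its own role at startup. The training operator injects a TF_CONFIG environment variable (a JSON document with `cluster` and `task` sections) into each replica. The following is a minimal sketch of how a script might read it; `parse_tf_config` is a hypothetical helper name, not part of TensorFlow or Kubeflow, and the fallback to a single local worker is an assumption made for local testing.

```python
import json
import os


def parse_tf_config(environ=None):
    """Return (task_type, task_index, cluster) parsed from TF_CONFIG.

    Falls back to a single local worker when TF_CONFIG is absent
    (an assumption for local runs, not operator behavior).
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("TF_CONFIG")
    if not raw:
        return "worker", 0, {}
    config = json.loads(raw)
    task = config.get("task", {})
    task_type = task.get("type", "worker")
    task_index = int(task.get("index", 0))
    return task_type, task_index, config.get("cluster", {})


def describe_role(environ=None):
    """Build a one-line summary that is handy to log at startup."""
    task_type, task_index, cluster = parse_tf_config(environ)
    sizes = {role: len(hosts) for role, hosts in sorted(cluster.items())}
    return f"role={task_type} index={task_index} cluster={sizes}"
```

A training script could log `describe_role()` first and then branch on the returned task type, for example starting a parameter server loop when the type is `ps`.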
2.2 PyTorch on Kubernetes
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
              command:
                - python
                - /app/train.py
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
              command:
                - python
                - /app/train.py
              resources:
                limits:
                  nvidia.com/gpu: 1
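With one Master and three Workers, the job above forms a four-process group. The training operator passes the rendezvous details to each replica through the MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK environment variables. Here is a small sketch of how train.py might collect them before initializing the process group; `DistEnv` and `read_dist_env` are hypothetical names, and the defaults are assumptions that let the script also run as a single local process.

```python
import os
from dataclasses import dataclass


@dataclass
class DistEnv:
    master_addr: str
    master_port: int
    world_size: int
    rank: int

    @property
    def is_master(self):
        # Rank 0 is conventionally the process that saves checkpoints and logs.
        return self.rank == 0


def read_dist_env(environ=None):
    """Read the rendezvous settings, defaulting to a single local process."""
    environ = os.environ if environ is None else environ
    return DistEnv(
        master_addr=environ.get("MASTER_ADDR", "localhost"),
        master_port=int(environ.get("MASTER_PORT", "23456")),
        world_size=int(environ.get("WORLD_SIZE", "1")),
        rank=int(environ.get("RANK", "0")),
    )
```

The values map directly onto the arguments of `torch.distributed.init_process_group` when the `env://` init method is used, so a script normally only needs them for logging or for rank-dependent logic such as checkpointing.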
2.3 Installing Kubeflow
# Install Kubeflow with kfctl (this flow applies to Kubeflow v1.2; later releases
# dropped kfctl and are installed from the kubeflow/manifests repository with kustomize)
curl -s https://raw.githubusercontent.com/kubeflow/kfctl/v1.2.0/kfctl.sh | bash
# Configure Kubeflow
export KF_NAME=my-kubeflow
export BASE_DIR=/home/$USER/kubeflow
export KF_DIR=${BASE_DIR}/${KF_NAME}
mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl apply -V -f https://github.com/kubeflow/manifests/raw/v1.2.0/kfdef/kfctl_k8s_istio.v1.2.0.yaml
# Verify the installation
kubectl get pods -n kubeflow
3. Model Serving
3.1 TensorFlow Serving
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tensorflow-serving
  template:
    metadata:
      labels:
        app: tensorflow-serving
    spec:
      containers:
        - name: tensorflow-serving
          image: tensorflow/serving:2.8.0
          ports:
            - containerPort: 8500  # gRPC
            - containerPort: 8501  # REST
          volumeMounts:
            - name: model-volume
              mountPath: /models
          args:
            - --model_name=my-model
            - --model_base_path=/models/my-model
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-serving
  labels:
    app: tensorflow-serving  # lets label-based selectors such as a ServiceMonitor find this Service
spec:
  selector:
    app: tensorflow-serving
  ports:
    # A Service that exposes more than one port must name each port
    - name: grpc
      port: 8500
      targetPort: 8500
    - name: http
      port: 8501
      targetPort: 8501
  type: LoadBalancer
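Once the Service is up, clients can reach the REST API on port 8501. The sketch below builds a prediction request for the `my-model` model using only the standard library. `build_predict_request` and `predict` are hypothetical helper names, the host is whatever address the Service is reachable at, and the shape of `instances` depends on the model's signature, which this post does not specify.

```python
import json
import urllib.request


def build_predict_request(host, instances, model_name="my-model", port=8501):
    """Build a POST request for TensorFlow Serving's REST predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(host, instances, **kwargs):
    """Send the request and return the 'predictions' field of the reply."""
    request = build_predict_request(host, instances, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())["predictions"]
```

Inside the cluster, the host would typically be the Service name (`tensorflow-serving`); from outside, it would be the external address assigned to the LoadBalancer.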
3.2 TorchServe
apiVersion: apps/v1
kind: Deployment
metadata:
  name: torchserve
spec:
  replicas: 3
  selector:
    matchLabels:
      app: torchserve
  template:
    metadata:
      labels:
        app: torchserve
    spec:
      containers:
        - name: torchserve
          image: pytorch/torchserve:0.4.0
          ports:
            - containerPort: 8080  # inference API
            - containerPort: 8081  # management API
            - containerPort: 8082  # metrics API
          volumeMounts:
            - name: model-volume
              mountPath: /model-store
          env:
            - name: MODEL_NAME
              value: my-model
            - name: MODEL_STORE
              value: /model-store
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: torchserve
spec:
  selector:
    app: torchserve
  ports:
    # A Service that exposes more than one port must name each port
    - name: inference
      port: 8080
      targetPort: 8080
    - name: management
      port: 8081
      targetPort: 8081
    - name: metrics
      port: 8082
      targetPort: 8082
  type: LoadBalancer
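The three ports map to TorchServe's three APIs: inference on 8080, management on 8081, and metrics on 8082. As a rough sketch, the helper below assembles the URLs a client or a health check would most often need. `torchserve_url` and `common_urls` are hypothetical names; the host is whatever address the Service is reachable at.

```python
TORCHSERVE_PORTS = {"inference": 8080, "management": 8081, "metrics": 8082}


def torchserve_url(host, api, path):
    """Join host, the port for the given API, and a path into a URL."""
    port = TORCHSERVE_PORTS[api]
    return f"http://{host}:{port}/{path.lstrip('/')}"


def common_urls(host, model_name="my-model"):
    """Return the endpoints used most often when operating a deployment."""
    return {
        "health": torchserve_url(host, "inference", "ping"),
        "predict": torchserve_url(host, "inference", f"predictions/{model_name}"),
        "list_models": torchserve_url(host, "management", "models"),
        "metrics": torchserve_url(host, "metrics", "metrics"),
    }
```

Because the management API can register and unregister models, it is usually worth considering whether ports 8081 and 8082 should be exposed through a LoadBalancer at all, or kept on a cluster-internal Service.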
3.3 Seldon Core
# Install Seldon Core
helm repo add seldon https://charts.seldon.io
helm repo update
helm install seldon-core seldon/seldon-core-operator --namespace seldon-system --create-namespace
# Deploy a model
kubectl apply -f - <<EOF
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: model-deployment
  namespace: default
spec:
  predictors:
    - name: model
      replicas: 3
      graph:
        name: model
        implementation: TENSORFLOW_SERVER
        modelUri: gs://my-bucket/models/my-model
      endpoints:
        - type: REST
          port: 8501
        - type: GRPC
          port: 8500
EOF
4. GPU Management
4.1 Configuring GPU Nodes
# Install the NVIDIA driver (the installer needs root privileges)
curl -O https://us.download.nvidia.com/XFree86/Linux-x86_64/470.57.02/NVIDIA-Linux-x86_64-470.57.02.run
chmod +x NVIDIA-Linux-x86_64-470.57.02.run
sudo ./NVIDIA-Linux-x86_64-470.57.02.run --silent
# Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
# Verify the GPU
nvidia-smi
# Label the node
kubectl label nodes <node-name> hardware-type=NVIDIA-GPU
4.2 Scheduling GPU Resources
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  # The command runs once and exits, so the container should not be restarted
  restartPolicy: Never
  containers:
    - name: gpu-container
      image: tensorflow/tensorflow:2.8.0-gpu
      resources:
        limits:
          nvidia.com/gpu: 1
      command:
        - bash
        - -c
        - |
          nvidia-smi
          python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
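The Pod above depends on the `nvidia.com/gpu` limit alone to land on a GPU node. As an illustration, the `hardware-type=NVIDIA-GPU` label from section 4.1 could also be applied as a nodeSelector, which helps when a cluster mixes several GPU node pools. The sketch below generates such a manifest as a plain dictionary; `gpu_pod_manifest` is a hypothetical helper, and emitting JSON is a convenience choice since `kubectl apply -f` accepts JSON as well as YAML.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, node_label=("hardware-type", "NVIDIA-GPU")):
    """Build a Pod manifest that requests whole GPUs on labelled nodes."""
    # Extended resources such as nvidia.com/gpu are requested in whole units only.
    if type(gpus) is not int or gpus < 1:
        raise ValueError("gpus must be a positive integer")
    label_key, label_value = node_label
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "nodeSelector": {label_key: label_value},
            "containers": [
                {
                    "name": "gpu-container",
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


def to_json(manifest):
    """Serialize the manifest so it can be piped to kubectl apply -f -."""
    return json.dumps(manifest, indent=2)
```

Rejecting fractional values early gives a clearer error than the one the API server would return for a request such as 0.5 GPUs.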
4.3 GPU Monitoring
# Install the NVIDIA GPU Operator, which provides GPU monitoring
helm repo add gpu-operator https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator gpu-operator/gpu-operator --namespace gpu-operator --create-namespace
# Check the operator's pods and logs
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator deployment/gpu-operator
5. Data Management
5.1 Data Storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-storage
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-preprocessing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: data-preprocessing
  template:
    metadata:
      labels:
        app: data-preprocessing
    spec:
      containers:
        - name: data-preprocessing
          image: my-data-preprocessing:latest
          volumeMounts:
            - name: data-volume
              mountPath: /data
      volumes:
        - name: data-volume
          persistentVolumeClaim:
            claimName: data-pvc
5.2 Data Preprocessing
# Create a data preprocessing Job
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: data-preprocessing
  namespace: default
spec:
  template:
    spec:
      containers:
        - name: data-preprocessing
          image: my-data-preprocessing:latest
          command:
            - python
            - /app/preprocess.py
          volumeMounts:
            - name: data-volume
              mountPath: /data
      volumes:
        - name: data-volume
          persistentVolumeClaim:
            claimName: data-pvc
      restartPolicy: Never
  backoffLimit: 4
EOF
# Check the Job status
kubectl get job data-preprocessing
kubectl logs job/data-preprocessing
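The Job runs /app/preprocess.py against the shared /data volume, but the script itself is not shown in this post. Purely as an illustration of the kind of step such a script might contain, the sketch below copies a CSV file while dropping incomplete rows. The function name, the CSV format, and the cleaning rule are all assumptions rather than the actual contents of the image.

```python
import csv
from pathlib import Path


def preprocess(src, dst):
    """Copy rows from src to dst, dropping rows with missing fields.

    Returns the number of data rows kept (the header is not counted).
    """
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    kept = 0
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:
            return 0  # empty input file, nothing to write
        writer.writerow(header)
        for row in reader:
            # Keep a row only if it has every column and no blank values.
            if len(row) == len(header) and all(field.strip() for field in row):
                writer.writerow(row)
                kept += 1
    return kept
```

Inside the Job, a call such as `preprocess("/data/raw/train.csv", "/data/processed/train.csv")` would read from and write to the mounted PVC; both paths are hypothetical. Printing the returned count makes the result visible through `kubectl logs`.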
6. Model Training Best Practices
6.1 Hyperparameter Tuning
# Install Katib (the standalone installation is a kustomize directory, so use -k)
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installations/katib-standalone?ref=master"
# Create a hyperparameter tuning experiment
# (the heredoc delimiter is quoted so the shell leaves ${trialParameters...} untouched)
kubectl apply -f - <<'EOF'
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: hyperparameter-tuning
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
    additionalMetricNames:
      - loss
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "32"
        max: "256"
  # v1beta1 uses trialSpec plus trialParameters; goTemplate/rawTemplate belonged to v1alpha3
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: learningRate
        description: Learning rate for the trial
        reference: learning_rate
      - name: batchSize
        description: Batch size for the trial
        reference: batch_size
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: my-training:latest
                command:
                  - python
                  - /app/train.py
                  - --learning_rate=${trialParameters.learningRate}
                  - --batch_size=${trialParameters.batchSize}
            restartPolicy: Never
EOF
# Check the experiment status
kubectl get experiment -n kubeflow
kubectl describe experiment hyperparameter-tuning -n kubeflow
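For the experiment to make progress, each trial has to accept the sampled hyperparameters and report `accuracy` (and optionally `loss`) back to Katib. With the default stdout metrics collector, that means printing lines of the form `name=value`. The sketch below shows the argument handling and reporting half of a hypothetical /app/train.py; the training loop is omitted, and `parse_args`, `format_metric`, and `report` are illustrative names.

```python
import argparse


def parse_args(argv=None):
    """Parse the two hyperparameters passed by the trial command line."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--batch_size", type=int, default=64)
    return parser.parse_args(argv)


def format_metric(name, value):
    """Format one metric as a name=value line for the stdout collector."""
    return f"{name}={value:.4f}"


def report(metrics):
    """Print every metric on its own line and flush so the sidecar sees it."""
    for name, value in metrics.items():
        print(format_metric(name, value), flush=True)
```

At the end of each epoch the script would call something like `report({"accuracy": acc, "loss": loss})`. The metric names must match `objectiveMetricName` and `additionalMetricNames` in the Experiment.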
6.2 Model Version Management
# Install MLflow
helm repo add mlflow https://mlflow.github.io/helm-charts
helm repo update
helm install mlflow mlflow/mlflow --namespace mlflow --create-namespace
# Access the MLflow UI
kubectl port-forward svc/mlflow 5000:5000 -n mlflow
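7. Monitoring and Observability
7.1 Model Performance Monitoring
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: model-serving
  namespace: monitoring
spec:
  # The serving Service is assumed to live in the default namespace, not in monitoring
  namespaceSelector:
    matchNames:
      - default
  # matchLabels is evaluated against the Service's labels, not the Pod's
  selector:
    matchLabels:
      app: tensorflow-serving
  endpoints:
    # `port` expects a Service port name; a numeric container port goes in targetPort
    - targetPort: 8501
      interval: 15s
      # TensorFlow Serving exposes Prometheus metrics only when started with
      # --monitoring_config_file, and this path must match the one set in that file
      path: /v1/models/my-model/metrics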
7.2 Training Job Monitoring
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: training-job
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: training-job
  endpoints:
    - port: metrics
      interval: 15s
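This ServiceMonitor scrapes a Service port named `metrics`, so the training job has to serve metrics in the Prometheus text format on some port behind that Service. Neither the Service nor the exporter is shown in this post. As a standard-library sketch of what the exporter side could look like, the code below renders gauges and serves them at /metrics. The metric names, the port number 9090, and all function names are assumptions; in practice, the `prometheus_client` package is the more common choice.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Values the training loop would update as it runs (hypothetical metric names).
METRICS = {"training_epoch": 0.0, "training_loss": 0.0}


def render_metrics(metrics):
    """Render a dict of gauges in the Prometheus text exposition format."""
    lines = []
    for name in sorted(metrics):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(metrics[name])}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9090):
    """Block forever serving /metrics; run this in a background thread."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

A training script could start `serve` in a daemon thread and update `METRICS` after each epoch. The Pod would also need a Service labelled `app: training-job` with a port named `metrics` pointing at the exporter port.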
8. Security Best Practices
8.1 Access Control
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-operator
  namespace: kubeflow
rules:
  - apiGroups: ["kubeflow.org"]
    resources: ["tfjobs", "pytorchjobs"]
    verbs: ["get", "list", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-operator-binding
  namespace: kubeflow
subjects:
  - kind: ServiceAccount
    name: ml-operator
    namespace: kubeflow
roleRef:
  kind: Role
  name: ml-operator
  apiGroup: rbac.authorization.k8s.io
8.2 Data Security
apiVersion: v1
kind: Secret
metadata:
  name: data-credentials
type: Opaque
data:
  access_key: <base64-encoded-access-key>
  secret_key: <base64-encoded-secret-key>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-processing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: data-processing
  template:
    metadata:
      labels:
        app: data-processing
    spec:
      containers:
        - name: data-processing
          image: my-data-processing:latest
          env:
            - name: ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: data-credentials
                  key: access_key
            - name: SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: data-credentials
                  key: secret_key
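The values under `data` in a Secret must be base64-encoded, which is what the two placeholders in the manifest stand for. A small sketch of producing them is shown below; `encode_secret_data` is a hypothetical helper name. Keep in mind that base64 is an encoding, not encryption, so the resulting manifest is as sensitive as the plaintext.

```python
import base64


def encode_secret_data(values):
    """Base64-encode each value so it can be placed under a Secret's data field."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }


def decode_secret_data(data):
    """Reverse of encode_secret_data, useful for checking what a Secret holds."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in data.items()
    }
```

For example, `encode_secret_data({"access_key": "admin"})` returns `{"access_key": "YWRtaW4="}`. An alternative is to put plaintext under `stringData` and let the API server do the encoding.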
9. Advanced Configuration
9.1 Custom Resource Definitions
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: mljobs.example.com
spec:
  group: example.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                modelName:
                  type: string
                trainingConfig:
                  type: object
                  properties:
                    epochs:
                      type: integer
                    batchSize:
                      type: integer
                servingConfig:
                  type: object
                  properties:
                    replicas:
                      type: integer
  scope: Namespaced
  names:
    plural: mljobs
    singular: mljob
    kind: MLJob
    shortNames:
      - mlj
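Once this CRD is installed, the API server accepts `MLJob` objects in the `example.com/v1` group. To show what an instance looks like, the sketch below builds one as a dictionary and checks its `spec` against a hand-written mirror of the schema above. The helper names and the mirrored `SCHEMA` table are illustrative assumptions; the API server remains the authority on validation.

```python
import json

# Hand-written mirror of the CRD's openAPIV3Schema for spec (kept in sync manually).
SCHEMA = {
    "modelName": str,
    "trainingConfig": {"epochs": int, "batchSize": int},
    "servingConfig": {"replicas": int},
}


def validate(spec, schema=SCHEMA, path="spec"):
    """Return a list of type errors for the fields this sketch knows about."""
    errors = []
    for key, value in spec.items():
        expected = schema.get(key)
        if expected is None:
            continue  # unknown field: left for the API server to handle
        if isinstance(expected, dict):
            if isinstance(value, dict):
                errors += validate(value, expected, f"{path}.{key}")
            else:
                errors.append(f"{path}.{key}: expected object")
        elif isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{path}.{key}: expected {expected.__name__}")
    return errors


def mljob(name, spec):
    """Build an MLJob manifest, raising ValueError if the spec has type errors."""
    errors = validate(spec)
    if errors:
        raise ValueError("; ".join(errors))
    return {
        "apiVersion": "example.com/v1",
        "kind": "MLJob",
        "metadata": {"name": name},
        "spec": spec,
    }
```

Serializing the result with `json.dumps` gives a document that can be piped to `kubectl apply -f -`, after which `kubectl get mlj` lists it through the short name defined in the CRD.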
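9.2 Operator Development
// main.go
// Written against controller-runtime v0.16 or later. The controllers and mlv1
// packages belong to the example project and are not shown here.
package main

import (
	"flag"
	"fmt"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"

	"github.com/example/ml-operator/controllers"
	mlv1 "github.com/example/ml-operator/pkg/apis/ml/v1"
)

func main() {
	var namespace string
	flag.StringVar(&namespace, "namespace", "", "Namespace to watch for MLJob resources")
	flag.Parse()

	opts := ctrl.Options{
		Metrics: metricsserver.Options{BindAddress: ":8080"},
	}
	// An empty namespace means the manager watches all namespaces.
	if namespace != "" {
		opts.Cache = cache.Options{
			DefaultNamespaces: map[string]cache.Config{namespace: {}},
		}
	}

	// The manager needs a REST config as its first argument; GetConfigOrDie
	// loads it from the kubeconfig or the in-cluster service account.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), opts)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error creating manager: %v\n", err)
		os.Exit(1)
	}

	// Register the MLJob types so the client and cache can decode them.
	if err := mlv1.AddToScheme(mgr.GetScheme()); err != nil {
		fmt.Fprintf(os.Stderr, "Error adding MLJob scheme: %v\n", err)
		os.Exit(1)
	}

	if err := controllers.AddMLJobController(mgr); err != nil {
		fmt.Fprintf(os.Stderr, "Error adding MLJob controller: %v\n", err)
		os.Exit(1)
	}

	// SetupSignalHandler returns a context that is cancelled on SIGINT or SIGTERM.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		fmt.Fprintf(os.Stderr, "Error starting manager: %v\n", err)
		os.Exit(1)
	}
}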
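10. Tools and Ecosystem
- Kubeflow: machine learning workflow platform
  - Provides end-to-end machine learning workflows
  - Supports distributed training and model serving
  - Integrates with multiple machine learning frameworks
- Seldon Core: model deployment platform
  - Supports multiple model formats
  - Provides autoscaling and load balancing
  - Integrates monitoring and logging
- MLflow: machine learning lifecycle management
  - Experiment tracking
  - Model versioning
  - Model deployment
- Katib: hyperparameter tuning
  - Supports multiple tuning algorithms
  - Runs trials in parallel
  - Integrates with Kubeflow
- NVIDIA GPU Operator: GPU management
  - Installs GPU drivers automatically
  - Monitors GPU resources
  - Integrates with Kubernetes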
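🛠️ Best Practices
- Resource management:
  - Set sensible GPU resource limits
  - Use node affinity to schedule workloads onto suitable nodes
  - Monitor resource usage and adjust promptly
- Data management:
  - Keep training data and models on persistent storage
  - Build a data preprocessing pipeline
  - Consider object storage such as S3 or GCS
- Model training:
  - Use distributed training to speed up large models
  - Automate hyperparameter tuning
  - Record the training process and results
- Model serving:
  - Choose a serving framework that fits the model
  - Set up autoscaling
  - Monitor model performance
- Security configuration:
  - Restrict container privileges
  - Manage sensitive information
  - Enforce access control
- Monitoring and observability:
  - Monitor training jobs and model services
  - Set up alerts
  - Analyze model performance
- Automation:
  - Build CI/CD pipelines
  - Automate model deployment
  - Add integration tests
- Scalability:
  - Design an architecture that can scale
  - Consider multi-cluster deployments
  - Implement failover
- Documentation and training:
  - Document the machine learning workflows
  - Train team members
  - Maintain a best-practices guide
- Continuous improvement:
  - Evaluate model performance
  - Optimize resource usage
  - Learn and apply new machine learning techniques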
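📊 Summary
Kubernetes has become a strong platform for running machine learning workloads. After working through this post, you should have a handle on:
- Running distributed training on Kubernetes
- Deploying and managing model services
- Managing and scheduling GPU resources
- Data management and preprocessing
- Model training and hyperparameter tuning
- Monitoring and observability
- Security configuration
- Tools and ecosystem
Remember that running machine learning workloads means weighing business needs against technical constraints and picking the right tools and configuration, so that training and serving stay efficient and reliable.
susu's ramblings:
- GPUs are expensive, so allocate and use them carefully
- Data management is the foundation of machine learning, so take data quality seriously
- Serving performance directly shapes the user experience, so optimize response times
- Monitoring is key to keeping the system stable, so set sensible alerts
- Continuous integration and deployment raise development efficiency, so build automated workflows
- Documentation matters, so record the whole training and deployment process
- Teamwork is key to success, so encourage knowledge sharing
Found this useful? Leave a like before you go! See you next time~ 🔥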