CloudAI Fusion

m0_74823507

296人浏览 · 2026-03-24 10:36:48

m0_74823507 · 2026-03-24 10:36:48 发布

CloudAI Fusion

Cloud-Native AI Unified Management Platform

CloudAI Fusion is an open-source platform that unifies cloud-native infrastructure management with AI-powered resource scheduling. It solves three core enterprise pain points:

Cloud-native deployment complexity — 68% of enterprises struggle with cluster management
Multi-cloud security fragmentation — 86% of organizations use multi-cloud but face siloed security
AI resource waste — GPU utilization under 30% in traditional deployment models

Key Features

Feature	Description
Multi-Cloud Management	Unified API for Alibaba Cloud ACK, AWS EKS, Azure AKS, GCP GKE, Huawei CCE, Tencent TKE
GPU Topology-Aware Scheduling	NVLink-aware placement, fine-grained GPU sharing, preemption, RL-based optimization
4 AI Agents	Scheduling optimizer, security monitor, cost analyzer, operations automator
LLM Integration	OpenAI GPT-4o / DashScope Qwen-Max / Ollama / vLLM with graceful fallback
AI Chat Assistant	Conversational operations assistant with LLM-powered incident analysis
eBPF Service Mesh	Sidecarless networking via Istio Ambient / Cilium with <1% overhead
Wasm Runtime	Millisecond cold-start functions for serverless & edge workloads
Edge-Cloud Architecture	Three-tier (Cloud → Edge → Terminal) with 50B-parameter edge model support
Security & Compliance	Pod security, network policies, CIS benchmarks, vulnerability scanning, threat detection
Full Observability	Prometheus metrics, OpenTelemetry tracing, Grafana dashboards, intelligent alerting
Feature Toggles	Runtime feature flags with profiles (minimal / standard / full) for modular deployment
Docker Optimized	Multi-stage builds, distroless images for Go services, GPU image reduced from 5-8GB to ~2-3GB

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         CloudAI Fusion                              │
├─────────────┬──────────────┬─────────────┬─────────────────────────┤
│  API Server │  Scheduler   │    Agent    │      AI Engine          │
│   (Go/Gin)  │ (GPU-aware)  │ (DaemonSet) │   (Python/FastAPI)     │
├─────────────┴──────────────┴─────────────┴─────────────────────────┤
│  Auth │ Cloud │ Cluster │ Security │ Monitor │ Mesh │ Wasm │ Edge  │
├───────┴───────┴─────────┴──────────┴─────────┴──────┴──────┴───────┤
│            PostgreSQL  │  Redis  │  Kafka  │  NATS  │  Prometheus     │
└─────────────────────────────────────────────────────────────────────┘

Quick Start

Prerequisites

Docker Desktop (with Docker Compose)

Run (one command)

Windows:

cd cloudai-fusion
start.bat

Linux / macOS:

cd cloudai-fusion
chmod +x start.sh && ./start.sh

This starts all services (API Server, Scheduler, Agent, AI Engine, PostgreSQL, Redis, Kafka, NATS, Prometheus, Grafana, Jaeger).

Verify

# Health check
curl http://localhost:8080/healthz

# Login (get JWT token)
curl -X POST http://localhost:8080/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"admin123"}'

# List clusters (use token from login response)
curl http://localhost:8080/api/v1/clusters \
  -H "Authorization: Bearer <token>"

Service Endpoints

Service	URL
API Server	http://localhost:8080
AI Engine	http://localhost:8090
Prometheus	http://localhost:9090
Grafana	http://localhost:3000 (admin / cloudai)
Jaeger	http://localhost:16686

Tech Stack

Layer	Technology
Backend	Go 1.25, Gin, Cobra, Viper, GORM
AI Engine	Python 3.11, FastAPI, PyTorch, TensorFlow, NumPy, scikit-learn, OpenAI SDK
Container	Kubernetes 1.30+, Istio Ambient, Cilium eBPF
Database	PostgreSQL 16, Redis 7
Messaging	Apache Kafka, NATS
Monitoring	Prometheus, Grafana, OpenTelemetry, Jaeger
CI/CD	GitHub Actions, Docker, Helm 3

Project Structure

cloudai-fusion/
├── cmd/                          # Service entry points
│   ├── apiserver/                # API Server
│   ├── scheduler/                # GPU Scheduler
│   ├── agent/                    # Node Agent
│   └── healthcheck/              # Lightweight healthcheck binary (for distroless)
├── pkg/                          # Go packages (48 packages, 47/47 tests pass)
│   ├── api/                      # HTTP routes, handlers, debug endpoints (6-layer auth)
│   ├── auth/                     # JWT + RBAC + OIDC federation (4 roles, 20+ perms)
│   ├── cache/                    # Redis cache + distributed lock + PubSub
│   ├── cloud/                    # Multi-cloud providers (6 clouds)
│   ├── cluster/                  # Kubernetes cluster management
│   ├── config/                   # Unified configuration + JWT secret validation
│   ├── edge/                     # Edge-cloud architecture + 50B model quantization
│   ├── feature/                  # Runtime feature toggles (minimal/standard/full)
│   ├── mesh/                     # eBPF service mesh
│   ├── monitor/                  # Prometheus metrics & alerting
│   ├── scheduler/                # GPU scheduling + queue snapshot persistence
│   ├── security/                 # Policy, scanning, compliance, threats
│   ├── store/                    # Database layer (GORM)
│   └── wasm/                     # WebAssembly container runtime
├── ai/                           # Python AI components
│   ├── agents/                   # Multi-agent FastAPI server + LLM client
│   ├── anomaly/                  # Anomaly detection engine
│   └── scheduler/                # RL scheduling optimizer
├── deploy/helm/                  # Helm chart (7 templates)
├── docker/                       # Dockerfiles (5 services, distroless + multi-stage)
├── monitoring/                   # Prometheus + Grafana configs
├── scripts/                      # env-generate, diagnose, setup-gpu, deploy-cloud
├── api/                          # OpenAPI 3.1 specification
├── .devcontainer/                # Dev Container (Go + Python + full toolchain)
├── docker-compose.yml            # Full-stack local deployment (profiles support)
├── docker-compose.gpu.yml        # GPU overlay (NVIDIA runtime)
├── Makefile                      # Build, test, deploy, diagnose commands
└── start.bat / start.sh          # One-click start (--gpu, --minimal flags)

Development

# First-time setup (generate .env with secure secrets)
make setup

# Build all Go binaries
make build

# Run tests
make test

# Run tests with coverage report
make coverage-report

# Run linter
make lint

# Build Docker images (optimized: distroless for Go, multi-stage for AI)
make docker-build

# Deploy to Kubernetes
make helm-install

# Start with GPU support
./start.sh --gpu          # Linux / macOS
start.bat --gpu           # Windows (WSL2 + NVIDIA)

# Start minimal mode (core services only)
./start.sh --minimal

# Health diagnostics
make diagnose

# Feature flag management
make features-list

API Overview

Full specification: api/openapi.yaml

Endpoint Group	Description
`POST /api/v1/auth/login`	JWT authentication
`GET /api/v1/clusters`	List managed K8s clusters
`GET /api/v1/providers`	List cloud providers
`POST /api/v1/workloads`	Submit AI workload
`GET /api/v1/security/policies`	Security policies
`GET /api/v1/monitoring/alerts/events`	Alert events
`GET /api/v1/cost/summary`	Cost analysis
`GET /api/v1/mesh/status`	Service mesh status
`POST /api/v1/wasm/deploy`	Deploy Wasm function
`GET /api/v1/edge/topology`	Edge-cloud topology
AI Engine (port 8090)
`POST /api/v1/scheduling/optimize`	LLM-enhanced GPU scheduling
`POST /api/v1/anomaly/detect`	Anomaly detection + threat analysis
`POST /api/v1/cost/analyze`	Cost optimization with LLM insights
`GET /api/v1/insights`	Dynamic AI-powered insights
`GET /api/v1/models/status`	Honest model & LLM status
`POST /api/v1/chat`	Conversational AI assistant
`POST /api/v1/ops/incident`	Incident root cause analysis
`POST /api/v1/ops/scaling`	Predictive scaling
`GET /api/v1/ops/history`	Incident history
Debug (requires `CLOUDAI_DEBUG_ENABLED=true` + JWT admin)
`GET /debug/info`	Runtime information
`GET /debug/pprof/`	Go pprof (rate-limited)
`PUT /debug/log-level`	Dynamic log level
`GET /debug/services`	Cross-service health probe

Roadmap

Version	Timeline	Focus
v0.1 MVP	2026 Q2	Core management, basic scheduling, auth, monitoring
v1.0	2026 Q3	Multi-cloud SDK integration, Istio Ambient, AIOps
v2.0	2026 Q4	Full Agent system, edge deployment, cost optimization
v3.0	2027 H1	Cross-cloud security, heterogeneous scheduling, plugin ecosystem

Contributing

See CONTRIBUTING.md for guidelines.

License

Apache License 2.0

AtomGit开源社区

AtomGit 是由开放原子开源基金会联合 CSDN 等生态伙伴共同推出的新一代开源与人工智能协作平台。平台坚持“开放、中立、公益”的理念，把代码托管、模型共享、数据集托管、智能体开发体验和算力服务整合在一起，为开发者提供从开发、训练到部署的一站式体验。

更多推荐

《数据库性能飞跃：SQL优化与Explain实战指南》

AtomGit开源社区

酷睿Ultra加持下的2026学习利器：三款长续航轻薄本如何重新定义校园效率

这意味着繁重的AI推理任务由NPU独立承担，不仅响应速度更快，而且不占用CPU资源，不增加风扇噪音，更不显著消耗电量，让学生在图书馆也能安静地运行本地AI工具。其中，联想小新Pro 16 GT AI元启版、华硕无畏Pro14酷睿版2026以及惠普HP星Book Pro 16 2026三款机型，凭借各自在续航、便携及专业适配上的独特优势，为不同专业背景的学生提供了多样化的解决方案。正是这种精细化的功