CloudAI Fusion

Cloud-Native AI Unified Management Platform

CI Coverage Go Report License Release


CloudAI Fusion is an open-source platform that unifies cloud-native infrastructure management with AI-powered resource scheduling. It solves three core enterprise pain points:

  • Cloud-native deployment complexity — 68% of enterprises struggle with cluster management
  • Multi-cloud security fragmentation — 86% of organizations use multi-cloud but face siloed security
  • AI resource waste — GPU utilization under 30% in traditional deployment models

Key Features

Feature Description
Multi-Cloud Management Unified API for Alibaba Cloud ACK, AWS EKS, Azure AKS, GCP GKE, Huawei CCE, Tencent TKE
GPU Topology-Aware Scheduling NVLink-aware placement, fine-grained GPU sharing, preemption, RL-based optimization
4 AI Agents Scheduling optimizer, security monitor, cost analyzer, operations automator
LLM Integration OpenAI GPT-4o / DashScope Qwen-Max / Ollama / vLLM with graceful fallback
AI Chat Assistant Conversational operations assistant with LLM-powered incident analysis
eBPF Service Mesh Sidecarless networking via Istio Ambient / Cilium with <1% overhead
Wasm Runtime Millisecond cold-start functions for serverless & edge workloads
Edge-Cloud Architecture Three-tier (Cloud → Edge → Terminal) with 50B-parameter edge model support
Security & Compliance Pod security, network policies, CIS benchmarks, vulnerability scanning, threat detection
Full Observability Prometheus metrics, OpenTelemetry tracing, Grafana dashboards, intelligent alerting
Feature Toggles Runtime feature flags with profiles (minimal / standard / full) for modular deployment
Docker Optimized Multi-stage builds, distroless images for Go services, GPU image reduced from 5-8GB to ~2-3GB

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         CloudAI Fusion                              │
├─────────────┬──────────────┬─────────────┬─────────────────────────┤
│  API Server │  Scheduler   │    Agent    │      AI Engine          │
│   (Go/Gin)  │ (GPU-aware)  │ (DaemonSet) │   (Python/FastAPI)     │
├─────────────┴──────────────┴─────────────┴─────────────────────────┤
│  Auth │ Cloud │ Cluster │ Security │ Monitor │ Mesh │ Wasm │ Edge  │
├───────┴───────┴─────────┴──────────┴─────────┴──────┴──────┴───────┤
│            PostgreSQL  │  Redis  │  Kafka  │  NATS  │  Prometheus     │
└─────────────────────────────────────────────────────────────────────┘

Quick Start

Prerequisites

Run (one command)

Windows:

cd cloudai-fusion
start.bat

Linux / macOS:

cd cloudai-fusion
chmod +x start.sh && ./start.sh

This starts all services (API Server, Scheduler, Agent, AI Engine, PostgreSQL, Redis, Kafka, NATS, Prometheus, Grafana, Jaeger).

Verify

# Health check
curl http://localhost:8080/healthz

# Login (get JWT token)
curl -X POST http://localhost:8080/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"admin123"}'

# List clusters (use token from login response)
curl http://localhost:8080/api/v1/clusters \
  -H "Authorization: Bearer <token>"

Service Endpoints

Service URL
API Server http://localhost:8080
AI Engine http://localhost:8090
Prometheus http://localhost:9090
Grafana http://localhost:3000 (admin / cloudai)
Jaeger http://localhost:16686

Tech Stack

Layer Technology
Backend Go 1.25, Gin, Cobra, Viper, GORM
AI Engine Python 3.11, FastAPI, PyTorch, TensorFlow, NumPy, scikit-learn, OpenAI SDK
Container Kubernetes 1.30+, Istio Ambient, Cilium eBPF
Database PostgreSQL 16, Redis 7
Messaging Apache Kafka, NATS
Monitoring Prometheus, Grafana, OpenTelemetry, Jaeger
CI/CD GitHub Actions, Docker, Helm 3

Project Structure

cloudai-fusion/
├── cmd/                          # Service entry points
│   ├── apiserver/                # API Server
│   ├── scheduler/                # GPU Scheduler
│   ├── agent/                    # Node Agent
│   └── healthcheck/              # Lightweight healthcheck binary (for distroless)
├── pkg/                          # Go packages (48 packages, 47/47 tests pass)
│   ├── api/                      # HTTP routes, handlers, debug endpoints (6-layer auth)
│   ├── auth/                     # JWT + RBAC + OIDC federation (4 roles, 20+ perms)
│   ├── cache/                    # Redis cache + distributed lock + PubSub
│   ├── cloud/                    # Multi-cloud providers (6 clouds)
│   ├── cluster/                  # Kubernetes cluster management
│   ├── config/                   # Unified configuration + JWT secret validation
│   ├── edge/                     # Edge-cloud architecture + 50B model quantization
│   ├── feature/                  # Runtime feature toggles (minimal/standard/full)
│   ├── mesh/                     # eBPF service mesh
│   ├── monitor/                  # Prometheus metrics & alerting
│   ├── scheduler/                # GPU scheduling + queue snapshot persistence
│   ├── security/                 # Policy, scanning, compliance, threats
│   ├── store/                    # Database layer (GORM)
│   └── wasm/                     # WebAssembly container runtime
├── ai/                           # Python AI components
│   ├── agents/                   # Multi-agent FastAPI server + LLM client
│   ├── anomaly/                  # Anomaly detection engine
│   └── scheduler/                # RL scheduling optimizer
├── deploy/helm/                  # Helm chart (7 templates)
├── docker/                       # Dockerfiles (5 services, distroless + multi-stage)
├── monitoring/                   # Prometheus + Grafana configs
├── scripts/                      # env-generate, diagnose, setup-gpu, deploy-cloud
├── api/                          # OpenAPI 3.1 specification
├── .devcontainer/                # Dev Container (Go + Python + full toolchain)
├── docker-compose.yml            # Full-stack local deployment (profiles support)
├── docker-compose.gpu.yml        # GPU overlay (NVIDIA runtime)
├── Makefile                      # Build, test, deploy, diagnose commands
└── start.bat / start.sh          # One-click start (--gpu, --minimal flags)

Development

# First-time setup (generate .env with secure secrets)
make setup

# Build all Go binaries
make build

# Run tests
make test

# Run tests with coverage report
make coverage-report

# Run linter
make lint

# Build Docker images (optimized: distroless for Go, multi-stage for AI)
make docker-build

# Deploy to Kubernetes
make helm-install

# Start with GPU support
./start.sh --gpu          # Linux / macOS
start.bat --gpu           # Windows (WSL2 + NVIDIA)

# Start minimal mode (core services only)
./start.sh --minimal

# Health diagnostics
make diagnose

# Feature flag management
make features-list

API Overview

Full specification: api/openapi.yaml

Endpoint Group Description
POST /api/v1/auth/login JWT authentication
GET /api/v1/clusters List managed K8s clusters
GET /api/v1/providers List cloud providers
POST /api/v1/workloads Submit AI workload
GET /api/v1/security/policies Security policies
GET /api/v1/monitoring/alerts/events Alert events
GET /api/v1/cost/summary Cost analysis
GET /api/v1/mesh/status Service mesh status
POST /api/v1/wasm/deploy Deploy Wasm function
GET /api/v1/edge/topology Edge-cloud topology
AI Engine (port 8090)
POST /api/v1/scheduling/optimize LLM-enhanced GPU scheduling
POST /api/v1/anomaly/detect Anomaly detection + threat analysis
POST /api/v1/cost/analyze Cost optimization with LLM insights
GET /api/v1/insights Dynamic AI-powered insights
GET /api/v1/models/status Honest model & LLM status
POST /api/v1/chat Conversational AI assistant
POST /api/v1/ops/incident Incident root cause analysis
POST /api/v1/ops/scaling Predictive scaling
GET /api/v1/ops/history Incident history
Debug (requires CLOUDAI_DEBUG_ENABLED=true + JWT admin)
GET /debug/info Runtime information
GET /debug/pprof/ Go pprof (rate-limited)
PUT /debug/log-level Dynamic log level
GET /debug/services Cross-service health probe

Roadmap

Version Timeline Focus
v0.1 MVP 2026 Q2 Core management, basic scheduling, auth, monitoring
v1.0 2026 Q3 Multi-cloud SDK integration, Istio Ambient, AIOps
v2.0 2026 Q4 Full Agent system, edge deployment, cost optimization
v3.0 2027 H1 Cross-cloud security, heterogeneous scheduling, plugin ecosystem

Contributing

See CONTRIBUTING.md for guidelines.

License

Apache License 2.0

Logo

AtomGit 是由开放原子开源基金会联合 CSDN 等生态伙伴共同推出的新一代开源与人工智能协作平台。平台坚持“开放、中立、公益”的理念,把代码托管、模型共享、数据集托管、智能体开发体验和算力服务整合在一起,为开发者提供从开发、训练到部署的一站式体验。

更多推荐