GitOps at Scale: Multi-Cluster Management with Flux v2 and Cluster API
Managing 50+ Kubernetes clusters with GitOps using Flux v2 and Cluster API—declarative multi-cluster orchestration, automated cluster lifecycle, and enterprise-scale deployment patterns.
Let me paint you a picture: you’re managing 10 Kubernetes clusters. Dev, staging, prod—multiply that across a few regions and teams. You’re probably thinking “this is manageable.” Then your organization scales. Suddenly it’s 30 clusters. Then 50. Then you’re at 100+ clusters spread across multiple cloud providers, regions, and business units.
I’ve been there. And I can tell you: the tooling that worked for 5 clusters absolutely breaks down at 50. Manual kubectl applies? Forget it. Helm releases managed per-cluster? Nightmare. The only way to maintain sanity at scale is GitOps with proper multi-cluster orchestration.
That’s where Flux v2 and Cluster API come in. I’ve used this stack to manage 120+ production clusters for a global SaaS company, and it’s the only approach I’ve found that actually scales.
Why Multi-Cluster GitOps is Non-Negotiable
First, let’s talk about why you need this. If you’re thinking “we only have a few clusters, GitOps seems like overkill”—I hear you. But here’s what happens as you grow:
Problems at 5-10 clusters:
- Configuration drift: Someone does a quick kubectl edit in production
- Deployment inconsistency: Staging has v2.1, prod has v2.0, dev has v2.2
- Recovery time: A cluster goes down; how long to rebuild it identically?
- Knowledge silos: “Only Bob knows how to deploy to cluster X”
Problems at 50+ clusters:
- Everything above, but exponentially worse
- Manual deployments become impossible
- Security patches take weeks to roll out across all clusters
- Disaster recovery is a multi-day process
- Compliance audits are a nightmare (prove every cluster has the same security policies)
GitOps solves all of this: Git is the single source of truth, automated reconciliation ensures consistency, and cluster provisioning becomes declarative.
The Stack: Flux v2 + Cluster API + Kustomize
Let me show you the architecture I’ve deployed for enterprise multi-cluster management:
graph TB
subgraph GitRepo["Git Repository (GitHub/GitLab)"]
Management["Management Config
flux-system/"]
Clusters["Cluster Definitions
clusters/"]
Apps["Applications
apps/"]
Infra["Infrastructure
infrastructure/"]
end
subgraph ManagementCluster["Management Cluster"]
FluxMgmt["Flux Controllers"]
CAPI["Cluster API Controllers
(CAPZ for Azure)"]
FluxMgmt -->|"watches"| GitRepo
FluxMgmt -->|"creates"| CAPI
end
subgraph Azure["Azure Cloud"]
subgraph ProdClusters["Production Clusters"]
Prod1["Prod East"]
Prod2["Prod West"]
Prod3["Prod EU"]
end
subgraph StagingClusters["Staging Clusters"]
Stage1["Stage East"]
Stage2["Stage West"]
end
subgraph DevClusters["Development Clusters"]
Dev1["Dev Team A"]
Dev2["Dev Team B"]
Dev3["Dev Team C"]
end
end
CAPI -->|"provisions via ARM"| Azure
FluxMgmt -->|"deploys apps"| ProdClusters
FluxMgmt -->|"deploys apps"| StagingClusters
FluxMgmt -->|"deploys apps"| DevClusters
ProdClusters -->|"pull config"| GitRepo
StagingClusters -->|"pull config"| GitRepo
DevClusters -->|"pull config"| GitRepo
style ManagementCluster fill:#e1f5ff
style Azure fill:#d4edda
style GitRepo fill:#fff3cd
The key insight: one management cluster orchestrates everything. Workload clusters pull their config from Git. If a cluster dies, Cluster API recreates it automatically. Git stays the source of truth.
Setting Up the Management Cluster
This is your control plane. I typically run it as a dedicated AKS cluster with high availability.
Bootstrap Flux v2
# Install Flux CLI
curl -s https://fluxcd.io/install.sh | sudo bash
# Verify installation
flux --version
# Bootstrap Flux to your management cluster
export GITHUB_TOKEN=<your-token>
export GITHUB_USER=<your-username>
export GITHUB_REPO=fleet-management
flux bootstrap github \
--owner=$GITHUB_USER \
--repository=$GITHUB_REPO \
--branch=main \
--path=clusters/management \
--personal \
--components-extra=image-reflector-controller,image-automation-controller
This creates the Flux controllers, commits the manifests to Git, and sets up the GitOps loop. From now on, all changes go through Git—no more manual kubectl.
Install Cluster API Providers
# Enable experimental Cluster API features (MachinePool support is used below)
export CLUSTER_TOPOLOGY=true
export EXP_MACHINE_POOL=true
# Install clusterctl
curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.6.0/clusterctl-linux-amd64 -o clusterctl
chmod +x clusterctl
sudo mv clusterctl /usr/local/bin/
# Initialize Cluster API with Azure provider (CAPZ)
export AZURE_SUBSCRIPTION_ID=<subscription-id>
export AZURE_TENANT_ID=<tenant-id>
export AZURE_CLIENT_ID=<client-id>
export AZURE_CLIENT_SECRET=<client-secret>
clusterctl init --infrastructure azure
Now your management cluster can provision workload clusters on Azure via declarative manifests. No more clicking through the Azure Portal.
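One wrinkle worth calling out: depending on your CAPZ version, the service principal credentials above are consumed through an AzureClusterIdentity object that the cluster definitions reference via spec.identityRef. A rough sketch, with object and Secret names I made up for illustration:

# azure-cluster-identity.yaml (illustrative names)
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureClusterIdentity
metadata:
  name: cluster-identity
  namespace: default
spec:
  type: ServicePrincipal
  tenantID: ${AZURE_TENANT_ID}
  clientID: ${AZURE_CLIENT_ID}
  clientSecret:
    name: cluster-identity-secret   # Kubernetes Secret holding the client secret
    namespace: default

Check your CAPZ version's docs for whether identityRef is required; older releases pick the credentials up from the clusterctl init environment instead.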
Defining Clusters as Code
This is where it gets powerful. Here’s a complete AKS cluster definition using Cluster API:
# clusters/prod-east/cluster.yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
name: prod-east
namespace: default
spec:
clusterNetwork:
pods:
cidrBlocks:
- 10.244.0.0/16
services:
cidrBlocks:
- 10.96.0.0/16
  controlPlaneRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureManagedControlPlane
    name: prod-east-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureManagedCluster
    name: prod-east
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedCluster
metadata:
  name: prod-east
  namespace: default
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedControlPlane
metadata:
name: prod-east-control-plane
namespace: default
spec:
  location: eastus
  resourceGroupName: prod-east-rg
  subscriptionID: ${AZURE_SUBSCRIPTION_ID}
  virtualNetwork:
    name: prod-east-vnet
    cidrBlock: 10.0.0.0/16
version: v1.28.3
networkPlugin: azure
networkPolicy: calico
sku:
tier: Standard
identity:
type: SystemAssigned
aadProfile:
managed: true
enableAzureRBAC: true
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedMachinePool
metadata:
name: prod-east-pool0
namespace: default
spec:
mode: System
sku: Standard_D4s_v5
osDiskSizeGB: 128
scaling:
minSize: 3
maxSize: 10
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
name: prod-east-pool0
namespace: default
spec:
clusterName: prod-east
replicas: 3
  template:
    spec:
      bootstrap:
        dataSecretName: ""
      clusterName: prod-east
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedMachinePool
name: prod-east-pool0
version: v1.28.3
Commit this to Git, and Cluster API provisions the entire AKS cluster—VNet, node pools, RBAC, everything. Want 10 more clusters? Copy this file, change the name and region, commit. Done.
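One detail the YAML above glosses over: something has to apply these Cluster API manifests to the management cluster in the first place. I handle that with another Flux Kustomization under clusters/management/ that watches the cluster definition path. Roughly, with a file name and Secret name of my own choosing:

# clusters/management/prod-east-cluster.yaml (illustrative)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: prod-east-cluster
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/prod-east
  prune: true                   # removing the cluster manifests from Git deprovisions the cluster
  sourceRef:
    kind: GitRepository
    name: flux-system           # created by flux bootstrap
  postBuild:
    substituteFrom:
      - kind: Secret
        name: azure-account     # illustrative Secret supplying AZURE_SUBSCRIPTION_ID

Since clusters/prod-east/ also contains flux-config.yaml, which belongs on the workload cluster's own Flux, I add a kustomization.yaml in that directory so the management cluster only picks up the Cluster API objects.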
Multi-Cluster Application Deployment
Now comes the magic: deploying applications across all clusters from a single Git repo.
Repository Structure
fleet-management/
├── clusters/
│ ├── management/
│ │ └── flux-system/ # Flux controllers
│ ├── prod-east/
│ │ ├── cluster.yaml # Cluster API definition
│ │ └── flux-config.yaml # Flux configuration for this cluster
│ ├── prod-west/
│ ├── stage-east/
│ └── dev-team-a/
├── apps/
│ ├── base/ # Common application manifests
│ │ ├── api/
│ │ ├── frontend/
│ │ └── worker/
│ ├── overlays/
│ │ ├── production/ # Production-specific config
│ │ ├── staging/
│ │ └── development/
├── infrastructure/
│ ├── monitoring/ # Prometheus, Grafana
│ ├── logging/ # Loki, Promtail
│ ├── ingress/ # NGINX Ingress Controller
│ └── cert-manager/ # Let's Encrypt integration
└── policies/
├── network-policies/
├── pod-security-policies/
└── opa-policies/
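The overlays are plain Kustomize. As a rough sketch of what apps/overlays/production might contain (the bases mirror the tree above; the patch values are illustrative):

# apps/overlays/production/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/api
  - ../../base/frontend
  - ../../base/worker
patches:
  # Production runs more replicas than the base manifests define
  - target:
      kind: Deployment
      name: api
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 6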
Flux Kustomization for Multi-Cluster Deployment
# clusters/prod-east/flux-config.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: fleet-infra
namespace: flux-system
spec:
interval: 1m
url: https://github.com/myorg/fleet-management
ref:
branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: infrastructure
namespace: flux-system
spec:
interval: 10m
path: ./infrastructure
prune: true
sourceRef:
kind: GitRepository
name: fleet-infra
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: ingress-nginx-controller
namespace: ingress-nginx
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: applications
namespace: flux-system
spec:
interval: 5m
path: ./apps/overlays/production
prune: true
sourceRef:
kind: GitRepository
name: fleet-infra
dependsOn:
- name: infrastructure
postBuild:
substitute:
CLUSTER_NAME: prod-east
CLUSTER_REGION: eastus
ENVIRONMENT: production
Key features:
- Infrastructure deployed first (ingress, monitoring, etc.)
- Applications deployed after infrastructure is healthy (dependsOn)
- Cluster-specific variables injected via postBuild.substitute (see the sketch below for how base manifests consume them)
- Failed health checks halt the rollout; rolling back is a git revert away
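To make the substitution concrete, here's a minimal sketch of a base manifest that consumes those variables; the ConfigMap name and keys are mine, not from the real repo:

# apps/base/api/configmap.yaml (illustrative)
# Flux replaces ${CLUSTER_NAME}, ${CLUSTER_REGION} and ${ENVIRONMENT} at apply
# time using postBuild.substitute from the cluster's Kustomization
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-runtime-config
data:
  cluster: "${CLUSTER_NAME}"        # becomes "prod-east" on that cluster
  region: "${CLUSTER_REGION}"       # becomes "eastus"
  environment: "${ENVIRONMENT}"     # becomes "production"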
Progressive Delivery Across Clusters
Here’s how I roll out changes safely across 50+ clusters:
# apps/base/api/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
spec:
replicas: 3
template:
spec:
containers:
- name: api
        image: myregistry.azurecr.io/api:v2.5.0 # {"$imagepolicy": "flux-system:api-policy"}
Flux ImageUpdateAutomation drives the staged rollout:
# clusters/management/image-update-automation.yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
name: api-policy
namespace: flux-system
spec:
imageRepositoryRef:
name: api
policy:
semver:
range: 2.x.x
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
name: api-update
namespace: flux-system
spec:
interval: 10m
sourceRef:
kind: GitRepository
name: fleet-infra
git:
checkout:
ref:
branch: main
commit:
author:
email: fluxcdbot@users.noreply.github.com
name: fluxcdbot
messageTemplate: |
Automated image update
Automation name: {{ .AutomationObject }}
Files:
{{ range $filename, $_ := .Updated.Files -}}
- {{ $filename }}
{{ end }}
push:
branch: main
update:
path: ./apps/base/api
strategy: Setters
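The ImagePolicy above points at an ImageRepository named api that I haven't shown; it would look roughly like this, with the registry matching the Deployment example and a scan interval of my choosing:

# clusters/management/image-repository.yaml (illustrative path)
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: api
  namespace: flux-system
spec:
  image: myregistry.azurecr.io/api
  interval: 5m   # how often Flux scans the registry for new tags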
This workflow:
- Flux monitors your container registry for new images
- When a new semver tag appears (e.g., v2.5.1), Flux updates the Git repo
- Dev clusters pick it up first because they reconcile most frequently (interval: 1m)
- Staging clusters follow within the hour (interval: 60m)
- Prod clusters follow within 24 hours (interval: 1440m)
- If health checks fail at any stage, the rollout stops there
No manual promotion. No forgotten clusters. Git controls everything.
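The staggering isn't a special feature; it falls out of the reconcile interval on each cluster's applications Kustomization. A minimal sketch for a dev cluster, with values that are illustrative rather than prescriptive:

# clusters/dev-team-a/flux-config.yaml (illustrative excerpt)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: applications
  namespace: flux-system
spec:
  interval: 1m                       # staging clusters use 60m, prod clusters 1440m
  path: ./apps/overlays/development
  prune: true
  wait: true                         # don't report Ready until workloads pass their checks
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: fleet-infra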
Cluster Lifecycle Automation
One of the most powerful aspects of this setup: clusters become ephemeral. Need a cluster for a 2-week project? Create a YAML file, commit, done. Project over? Delete the YAML, commit, cluster destroyed.
Automated Cluster Creation
#!/bin/bash
# create-cluster.sh
set -euo pipefail

CLUSTER_NAME=$1
REGION=$2
ENVIRONMENT=$3

# Ensure the cluster directory exists before writing manifests
mkdir -p clusters/${CLUSTER_NAME}

cat <<EOF > clusters/${CLUSTER_NAME}/cluster.yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
name: ${CLUSTER_NAME}
namespace: default
labels:
environment: ${ENVIRONMENT}
region: ${REGION}
spec:
# ... full Cluster API spec
EOF
cat <<EOF > clusters/${CLUSTER_NAME}/flux-config.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: applications
namespace: flux-system
spec:
interval: 5m
path: ./apps/overlays/${ENVIRONMENT}
# ... Flux config
EOF
git add clusters/${CLUSTER_NAME}/
git commit -m "Add cluster: ${CLUSTER_NAME}"
git push
echo "Cluster ${CLUSTER_NAME} will be provisioned in ~10 minutes"
Run this script:
./create-cluster.sh dev-team-c westus development
In 10 minutes, you have a fully configured AKS cluster with Flux installed, applications deployed, monitoring enabled, and policies enforced. All from Git.
Automated Cluster Upgrades
Kubernetes version upgrades across 50+ clusters used to be a months-long project. With Cluster API, it’s a Git commit:
# Upgrade all production clusters to 1.28.5
find clusters/ -name "cluster.yaml" -path "*/prod-*/*" -exec \
sed -i 's/version: v1.28.3/version: v1.28.5/g' {} \;
git add clusters/
git commit -m "Upgrade prod clusters to Kubernetes 1.28.5"
git push
Cluster API performs rolling upgrades. If a cluster upgrade fails, it doesn’t proceed to the next. Rollback is another Git commit.
Disaster Recovery
Here’s the nightmare scenario: your entire production region goes down. How fast can you recover?
Before GitOps: Days. Maybe weeks. Manual cluster recreation, manual app deployments, configuration drift, data loss.
With Flux + Cluster API: Hours. Maybe minutes if you’re prepared.
Here’s the DR procedure I’ve tested in production:
# 1. Point to backup Git repo (if primary is down)
flux create source git fleet-infra \
--url=https://github.com/myorg/fleet-management-backup \
--branch=main
# 2. Cluster API recreates clusters from Git
kubectl apply -f clusters/prod-east/cluster.yaml
# 3. Once cluster is provisioned, Flux auto-deploys everything
# (5-10 minutes for cluster, 2-3 minutes for apps)
# 4. Restore data from Azure Backup
# (Depends on RPO, typically hourly snapshots)
# Total recovery time: 15-30 minutes for compute, +data restore time
I’ve run this DR drill quarterly for the last 2 years. It works. Consistently.
Cost Savings at Scale
Let me talk numbers. Managing 100 clusters manually:
- Engineer time: 4-5 full-time SREs just doing deployments and cluster maintenance
- Annual cost: $150/hr × 4 engineers × 2080 hours/year ≈ $1.25M/year
- Deployment lead time: 2-4 weeks for cross-cluster rollouts
- Incident response time: 2-6 hours (manual cluster recreation)
With Flux + Cluster API:
- Engineer time: 1.5 FTEs (platform engineering focus, not firefighting)
- Annual cost: $150/hr × 1.5 engineers × 2080 hours/year ≈ $468K/year
- Deployment lead time: 5-30 minutes (progressive delivery)
- Incident response time: 15-30 minutes (automated cluster recreation)
Annual savings: ~$780K in engineering cost, plus reduced downtime and faster feature velocity.
Key Takeaways
- Multi-cluster management at scale requires GitOps—manual operations don’t scale beyond 10-15 clusters
- Flux v2 + Cluster API is the production-proven stack—thousands of organizations running this in prod
- Declarative cluster provisioning is a game-changer—clusters become YAML files in Git, not Azure Portal clicks
- Progressive delivery prevents catastrophic failures—roll out to dev → staging → prod automatically, with health checks
- Disaster recovery becomes routine—tested quarterly, executed in minutes, not days
- Cost savings are substantial—80%+ reduction in operational overhead at enterprise scale
- Platform engineering mindset required: this isn’t “DevOps with fancy tools”, it’s treating infrastructure as a product
If you’re managing more than 10 clusters and still doing manual deployments, you’re burning money and time. GitOps with Flux and Cluster API isn’t optional at scale—it’s the only approach that actually works.
Scaling to multi-cluster GitOps? I’ve deployed Flux + Cluster API across organizations managing 100+ clusters. Let’s discuss your specific multi-cluster architecture and migration strategy.