GitOps at Scale: Multi-Cluster Management with Flux v2 and Cluster API

Managing 50+ Kubernetes clusters with GitOps using Flux v2 and Cluster API—declarative multi-cluster orchestration, automated cluster lifecycle, and enterprise-scale deployment patterns.

Let me paint you a picture: you’re managing 10 Kubernetes clusters. Dev, staging, prod—multiply that across a few regions and teams. You’re probably thinking “this is manageable.” Then your organization scales. Suddenly it’s 30 clusters. Then 50. Then you’re at 100+ clusters spread across multiple cloud providers, regions, and business units.

I’ve been there. And I can tell you: the tooling that worked for 5 clusters absolutely breaks down at 50. Manual kubectl applies? Forget it. Helm releases managed per-cluster? Nightmare. The only way to maintain sanity at scale is GitOps with proper multi-cluster orchestration.

That’s where Flux v2 and Cluster API come in. I’ve used this stack to manage 120+ production clusters for a global SaaS company, and it’s the only approach I’ve found that actually scales.

Why Multi-Cluster GitOps is Non-Negotiable

First, let’s talk about why you need this. If you’re thinking “we only have a few clusters, GitOps seems like overkill”—I hear you. But here’s what happens as you grow:

Problems at 5-10 clusters:

  • Configuration drift: Someone does a quick kubectl edit in production
  • Deployment inconsistency: Staging has v2.1, prod has v2.0, dev has v2.2
  • Recovery time: A cluster goes down, how long to rebuild it identically?
  • Knowledge silos: “Only Bob knows how to deploy to cluster X”

Problems at 50+ clusters:

  • Everything above, but exponentially worse
  • Manual deployments become impossible
  • Security patches take weeks to roll out across all clusters
  • Disaster recovery is a multi-day process
  • Compliance audits are a nightmare (prove every cluster has the same security policies)

GitOps solves all of this: Git is the single source of truth, automated reconciliation ensures consistency, and cluster provisioning becomes declarative.

The Stack: Flux v2 + Cluster API + Kustomize

Let me show you the architecture I’ve deployed for enterprise multi-cluster management:

graph TB
    subgraph GitRepo["Git Repository (GitHub/GitLab)"]
        Management["Management Config<br/>flux-system/"]
        Clusters["Cluster Definitions<br/>clusters/"]
        Apps["Applications<br/>apps/"]
        Infra["Infrastructure<br/>infrastructure/"]
    end

    subgraph ManagementCluster["Management Cluster"]
        FluxMgmt["Flux Controllers"]
        CAPI["Cluster API Controllers<br/>(CAPZ for Azure)"]
        FluxMgmt -->|"watches"| GitRepo
        FluxMgmt -->|"creates"| CAPI
    end

    subgraph Azure["Azure Cloud"]
        subgraph ProdClusters["Production Clusters"]
            Prod1["Prod East"]
            Prod2["Prod West"]
            Prod3["Prod EU"]
        end
        subgraph StagingClusters["Staging Clusters"]
            Stage1["Stage East"]
            Stage2["Stage West"]
        end
        subgraph DevClusters["Development Clusters"]
            Dev1["Dev Team A"]
            Dev2["Dev Team B"]
            Dev3["Dev Team C"]
        end
    end

    CAPI -->|"provisions via ARM"| Azure
    FluxMgmt -->|"deploys apps"| ProdClusters
    FluxMgmt -->|"deploys apps"| StagingClusters
    FluxMgmt -->|"deploys apps"| DevClusters
    ProdClusters -->|"pull config"| GitRepo
    StagingClusters -->|"pull config"| GitRepo
    DevClusters -->|"pull config"| GitRepo

    style ManagementCluster fill:#e1f5ff
    style Azure fill:#d4edda
    style GitRepo fill:#fff3cd

The key insight: one management cluster orchestrates everything. Workload clusters pull their config from Git. If a cluster dies, Cluster API recreates it automatically. Git stays the source of truth.

Setting Up the Management Cluster

This is your control plane. I typically run it as a dedicated AKS cluster with high availability.

Bootstrap Flux v2

# Install Flux CLI
curl -s https://fluxcd.io/install.sh | sudo bash

# Verify installation
flux --version

# Bootstrap Flux to your management cluster
export GITHUB_TOKEN=<your-token>
export GITHUB_USER=<your-username>
export GITHUB_REPO=fleet-management

flux bootstrap github \
  --owner=$GITHUB_USER \
  --repository=$GITHUB_REPO \
  --branch=main \
  --path=clusters/management \
  --personal \
  --components-extra=image-reflector-controller,image-automation-controller

This creates the Flux controllers, commits the manifests to Git, and sets up the GitOps loop. From now on, all changes go through Git—no more manual kubectl.
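
For orientation, the sync manifests bootstrap commits under clusters/management/flux-system/ look roughly like this (exact values depend on your repository and bootstrap flags):

# clusters/management/flux-system/gotk-sync.yaml (generated by flux bootstrap)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m
  ref:
    branch: main
  secretRef:
    name: flux-system
  url: ssh://git@github.com/<your-username>/fleet-management
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/management
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system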

Install Cluster API Providers

# Enable experimental Cluster API features (read by clusterctl init below);
# MachinePool support is required for the AKS machine pools defined later
export CLUSTER_TOPOLOGY=true
export EXP_MACHINE_POOL=true

# Install clusterctl
curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.6.0/clusterctl-linux-amd64 -o clusterctl
chmod +x clusterctl
sudo mv clusterctl /usr/local/bin/

# Initialize Cluster API with Azure provider (CAPZ)
export AZURE_SUBSCRIPTION_ID=<subscription-id>
export AZURE_TENANT_ID=<tenant-id>
export AZURE_CLIENT_ID=<client-id>
export AZURE_CLIENT_SECRET=<client-secret>

clusterctl init --infrastructure azure

Now your management cluster can provision workload clusters on Azure via declarative manifests. No more clicking through the Azure Portal.
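
Recent CAPZ versions expect cluster credentials to come from an AzureClusterIdentity rather than raw environment variables, so workload cluster specs can point at a shared identity via identityRef. A minimal sketch, assuming you store the service principal secret as cluster-identity-secret (all names here are placeholders):

# clusters/management/azure-cluster-identity.yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-identity-secret
  namespace: default
type: Opaque
stringData:
  clientSecret: <client-secret>
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureClusterIdentity
metadata:
  name: cluster-identity
  namespace: default
spec:
  type: ServicePrincipal
  tenantID: <tenant-id>
  clientID: <client-id>
  clientSecret:
    name: cluster-identity-secret
    namespace: default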

Defining Clusters as Code

This is where it gets powerful. Here’s a complete AKS cluster definition using Cluster API:

# clusters/prod-east/cluster.yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-east
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 10.244.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/16
  controlPlaneRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureManagedControlPlane
    name: prod-east-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureManagedCluster
    name: prod-east
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedCluster
metadata:
  name: prod-east
  namespace: default
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedControlPlane
metadata:
  name: prod-east-control-plane
  namespace: default
spec:
  location: eastus
  resourceGroupName: prod-east-rg
  subscriptionID: ${AZURE_SUBSCRIPTION_ID}
  version: v1.28.3
  virtualNetwork:
    name: prod-east-vnet
    cidrBlock: 10.0.0.0/16
  networkPlugin: azure
  networkPolicy: calico
  sku:
    tier: Standard
  identity:
    type: SystemAssigned
  aadProfile:
    managed: true
    enableAzureRBAC: true
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedMachinePool
metadata:
  name: prod-east-pool0
  namespace: default
spec:
  mode: System
  sku: Standard_D4s_v5
  osDiskSizeGB: 128
  scaling:
    minSize: 3
    maxSize: 10
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: prod-east-pool0
  namespace: default
spec:
  clusterName: prod-east
  replicas: 3
  template:
    spec:
      clusterName: prod-east
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AzureManagedMachinePool
        name: prod-east-pool0
      version: v1.28.3

Commit this to Git, and Cluster API provisions the entire AKS cluster—VNet, node pools, RBAC, everything. Want 10 more clusters? Copy this file, change the name and region, commit. Done.
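
For the commit to actually do something, the management cluster's Flux needs a Kustomization that applies the cluster definitions. A sketch, assuming the flux-system GitRepository created by bootstrap and a Secret holding the value for the ${AZURE_SUBSCRIPTION_ID} placeholder (names are illustrative):

# clusters/management/cluster-definitions.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-definitions
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/prod-east
  prune: true                  # deleting the YAML from Git later deprovisions the cluster
  sourceRef:
    kind: GitRepository
    name: flux-system
  postBuild:
    substituteFrom:
    - kind: Secret
      name: azure-cluster-vars   # contains AZURE_SUBSCRIPTION_ID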

Multi-Cluster Application Deployment

Now comes the magic: deploying applications across all clusters from a single Git repo.

Repository Structure

fleet-management/
├── clusters/
│   ├── management/
│   │   └── flux-system/          # Flux controllers
│   ├── prod-east/
│   │   ├── cluster.yaml          # Cluster API definition
│   │   └── flux-config.yaml      # Flux configuration for this cluster
│   ├── prod-west/
│   ├── stage-east/
│   └── dev-team-a/
├── apps/
│   ├── base/                      # Common application manifests
│   │   ├── api/
│   │   ├── frontend/
│   │   └── worker/
│   ├── overlays/
│   │   ├── production/           # Production-specific config
│   │   ├── staging/
│   │   └── development/
├── infrastructure/
│   ├── monitoring/                # Prometheus, Grafana
│   ├── logging/                   # Loki, Promtail
│   ├── ingress/                   # NGINX Ingress Controller
│   └── cert-manager/              # Let's Encrypt integration
└── policies/
    ├── network-policies/
    ├── pod-security-policies/
    └── opa-policies/
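
Each overlay is plain Kustomize. One way the production overlay might reference the shared bases and apply environment-specific patches (file contents are illustrative):

# apps/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base/api
- ../../base/frontend
- ../../base/worker
patches:
- target:
    kind: Deployment
    name: api
  patch: |-
    - op: replace
      path: /spec/replicas
      value: 6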

Flux Kustomization for Multi-Cluster Deployment

# clusters/prod-east/flux-config.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: fleet-infra
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/myorg/fleet-management
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure
  prune: true
  sourceRef:
    kind: GitRepository
    name: fleet-infra
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: ingress-nginx-controller
    namespace: ingress-nginx
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: applications
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: fleet-infra
  dependsOn:
  - name: infrastructure
  postBuild:
    substitute:
      CLUSTER_NAME: prod-east
      CLUSTER_REGION: eastus
      ENVIRONMENT: production

Key features:

  • Infrastructure deployed first (ingress, monitoring, etc.)
  • Applications deployed only after infrastructure is healthy (dependsOn plus healthChecks)
  • Cluster-specific variables injected via postBuild.substitute (consumed as shown in the snippet below)
  • Failed health checks mark the Kustomization as not ready, blocking dependent Kustomizations until the change is fixed or reverted in Git
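
Those variables are plain ${VAR} placeholders in the manifests, which the kustomize-controller replaces at apply time. A small illustrative example (this ConfigMap is not part of the repository layout above):

# apps/base/api/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-config
data:
  clusterName: ${CLUSTER_NAME}
  region: ${CLUSTER_REGION}
  environment: ${ENVIRONMENT}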

Progressive Delivery Across Clusters

Here’s how I roll out changes safely across 50+ clusters:

# apps/base/api/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: myregistry.azurecr.io/api:v2.5.0 # {"$imagepolicy": "flux-system:api-policy"}

Flux ImageUpdateAutomation for canary rollouts:

# clusters/management/image-update-automation.yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: api-policy
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: api
  policy:
    semver:
      range: 2.x.x
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: api-update
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: fleet-infra
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        email: fluxcdbot@users.noreply.github.com
        name: fluxcdbot
      messageTemplate: |
        Automated image update

        Automation name: {{ .AutomationObject }}

        Files:
        {{ range $filename, $_ := .Updated.Files -}}
        - {{ $filename }}
        {{ end }}
    push:
      branch: main
  update:
    path: ./apps/base/api
    strategy: Setters
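
The ImagePolicy above refers to an ImageRepository named api that isn't shown; that's the object that actually scans the registry. A minimal definition would look something like this (private registries also need credentials via secretRef or provider-based auth, not shown):

# clusters/management/image-repository.yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: api
  namespace: flux-system
spec:
  image: myregistry.azurecr.io/api
  interval: 5m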

This workflow:

  1. Flux monitors your container registry for new images
  2. When a new semver tag appears (e.g., v2.5.1), Flux updates the Git repo
  3. Dev clusters pick the change up first because they reconcile most frequently (interval: 1m)
  4. Staging clusters follow within an hour at most (interval: 1h)
  5. Production clusters follow within 24 hours at most (interval: 24h, set per cluster in flux-config.yaml as sketched below; the prod-east example above uses 5m)
  6. If health checks fail at any stage, that cluster's rollout stops and dependent Kustomizations stay blocked

No manual promotion. No forgotten clusters. Git controls everything.
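
The staggering itself is just a different interval in each environment's applications Kustomization, for example (values are illustrative):

# clusters/stage-east/flux-config.yaml (staging lags dev by up to an hour)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: applications
  namespace: flux-system
spec:
  interval: 1h               # prod clusters would use 24h, dev clusters 1m
  path: ./apps/overlays/staging
  prune: true
  sourceRef:
    kind: GitRepository
    name: fleet-infra
  dependsOn:
  - name: infrastructure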

Cluster Lifecycle Automation

One of the most powerful aspects of this setup: clusters become ephemeral. Need a cluster for a 2-week project? Create a YAML file, commit, done. Project over? Delete the YAML, commit, cluster destroyed.

Automated Cluster Creation

# create-cluster.sh
#!/bin/bash
set -euo pipefail

CLUSTER_NAME=$1
REGION=$2
ENVIRONMENT=$3

# The target directory must exist before the heredocs below write into it
mkdir -p clusters/${CLUSTER_NAME}

cat <<EOF > clusters/${CLUSTER_NAME}/cluster.yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: ${CLUSTER_NAME}
  namespace: default
  labels:
    environment: ${ENVIRONMENT}
    region: ${REGION}
spec:
  # ... full Cluster API spec
EOF

cat <<EOF > clusters/${CLUSTER_NAME}/flux-config.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: applications
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/overlays/${ENVIRONMENT}
  # ... Flux config
EOF

git add clusters/${CLUSTER_NAME}/
git commit -m "Add cluster: ${CLUSTER_NAME}"
git push

echo "Cluster ${CLUSTER_NAME} will be provisioned in ~10 minutes"

Run this script:

./create-cluster.sh dev-team-c westus development

In 10 minutes, you have a fully configured AKS cluster with Flux installed, applications deployed, monitoring enabled, and policies enforced. All from Git.
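
One detail the script glosses over: something still has to connect Flux to the new cluster. You can bootstrap Flux onto each workload cluster, or, as a sketch of an alternative, let the management cluster's Flux apply manifests remotely using the kubeconfig Secret that Cluster API creates in the Cluster's namespace (named <cluster-name>-kubeconfig). The names and paths below are illustrative:

# clusters/management/dev-team-c-apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: dev-team-c-apps
  namespace: default            # same namespace as the Cluster object and its kubeconfig Secret
spec:
  interval: 5m
  path: ./apps/overlays/development
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
    namespace: flux-system      # cross-namespace source reference
  kubeConfig:
    secretRef:
      name: dev-team-c-kubeconfig   # created by Cluster API alongside the cluster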

Automated Cluster Upgrades

Kubernetes version upgrades across 50+ clusters used to be a months-long project. With Cluster API, it’s a Git commit:

# Upgrade all production clusters to 1.28.5
find clusters/ -name "cluster.yaml" -path "*/prod-*/*" -exec \
  sed -i 's/version: v1.28.3/version: v1.28.5/g' {} \;

git add clusters/
git commit -m "Upgrade prod clusters to Kubernetes 1.28.5"
git push

Cluster API performs a rolling upgrade of each cluster's node pools, so workloads stay up while nodes are replaced. Clusters changed in the same commit upgrade in parallel; if you want to roll out in waves, split the change across commits or directories. Rollback is another Git commit.

Disaster Recovery

Here’s the nightmare scenario: your entire production region goes down. How fast can you recover?

Before GitOps: Days. Maybe weeks. Manual cluster recreation, manual app deployments, configuration drift, data loss.

With Flux + Cluster API: Hours. Maybe minutes if you’re prepared.

Here’s the DR procedure I’ve tested in production:

# 1. Point to backup Git repo (if primary is down)
flux create source git fleet-infra \
  --url=https://github.com/myorg/fleet-management-backup \
  --branch=main

# 2. Cluster API recreates clusters from Git
kubectl apply -f clusters/prod-east/cluster.yaml

# 3. Once cluster is provisioned, Flux auto-deploys everything
# (5-10 minutes for cluster, 2-3 minutes for apps)

# 4. Restore data from Azure Backup
# (Depends on RPO, typically hourly snapshots)

# Total recovery time: 15-30 minutes for compute, +data restore time

I’ve run this DR drill quarterly for the last 2 years. It works. Consistently.

Cost Savings at Scale

Let me talk numbers. Managing 100 clusters manually:

  • Engineer time: 4-5 full-time SREs just doing deployments and cluster maintenance
  • Annual cost: $150/hr × 4 engineers × 2080 hours/year ≈ $1.25M/year
  • Deployment lead time: 2-4 weeks for cross-cluster rollouts
  • Incident response time: 2-6 hours (manual cluster recreation)

With Flux + Cluster API:

  • Engineer time: 1.5 FTEs (platform engineering focus, not firefighting)
  • Annual cost: $150/hr × 1.5 engineers × 2080 hours/year = $468K/year
  • Deployment lead time: 5-30 minutes (progressive delivery)
  • Incident response time: 15-30 minutes (automated cluster recreation)

Annual savings: ~$780K in engineering cost, plus reduced downtime and faster feature velocity.

Key Takeaways

  • Multi-cluster management at scale requires GitOps—manual operations don’t scale beyond 10-15 clusters
  • Flux v2 + Cluster API is a production-proven stack: Flux is a CNCF-graduated project and Cluster API is a Kubernetes SIG Cluster Lifecycle subproject, both widely run in production
  • Declarative cluster provisioning is a game-changer—clusters become YAML files in Git, not Azure Portal clicks
  • Progressive delivery prevents catastrophic failures—roll out to dev → staging → prod automatically, with health checks
  • Disaster recovery becomes routine—tested quarterly, executed in minutes, not days
  • Cost savings are substantial—80%+ reduction in operational overhead at enterprise scale
  • Platform engineering mindset required—this isn’t “DevOps with fancy tools”, it’s treating infrastructure as product

If you’re managing more than 10 clusters and still doing manual deployments, you’re burning money and time. GitOps with Flux and Cluster API isn’t optional at scale—it’s the only approach that actually works.


Scaling to multi-cluster GitOps? I’ve deployed Flux + Cluster API across organizations managing 100+ clusters. Let’s discuss your specific multi-cluster architecture and migration strategy.