AKS Upgrade Strategy: Choosing Between Surge, Blue-Green Node Pool, and Blue-Green Cluster Approaches
A comprehensive technical comparison of AKS upgrade strategies including surge upgrades, blue-green node pool migrations, and blue-green cluster approaches with real-world implementation patterns.
Upgrading Azure Kubernetes Service (AKS) clusters is one of those tasks that keeps platform engineers up at night. I’ve seen too many upgrades go sideways because teams didn’t choose the right strategy for their situation. The approach you pick really depends on your workload characteristics, compliance requirements, and how much risk you’re willing to take. In this article, I’ll walk you through a deep technical comparison of three primary upgrade strategies we’ve used in production environments.
Understanding the Upgrade Challenge
Let me start with why we can’t just ignore upgrades. AKS clusters need regular updates for:
- Security patches: CVE remediation in Kubernetes components
- Feature access: New Kubernetes APIs and capabilities
- Support compliance: Microsoft supports only the latest three minor versions
- Performance improvements: Bug fixes and optimization
Here’s the thing: the real challenge isn’t just running the upgrade. It’s doing it without impacting running workloads, maintaining SLAs, and having a solid rollback plan if things go wrong.
Strategy 1: Surge-Based Node Pool Upgrades
How It Works
This is AKS’s default upgrade mechanism, and honestly, it’s what most teams start with. The --max-surge parameter controls how many additional nodes get provisioned during a rolling upgrade.
# Upgrade with surge configuration
az aks nodepool upgrade \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name systemnp \
--kubernetes-version 1.28.5 \
--max-surge 33%
Upgrade Flow:
graph TD
A[Start Upgrade] --> B[Calculate Surge Nodes]
B --> C[Provision New Nodes<br/>with Target K8s Version]
C --> D[Cordon Old Nodes<br/>Prevent New Scheduling]
D --> E[Drain Old Nodes<br/>Respect PDBs]
E --> F[Delete Old Nodes]
F --> G{All Nodes<br/>Upgraded?}
G -->|No| B
G -->|Yes| H[Upgrade Complete]
style A fill:#e1f5ff
style H fill:#d4edda
style C fill:#fff3cd
style E fill:#fff3cd
Upgrade Process Steps:
- Calculate surge nodes: ceil(current_node_count * max_surge_percentage)
- Provision new nodes with target Kubernetes version
- Cordon old nodes (prevent new pod scheduling)
- Drain old nodes (evict pods with PodDisruptionBudget awareness)
- Delete old nodes
- Repeat until all nodes upgraded
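While a surge upgrade is running, I like to watch it from two angles. Here's a minimal monitoring sketch, reusing the resource names from the example above:
# Watch node versions change as surge nodes join and old nodes drain away
kubectl get nodes -o wide --watch
# In another terminal, poll the node pool state and target version
az aks nodepool show \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name systemnp \
--query "{state: provisioningState, version: orchestratorVersion}" \
--output table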
Max-Surge Configuration Best Practices
# Conservative approach for production (10-20%)
az aks nodepool upgrade \
--max-surge 10% \
--kubernetes-version 1.28.5
# Balanced approach (25-33%)
az aks nodepool upgrade \
--max-surge 33% \
--kubernetes-version 1.28.5
# Aggressive approach for non-critical environments (50-100%)
az aks nodepool upgrade \
--max-surge 100% \
--kubernetes-version 1.28.5
Max-Surge Selection Matrix:
| Environment | Max-Surge | Reasoning |
|---|---|---|
| Production (critical) | 10-20% | Minimizes simultaneous node count changes |
| Production (standard) | 25-33% | Balanced speed vs. stability |
| Staging | 50% | Faster upgrades, acceptable risk |
| Development | 100% | Maximum speed |
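If you want the surge setting to persist between upgrades instead of passing it on every upgrade command, recent Azure CLI versions let you set it on the node pool directly. A quick sketch, using the same names as above:
# Persist the surge setting on the node pool (applies to future upgrades)
az aks nodepool update \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name systemnp \
--max-surge 33%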
Terraform Implementation
resource "azurerm_kubernetes_cluster_node_pool" "system" {
name = "systemnp"
kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
vm_size = "Standard_D4s_v5"
node_count = 3
# Upgrade settings
upgrade_settings {
max_surge = "33%" # or use absolute value like "1"
}
# Keep nodes off the public internet
enable_node_public_ip = false
tags = {
Environment = "Production"
ManagedBy = "Terraform"
}
}
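To confirm what surge value a pool will actually use during the next upgrade, you can read the setting back. A quick check, assuming the names above:
# Inspect the effective upgrade settings on the node pool
az aks nodepool show \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name systemnp \
--query upgradeSettings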
Downtime Risk Analysis
Now, here’s where it gets tricky. Several things can go wrong during a surge upgrade:
- Pod eviction failures: If pods don’t respect SIGTERM or have long grace periods
- PodDisruptionBudget violations: Can block drain operations
- Image pull delays: New nodes pulling large images
- Application startup time: Slow readiness probes delay traffic routing
I’ve learned the hard way that you need to prepare for these scenarios. Here are the mitigation strategies we use:
# 1. Configure proper PodDisruptionBudgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: app-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: critical-service
---
# 2. Implement proper lifecycle hooks
apiVersion: apps/v1
kind: Deployment
metadata:
name: critical-service
spec:
replicas: 3
selector:
  matchLabels:
    app: critical-service
template:
  metadata:
    labels:
      app: critical-service
  spec:
containers:
- name: app
image: myapp:v1.0.0
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"] # Allow time for connection draining
terminationGracePeriodSeconds: 60
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
---
# 3. Configure topology spread constraints
apiVersion: apps/v1
kind: Deployment
metadata:
name: critical-service
spec:
template:
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: critical-service
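Before kicking off any drain, it's also worth checking whether a PodDisruptionBudget currently allows zero disruptions, because those are exactly the ones that stall upgrades. A small check, assuming jq is available on your workstation:
# List PDBs that currently allow zero disruptions (these will block node drains)
kubectl get pdb --all-namespaces -o json | \
jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'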
Rollback Pattern
Here’s something that surprises people: surge upgrades don’t support automatic rollback. If something goes wrong, you’ll need to intervene manually:
# Identify problematic version
kubectl get nodes -o wide
# Add new node pool with previous version
az aks nodepool add \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name rollbacknp \
--kubernetes-version 1.27.9 \
--node-count 3
# Cordon upgraded nodes
kubectl cordon -l agentpool=systemnp
# Drain upgraded nodes
kubectl drain -l agentpool=systemnp \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=60
# Delete problematic node pool
az aks nodepool delete \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name systemnp
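Once the rollback pool is in place, verify that workloads actually rescheduled onto it before calling the incident resolved. A quick sanity check:
# Count pods now running on the rollback pool and confirm its nodes are Ready
kubectl get pods --all-namespaces -o wide | grep -c rollbacknp
kubectl get nodes -l agentpool=rollbacknp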
Cost Implications
During Upgrade:
- Surge of 33% on 10-node pool = 3.3 → 4 additional nodes
- Cost window: Duration of upgrade (typically 15-45 minutes)
- Example: D4s_v5 at $0.192/hour × 4 nodes × 0.5 hours = $0.38
Trade-off: A higher surge runs more temporary nodes in parallel but finishes sooner, so the total surge cost often ends up similar or lower; the real price is more simultaneous disruption to workloads.
Strategy 2: Blue-Green Node Pool Upgrades
Architecture Pattern
This is where things get more interesting. The blue-green node pool strategy creates a completely new node pool and lets you migrate workloads in a controlled, deliberate way.
# Step 1: Create green node pool with new version
az aks nodepool add \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name greennp \
--kubernetes-version 1.28.5 \
--node-count 3 \
--labels environment=green deployment=active \
--node-taints deployment=green:NoSchedule # Prevent automatic scheduling
Migration Workflow
# Step 2: Remove taint to allow scheduling
kubectl taint nodes -l environment=green deployment=green:NoSchedule-
# Step 3: Cordon blue nodes
kubectl cordon -l environment=blue
# Step 4: Controlled drain with monitoring
for node in $(kubectl get nodes -l environment=blue -o name); do
echo "Draining $node"
kubectl drain $node \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=120 \
--timeout=10m
# Validate pod distribution after each drain
kubectl get pods -o wide | grep -v "green"
# Pause for observation
sleep 60
done
# Step 5: Verify all workloads running on green
kubectl get pods -o wide --all-namespaces | grep -c greennp
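# Optional safety check: list anything still scheduled on blue nodes before deleting the pool
# (DaemonSet pods will remain until the nodes themselves are removed)
for node in $(kubectl get nodes -l environment=blue -o jsonpath='{.items[*].metadata.name}'); do
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=$node
done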
# Step 6: Delete blue node pool
az aks nodepool delete \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name bluenp
Terraform Implementation
locals {
active_version = "1.28.5" # Target version for the green pool
standby_version = "1.27.9" # Current version running on the blue pool
blue_enabled = true # Set to false only after workloads have drained to green
green_enabled = true # Both pools must coexist during the migration window
}
resource "azurerm_kubernetes_cluster_node_pool" "blue" {
count = local.blue_enabled ? 1 : 0
name = "bluenp"
kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
vm_size = "Standard_D4s_v5"
node_count = 3
orchestrator_version = local.standby_version
node_labels = {
environment = "blue"
deployment = "active"
}
}
resource "azurerm_kubernetes_cluster_node_pool" "green" {
count = local.green_enabled ? 1 : 0
name = "greennp"
kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
vm_size = "Standard_D4s_v5"
node_count = 3
orchestrator_version = local.active_version
node_labels = {
environment = "green"
deployment = "active"
}
node_taints = [
"deployment=green:NoSchedule" # Initial taint for controlled migration
]
}
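Here's a sketch of how I drive that toggle from the pipeline, assuming the standard Terraform CLI workflow and the blue_enabled/green_enabled locals shown above:
# Phase 1: create the green pool alongside blue (blue_enabled = true, green_enabled = true)
terraform plan -out=green-rollout.tfplan
terraform apply green-rollout.tfplan
# Phase 2: after draining blue with the kubectl steps above, set blue_enabled = false and apply again
terraform plan -out=blue-teardown.tfplan
terraform apply blue-teardown.tfplan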
Advanced: Using Node Selectors for Gradual Migration
apiVersion: apps/v1
kind: Deployment
metadata:
name: canary-workload
spec:
replicas: 6
template:
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: environment
operator: In
values:
- green
# Fallback to blue if green not available
tolerations:
- key: deployment
operator: Equal
value: green
effect: NoSchedule
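To see where those canary pods actually land as the migration progresses (this assumes the pod names inherit the canary-workload deployment name, which is the default naming behavior):
# Show which nodes the canary pods were scheduled on
kubectl get pods -o wide | grep canary-workload
# Count how many of them ended up on green nodes
kubectl get pods -o wide | grep canary-workload | grep -c greennp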
Rollback Pattern
# Immediate rollback: Switch traffic back to blue pool
kubectl uncordon -l environment=blue
kubectl cordon -l environment=green
# Re-taint green to force migration back
kubectl taint nodes -l environment=green deployment=green:NoSchedule
# Drain green nodes
kubectl drain -l environment=green \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=60
Cost Implications
Let’s talk money. During the migration window:
- 100% additional capacity (both blue and green pools active)
- Example: 10 D4s_v5 nodes doubled = 20 nodes
- Cost: $0.192/hour × 20 nodes = $3.84/hour
- Migration window: 2-4 hours typical = $7.68-$15.36
Trade-off: Yes, it costs more, but you get zero-risk rollback capability. In my experience, that peace of mind is worth every penny for production systems.
Strategy 3: Blue-Green Cluster Upgrades
When to Use This Approach
Now we’re talking about the heavyweight option. I only recommend the blue-green cluster strategy for:
- Major version upgrades (e.g., 1.26 → 1.28)
- Compliance requirements mandating isolated validation
- Control plane changes requiring API server updates
- Complete infrastructure refresh
Architecture Pattern
# Step 1: Provision green cluster with IaC
az aks create \
--resource-group production-rg \
--name prod-aks-green \
--kubernetes-version 1.28.5 \
--node-count 3 \
--network-plugin azure \
--vnet-subnet-id /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/aks-green-subnet \
--load-balancer-sku standard \
--enable-managed-identity
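Before deploying anything, pull credentials for the green cluster into its own kubeconfig context. The context name here is just a convention that matches the kubectl commands in the next section; if your CLI version doesn't support --context, the default context name is used instead:
# Fetch credentials for the green cluster as a separate context
az aks get-credentials \
--resource-group production-rg \
--name prod-aks-green \
--context prod-aks-green
kubectl config get-contexts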
Complete Terraform Implementation
module "aks_green" {
source = "./modules/aks-cluster"
cluster_name = "prod-aks-green"
kubernetes_version = "1.28.5"
resource_group_name = azurerm_resource_group.aks.name
location = azurerm_resource_group.aks.location
subnet_id = azurerm_subnet.aks_green.id
default_node_pool = {
name = "system"
node_count = 3
vm_size = "Standard_D4s_v5"
}
# Copy identity and RBAC from blue cluster
identity_type = "SystemAssigned"
# Network configuration
network_plugin = "azure"
load_balancer_sku = "standard"
outbound_type = "loadBalancer"
tags = {
Environment = "Production"
Cluster = "Green"
}
}
# Traffic Manager for gradual cutover
resource "azurerm_traffic_manager_profile" "aks" {
name = "aks-traffic-manager"
resource_group_name = azurerm_resource_group.aks.name
traffic_routing_method = "Weighted"
dns_config {
relative_name = "prod-aks"
ttl = 60
}
monitor_config {
protocol = "HTTPS"
port = 443
path = "/healthz"
}
}
resource "azurerm_traffic_manager_endpoint" "blue" {
name = "blue-cluster"
resource_group_name = azurerm_resource_group.aks.name
profile_name = azurerm_traffic_manager_profile.aks.name
type = "azureEndpoints"
target_resource_id = azurerm_public_ip.aks_blue_lb.id
weight = var.blue_weight # Start 100, decrease to 0
}
resource "azurerm_traffic_manager_endpoint" "green" {
name = "green-cluster"
resource_group_name = azurerm_resource_group.aks.name
profile_name = azurerm_traffic_manager_profile.aks.name
type = "azureEndpoints"
target_resource_id = azurerm_public_ip.aks_green_lb.id
weight = var.green_weight # Start 0, increase to 100
}
Migration Strategy
# Step 1: Deploy applications to green cluster
kubectl config use-context prod-aks-green
kubectl apply -f ./manifests/
# Step 2: Validate green cluster
kubectl get pods --all-namespaces
kubectl run test-pod --image=curlimages/curl --rm -it -- \
curl -v http://internal-service.default.svc.cluster.local
# Step 3: Gradual traffic shift via Traffic Manager
# Shift the first 10% of traffic to green (blue starts at 100, green at 0)
terraform apply -var="blue_weight=90" -var="green_weight=10"
# Monitor metrics and logs
# Repeat traffic shift in increments: 80/20, 50/50, 20/80, 0/100
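While stepping through those weights, I keep a simple probe running against the Traffic Manager DNS name. The FQDN below is derived from the relative_name in the profile, so adjust it to your own setup:
# Poll the health endpoint through Traffic Manager and watch which backend answers
while true; do
curl -s -o /dev/null -w "%{http_code} %{remote_ip}\n" https://prod-aks.trafficmanager.net/healthz
sleep 5
done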
Data Migration Considerations
For stateful workloads:
# Use VolumeSnapshot for persistent data
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: mysql-snapshot
spec:
volumeSnapshotClassName: csi-azuredisk-vsc
source:
persistentVolumeClaimName: mysql-pvc
---
# Restore in green cluster
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mysql-pvc
spec:
dataSource:
name: mysql-snapshot
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
storageClassName: managed-csi-premium
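Before applying the restore PVC in the green cluster, confirm the snapshot is actually ready in the blue cluster. This assumes the CSI snapshot controller and VolumeSnapshot CRDs are available in the cluster:
# In the blue cluster: check that the snapshot is ready to use before restoring in green
kubectl get volumesnapshot mysql-snapshot -o jsonpath='{.status.readyToUse}'
# Expected output: true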
Rollback Pattern
# Immediate rollback: Shift all traffic back to blue
terraform apply -var="blue_weight=100" -var="green_weight=0"
# Or via Azure CLI
az network traffic-manager endpoint update \
--resource-group production-rg \
--profile-name aks-traffic-manager \
--name blue-cluster \
--type azureEndpoints \
--weight 100
az network traffic-manager endpoint update \
--resource-group production-rg \
--profile-name aks-traffic-manager \
--name green-cluster \
--type azureEndpoints \
--weight 0
Cost Implications
I won’t sugarcoat it: this approach is expensive. During the migration window:
- 100% infrastructure duplication (clusters, load balancers, IPs)
- Example: 10-node cluster = 20 nodes + 2 load balancers + 2 public IPs
- Cost: (~$3.84/hour for nodes) + ($0.025/hour × 2 LBs) + ($0.004/hour × 2 IPs)
- Total: ~$4/hour
- Migration window: 1-7 days typical = $96-$672
Trade-off: It’s the most expensive option, but you get complete isolation and validation capability. For major upgrades or highly regulated environments, it’s often the only responsible choice.
Decision Matrix
| Criteria | Surge Upgrade | Blue-Green Node Pool | Blue-Green Cluster |
|---|---|---|---|
| Cost Impact | Low ($0.38) | Medium ($7-15) | High ($96-672) |
| Downtime Risk | Low-Medium | Very Low | None |
| Rollback Speed | Slow (30-60 min) | Fast (5-10 min) | Instant |
| Validation Window | None | Limited | Complete |
| Complexity | Low | Medium | High |
| Use Case | Patch upgrades | Minor upgrades | Major upgrades |
| Skill Level Required | Basic | Intermediate | Advanced |
Production Recommendations
So which strategy should you use? Here’s what I recommend based on real-world experience:
For Patch Upgrades (1.27.5 → 1.27.9)
- Use: Surge upgrade with 33% max-surge
- Reasoning: These are usually just security patches with minimal API changes. The risk is low enough that surge works great.
For Minor Upgrades (1.27 → 1.28)
- Use: Blue-Green node pool
- Reasoning: You might hit API deprecations, and you’ll want that validation window before fully committing.
For Major Upgrades (1.26 → 1.28)
- Use: Blue-Green cluster
- Reasoning: Skipping versions means significant API changes. You need that complete isolation to validate everything works before cutting over.
Pre-Upgrade Checklist
Regardless of strategy:
# 1. Check deprecated APIs
kubectl-convert --output-version apps/v1 -f old-deployment.yaml
# 2. Verify PodDisruptionBudgets
kubectl get pdb --all-namespaces
# 3. Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
# 4. Review AKS release notes
az aks get-versions --location eastus --output table
# 5. Backup cluster state
kubectl cluster-info dump --output-directory=/backup/$(date +%Y%m%d)
# 6. Test in non-production first
# 7. Schedule during maintenance window
# 8. Ensure monitoring and alerting active
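# 9. Confirm the available upgrade path for the cluster
az aks get-upgrades \
--resource-group production-rg \
--name prod-aks-cluster \
--output table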
Conclusion
Choosing the right upgrade strategy isn’t about picking the “best” option; it’s about matching the approach to your specific situation. What’s your risk tolerance? What’s your budget? How much complexity can your operations team handle?
In my experience, most production environments benefit from a layered approach: surge for patches, blue-green node pools for minor upgrades, and blue-green clusters for major version changes or compliance-critical scenarios. Once you understand the technical mechanics, cost implications, and rollback patterns of each strategy, you can make informed decisions that balance operational safety with resource efficiency.
The worst upgrade is the one you avoid doing. Pick the right strategy for your context, prepare thoroughly, and you’ll be fine.
About StriveNimbus: We specialize in Azure Kubernetes Service architecture, migration strategies, and operational excellence. Our team has executed hundreds of AKS upgrades across diverse production environments. Contact us for AKS upgrade planning and implementation support.