AKS Upgrade Strategy: Choosing Between Surge, Blue-Green Node Pool, and Blue-Green Cluster Approaches
A comprehensive technical comparison of AKS upgrade strategies including surge upgrades, blue-green node pool migrations, and blue-green cluster approaches with real-world implementation patterns.
Upgrading Azure Kubernetes Service (AKS) clusters is one of those tasks that keeps platform engineers up at night. I’ve seen too many upgrades go sideways because teams didn’t choose the right strategy for their situation. The approach you pick really depends on your workload characteristics, compliance requirements, and how much risk you’re willing to take. In this article, I’ll walk you through a deep technical comparison of three primary upgrade strategies we’ve used in production environments.
Understanding the Upgrade Challenge
Let me start with why we can’t just ignore upgrades. AKS clusters need regular updates for:
- Security patches: CVE remediation in Kubernetes components
- Feature access: New Kubernetes APIs and capabilities
- Support compliance: Microsoft supports only the latest three minor versions
- Performance improvements: Bug fixes and optimization
Here’s the thing: the real challenge isn’t just running the upgrade. It’s doing it without impacting running workloads, maintaining SLAs, and having a solid rollback plan if things go wrong.
Strategy 1: Surge-Based Node Pool Upgrades
How It Works
This is AKS’s default upgrade mechanism, and honestly, it’s what most teams start with. The --max-surge parameter controls how many additional nodes get provisioned during a rolling upgrade.
# Upgrade with surge configuration
az aks nodepool upgrade \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name systemnp \
--kubernetes-version 1.28.5 \
--max-surge 33%
Upgrade Flow:
graph TD
A[Start Upgrade] --> B[Calculate Surge Nodes]
B --> C[Provision New Nodes<br/>with Target K8s Version]
C --> D[Cordon Old Nodes<br/>Prevent New Scheduling]
D --> E[Drain Old Nodes<br/>Respect PDBs]
E --> F[Delete Old Nodes]
F --> G{All Nodes<br/>Upgraded?}
G -->|No| B
G -->|Yes| H[Upgrade Complete]
style A fill:#e1f5ff
style H fill:#d4edda
style C fill:#fff3cd
style E fill:#fff3cd
Upgrade Process Steps:
- Calculate surge nodes: ceil(current_node_count * max_surge_percentage)
- Provision new nodes with target Kubernetes version
- Cordon old nodes (prevent new pod scheduling)
- Drain old nodes (evict pods with PodDisruptionBudget awareness)
- Delete old nodes
- Repeat until all nodes upgraded
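While a surge upgrade is running, I like to watch it from two angles. Here's a minimal monitoring sketch, reusing the resource names from the example above:
# Watch node versions change as surge nodes join and old nodes drain away
kubectl get nodes -o wide --watch
# In another terminal, poll the node pool state and target version
az aks nodepool show \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name systemnp \
--query "{state: provisioningState, version: orchestratorVersion}" \
--output table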
Max-Surge Configuration Best Practices
# Conservative approach for production (10-20%)
az aks nodepool upgrade \
--max-surge 10% \
--kubernetes-version 1.28.5
# Balanced approach (25-33%)
az aks nodepool upgrade \
--max-surge 33% \
--kubernetes-version 1.28.5
# Aggressive approach for non-critical environments (50-100%)
az aks nodepool upgrade \
--max-surge 100% \
--kubernetes-version 1.28.5
Max-Surge Selection Matrix:
| Environment | Max-Surge | Reasoning |
|---|---|---|
| Production (critical) | 10-20% | Minimizes simultaneous node count changes |
| Production (standard) | 25-33% | Balanced speed vs. stability |
| Staging | 50% | Faster upgrades, acceptable risk |
| Development | 100% | Maximum speed |
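If you want the surge setting to persist between upgrades instead of passing it on every upgrade command, recent Azure CLI versions let you set it on the node pool directly. A quick sketch, using the same names as above:
# Persist the surge setting on the node pool (applies to future upgrades)
az aks nodepool update \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name systemnp \
--max-surge 33%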
Terraform Implementation
resource "azurerm_kubernetes_cluster_node_pool" "system" {
name = "systemnp"
kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
vm_size = "Standard_D4s_v5"
node_count = 3
# Upgrade settings
upgrade_settings {
max_surge = "33%" # or use absolute value like "1"
}
# Keep nodes off the public internet
enable_node_public_ip = false
tags = {
Environment = "Production"
ManagedBy = "Terraform"
}
}
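To confirm what surge value a pool will actually use during the next upgrade, you can read the setting back. A quick check, assuming the names above:
# Inspect the effective upgrade settings on the node pool
az aks nodepool show \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name systemnp \
--query upgradeSettings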
Downtime Risk Analysis
Now, here’s where it gets tricky. Several things can go wrong during a surge upgrade:
- Pod eviction failures: If pods don’t respect SIGTERM or have long grace periods
- PodDisruptionBudget violations: Can block drain operations
- Image pull delays: New nodes pulling large images
- Application startup time: Slow readiness probes delay traffic routing
I’ve learned the hard way that you need to prepare for these scenarios. Here are the mitigation strategies we use:
# 1. Configure proper PodDisruptionBudgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: app-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: critical-service
---
# 2. Implement proper lifecycle hooks
apiVersion: apps/v1
kind: Deployment
metadata:
name: critical-service
spec:
replicas: 3
selector:
  matchLabels:
    app: critical-service
template:
  metadata:
    labels:
      app: critical-service
  spec:
containers:
- name: app
image: myapp:v1.0.0
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"] # Allow time for connection draining
terminationGracePeriodSeconds: 60
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
---
# 3. Configure topology spread constraints
apiVersion: apps/v1
kind: Deployment
metadata:
name: critical-service
spec:
template:
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: critical-service
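Before kicking off any drain, it's also worth checking whether a PodDisruptionBudget currently allows zero disruptions, because those are exactly the ones that stall upgrades. A small check, assuming jq is available on your workstation:
# List PDBs that currently allow zero disruptions (these will block node drains)
kubectl get pdb --all-namespaces -o json | \
jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'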
Rollback Pattern
Here’s something that surprises people: surge upgrades don’t support automatic rollback. If something goes wrong, you’ll need to intervene manually:
# Identify problematic version
kubectl get nodes -o wide
# Add new node pool with previous version
az aks nodepool add \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name rollbacknp \
--kubernetes-version 1.27.9 \
--node-count 3
# Cordon upgraded nodes
kubectl cordon -l agentpool=systemnp
# Drain upgraded nodes
kubectl drain -l agentpool=systemnp \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=60
# Delete problematic node pool
az aks nodepool delete \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name systemnp
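Once the rollback pool is in place, verify that workloads actually rescheduled onto it before calling the incident resolved. A quick sanity check:
# Count pods now running on the rollback pool and confirm its nodes are Ready
kubectl get pods --all-namespaces -o wide | grep -c rollbacknp
kubectl get nodes -l agentpool=rollbacknp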
Cost Implications
During Upgrade:
- Surge of 33% on 10-node pool = 3.3 → 4 additional nodes
- Cost window: Duration of upgrade (typically 15-45 minutes)
- Example: D4s_v5 at $0.192/hour × 4 nodes × 0.5 hours = $0.38
Trade-off: A higher surge runs more temporary nodes in parallel but finishes sooner, so the total surge cost often ends up similar or lower; the real price is more simultaneous disruption to workloads.
Strategy 2: Blue-Green Node Pool Upgrades
Architecture Pattern
This is where things get more interesting. The blue-green node pool strategy creates a completely new node pool and lets you migrate workloads in a controlled, deliberate way.
# Step 1: Create green node pool with new version
az aks nodepool add \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name greennp \
--kubernetes-version 1.28.5 \
--node-count 3 \
--labels environment=green deployment=active \
--node-taints deployment=green:NoSchedule # Prevent automatic scheduling
Migration Workflow
# Step 2: Remove taint to allow scheduling
kubectl taint nodes -l environment=green deployment=green:NoSchedule-
# Step 3: Cordon blue nodes
kubectl cordon -l environment=blue
# Step 4: Controlled drain with monitoring
for node in $(kubectl get nodes -l environment=blue -o name); do
echo "Draining $node"
kubectl drain $node \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=120 \
--timeout=10m
# Validate pod distribution after each drain
kubectl get pods -o wide | grep -v "green"
# Pause for observation
sleep 60
done
# Step 5: Verify all workloads running on green
kubectl get pods -o wide --all-namespaces | grep -c greennp
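# Optional safety check: list anything still scheduled on blue nodes before deleting the pool
# (DaemonSet pods will remain until the nodes themselves are removed)
for node in $(kubectl get nodes -l environment=blue -o jsonpath='{.items[*].metadata.name}'); do
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=$node
done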
# Step 6: Delete blue node pool
az aks nodepool delete \
--resource-group production-rg \
--cluster-name prod-aks-cluster \
--name bluenp
Terraform Implementation
locals {
active_version = "1.28.5" # Target version for the green pool
standby_version = "1.27.9" # Current version running on the blue pool
blue_enabled = true # Set to false only after workloads have drained to green
green_enabled = true # Both pools must coexist during the migration window
}
resource "azurerm_kubernetes_cluster_node_pool" "blue" {
count = local.blue_enabled ? 1 : 0
name = "bluenp"
kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
vm_size = "Standard_D4s_v5"
node_count = 3
orchestrator_version = local.standby_version
node_labels = {
environment = "blue"
deployment = "active"
}
}
resource "azurerm_kubernetes_cluster_node_pool" "green" {
count = local.green_enabled ? 1 : 0
name = "greennp"
kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
vm_size = "Standard_D4s_v5"
node_count = 3
orchestrator_version = local.active_version
node_labels = {
environment = "green"
deployment = "active"
}
node_taints = [
"deployment=green:NoSchedule" # Initial taint for controlled migration
]
}
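Here's a sketch of how I drive that toggle from the pipeline, assuming the standard Terraform CLI workflow and the blue_enabled/green_enabled locals shown above:
# Phase 1: create the green pool alongside blue (blue_enabled = true, green_enabled = true)
terraform plan -out=green-rollout.tfplan
terraform apply green-rollout.tfplan
# Phase 2: after draining blue with the kubectl steps above, set blue_enabled = false and apply again
terraform plan -out=blue-teardown.tfplan
terraform apply blue-teardown.tfplan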
Advanced: Using Node Selectors for Gradual Migration
apiVersion: apps/v1
kind: Deployment
metadata:
name: canary-workload
spec:
replicas: 6
template:
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: environment
operator: In
values:
- green
# Fallback to blue if green not available
tolerations:
- key: deployment
operator: Equal
value: green
effect: NoSchedule
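To see where those canary pods actually land as the migration progresses (this assumes the pod names inherit the canary-workload deployment name, which is the default naming behavior):
# Show which nodes the canary pods were scheduled on
kubectl get pods -o wide | grep canary-workload
# Count how many of them ended up on green nodes
kubectl get pods -o wide | grep canary-workload | grep -c greennp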
Rollback Pattern
# Immediate rollback: Switch traffic back to blue pool
kubectl uncordon -l environment=blue
kubectl cordon -l environment=green
# Re-taint green to force migration back
kubectl taint nodes -l environment=green deployment=green:NoSchedule
# Drain green nodes
kubectl drain -l environment=green \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=60
Cost Implications
Let’s talk money. During the migration window:
- 100% additional capacity (both blue and green pools active)
- Example: 10 D4s_v5 nodes doubled = 20 nodes
- Cost: $0.192/hour × 20 nodes = $3.84/hour
- Migration window: 2-4 hours typical = $7.68-$15.36
Trade-off: Yes, it costs more, but you get zero-risk rollback capability. In my experience, that peace of mind is worth every penny for production systems.
Strategy 3: Blue-Green Cluster Upgrades
When to Use This Approach
Now we’re talking about the heavyweight option. I only recommend the blue-green cluster strategy for:
- Major version upgrades (e.g., 1.26 → 1.28)
- Compliance requirements mandating isolated validation
- Control plane changes requiring API server updates
- Complete infrastructure refresh
Architecture Pattern
# Step 1: Provision green cluster with IaC
az aks create \
--resource-group production-rg \
--name prod-aks-green \
--kubernetes-version 1.28.5 \
--node-count 3 \
--network-plugin azure \
--vnet-subnet-id /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/aks-green-subnet \
--load-balancer-sku standard \
--enable-managed-identity
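Before deploying anything, pull credentials for the green cluster into its own kubeconfig context. The context name here is just a convention that matches the kubectl commands in the next section; if your CLI version doesn't support --context, the default context name is used instead:
# Fetch credentials for the green cluster as a separate context
az aks get-credentials \
--resource-group production-rg \
--name prod-aks-green \
--context prod-aks-green
kubectl config get-contexts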
Complete Terraform Implementation
module "aks_green" {
source = "./modules/aks-cluster"
cluster_name = "prod-aks-green"
kubernetes_version = "1.28.5"
resource_group_name = azurerm_resource_group.aks.name
location = azurerm_resource_group.aks.location
subnet_id = azurerm_subnet.aks_green.id
default_node_pool = {
name = "system"
node_count = 3
vm_size = "Standard_D4s_v5"
}
# Copy identity and RBAC from blue cluster
identity_type = "SystemAssigned"
# Network configuration
network_plugin = "azure"
load_balancer_sku = "standard"
outbound_type = "loadBalancer"
tags = {
Environment = "Production"
Cluster = "Green"
}
}
# Traffic Manager for gradual cutover
resource "azurerm_traffic_manager_profile" "aks" {
name = "aks-traffic-manager"
resource_group_name = azurerm_resource_group.aks.name
traffic_routing_method = "Weighted"
dns_config {
relative_name = "prod-aks"
ttl = 60
}
monitor_config {
protocol = "HTTPS"
port = 443
path = "/healthz"
}
}
resource "azurerm_traffic_manager_endpoint" "blue" {
name = "blue-cluster"
resource_group_name = azurerm_resource_group.aks.name
profile_name = azurerm_traffic_manager_profile.aks.name
type = "azureEndpoints"
target_resource_id = azurerm_public_ip.aks_blue_lb.id
weight = var.blue_weight # Start 100, decrease to 0
}
resource "azurerm_traffic_manager_endpoint" "green" {
name = "green-cluster"
resource_group_name = azurerm_resource_group.aks.name
profile_name = azurerm_traffic_manager_profile.aks.name
type = "azureEndpoints"
target_resource_id = azurerm_public_ip.aks_green_lb.id
weight = var.green_weight # Start 0, increase to 100
}
Migration Strategy
# Step 1: Deploy applications to green cluster
kubectl config use-context prod-aks-green
kubectl apply -f ./manifests/
# Step 2: Validate green cluster
kubectl get pods --all-namespaces
kubectl run test-pod --image=curlimages/curl --rm -it -- \
curl -v http://internal-service.default.svc.cluster.local
# Step 3: Gradual traffic shift via Traffic Manager
# Shift the first 10% of traffic to green (blue starts at 100, green at 0)
terraform apply -var="blue_weight=90" -var="green_weight=10"
# Monitor metrics and logs
# Repeat traffic shift in increments: 80/20, 50/50, 20/80, 0/100
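While stepping through those weights, I keep a simple probe running against the Traffic Manager DNS name. The FQDN below is derived from the relative_name in the profile, so adjust it to your own setup:
# Poll the health endpoint through Traffic Manager and watch which backend answers
while true; do
curl -s -o /dev/null -w "%{http_code} %{remote_ip}\n" https://prod-aks.trafficmanager.net/healthz
sleep 5
done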
Data Migration Considerations
For stateful workloads:
# Use VolumeSnapshot for persistent data
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: mysql-snapshot
spec:
volumeSnapshotClassName: csi-azuredisk-vsc
source:
persistentVolumeClaimName: mysql-pvc
---
# Restore in green cluster
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mysql-pvc
spec:
dataSource:
name: mysql-snapshot
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
storageClassName: managed-csi-premium
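Before applying the restore PVC in the green cluster, confirm the snapshot is actually ready in the blue cluster. This assumes the CSI snapshot controller and VolumeSnapshot CRDs are available in the cluster:
# In the blue cluster: check that the snapshot is ready to use before restoring in green
kubectl get volumesnapshot mysql-snapshot -o jsonpath='{.status.readyToUse}'
# Expected output: true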
Rollback Pattern
# Immediate rollback: Shift all traffic back to blue
terraform apply -var="blue_weight=100" -var="green_weight=0"
# Or via Azure CLI
az network traffic-manager endpoint update \
--resource-group production-rg \
--profile-name aks-traffic-manager \
--name blue-cluster \
--type azureEndpoints \
--weight 100
az network traffic-manager endpoint update \
--resource-group production-rg \
--profile-name aks-traffic-manager \
--name green-cluster \
--type azureEndpoints \
--weight 0
Cost Implications
I won’t sugarcoat it: this approach is expensive. During the migration window:
- 100% infrastructure duplication (clusters, load balancers, IPs)
- Example: 10-node cluster = 20 nodes + 2 load balancers + 2 public IPs
- Cost: (~$3.84/hour for nodes) + ($0.025/hour × 2 LBs) + ($0.004/hour × 2 IPs)
- Total: ~$4/hour
- Migration window: 1-7 days typical = $96-$672
Trade-off: It’s the most expensive option, but you get complete isolation and validation capability. For major upgrades or highly regulated environments, it’s often the only responsible choice.
Decision Matrix
| Criteria | Surge Upgrade | Blue-Green Node Pool | Blue-Green Cluster |
|---|---|---|---|
| Cost Impact | Low ($0.38) | Medium ($7-15) | High ($96-672) |
| Downtime Risk | Low-Medium | Very Low | None |
| Rollback Speed | Slow (30-60 min) | Fast (5-10 min) | Instant |
| Validation Window | None | Limited | Complete |
| Complexity | Low | Medium | High |
| Use Case | Patch upgrades | Minor upgrades | Major upgrades |
| Skill Level Required | Basic | Intermediate | Advanced |
Production Recommendations
So which strategy should you use? Here’s what I recommend based on real-world experience:
For Patch Upgrades (1.27.5 → 1.27.9)
- Use: Surge upgrade with 33% max-surge
- Reasoning: These are usually just security patches with minimal API changes. The risk is low enough that surge works great.
For Minor Upgrades (1.27 → 1.28)
- Use: Blue-Green node pool
- Reasoning: You might hit API deprecations, and you’ll want that validation window before fully committing.
For Major Upgrades (1.26 → 1.28)
- Use: Blue-Green cluster
- Reasoning: Skipping versions means significant API changes. You need that complete isolation to validate everything works before cutting over.
Pre-Upgrade Checklist
Regardless of strategy:
# 1. Check deprecated APIs
kubectl-convert --output-version apps/v1 -f old-deployment.yaml
# 2. Verify PodDisruptionBudgets
kubectl get pdb --all-namespaces
# 3. Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
# 4. Review AKS release notes
az aks get-versions --location eastus --output table
# 5. Backup cluster state
kubectl cluster-info dump --output-directory=/backup/$(date +%Y%m%d)
# 6. Test in non-production first
# 7. Schedule during maintenance window
# 8. Ensure monitoring and alerting active
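# 9. Confirm the available upgrade path for the cluster
az aks get-upgrades \
--resource-group production-rg \
--name prod-aks-cluster \
--output table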
Conclusion
Choosing the right upgrade strategy isn’t about picking the “best” option; it’s about matching the approach to your specific situation. What’s your risk tolerance? What’s your budget? How much complexity can your operations team handle?
In my experience, most production environments benefit from a layered approach: surge for patches, blue-green node pools for minor upgrades, and blue-green clusters for major version changes or compliance-critical scenarios. Once you understand the technical mechanics, cost implications, and rollback patterns of each strategy, you can make informed decisions that balance operational safety with resource efficiency.
The worst upgrade is the one you avoid doing. Pick the right strategy for your context, prepare thoroughly, and you’ll be fine.
About StriveNimbus: We specialize in Azure Kubernetes Service architecture, migration strategies, and operational excellence. Our team has executed hundreds of AKS upgrades across diverse production environments. Contact us for AKS upgrade planning and implementation support.