Reducing AKS Compute Costs by 25% with Karpenter and Node Auto Provisioning

How StriveNimbus helped a financial services client achieve 25% cost reduction in AKS by replacing static node pools with Karpenter-based dynamic provisioning and spot instances.

Executive Summary

A mid-sized financial services company came to us with a problem I see all the time: they were running critical applications on Azure Kubernetes Service (AKS) and hemorrhaging money due to over-provisioned static node pools. We implemented Karpenter with Azure’s Node Auto Provisioning (NAP) capabilities and helped them achieve a 25% reduction in monthly compute costs—that’s roughly $18,000/month in savings—while actually improving their scaling performance and resource utilization.

Key Outcomes:

  • 25% reduction in AKS compute costs ($72K → $54K monthly)
  • Average CPU utilization improved from 38% to 67%
  • Node scaling time reduced from 5-8 minutes to 45-90 seconds
  • Zero application disruption during migration
  • Established foundation for spot instance adoption (additional 60-70% savings on eligible workloads)

Client Background

Industry: Financial Services

AKS Workload: Trading analytics platform, risk management systems, customer portals

Infrastructure Scale:

  • 3 AKS clusters (dev, staging, production)
  • 45-60 nodes across production
  • ~800 pods during peak hours
  • Mixed workload types (batch jobs, APIs, data processing)

Pain Points:

The classic over-provisioning trap:

  • Static node pools sized for peak loads (running 24/7)
  • Low average CPU utilization (35-40%)
  • Scaling events taking 5-8 minutes (way too slow)
  • Monthly AKS compute spend: $72,000
  • Capacity planning required manual intervention and guesswork

Baseline Architecture

Initial Setup: Static Node Pools

When we first looked at their setup, it was textbook over-provisioning. Their architecture relied entirely on manually configured node pools:

# Terraform configuration - baseline setup
resource "azurerm_kubernetes_cluster" "prod" {
  name                = "prod-aks-cluster"
  resource_group_name = azurerm_resource_group.aks.name
  location            = "eastus"
  kubernetes_version  = "1.27.7"

  default_node_pool {
    name                = "system"
    vm_size            = "Standard_D4s_v5"
    node_count         = 3
    enable_auto_scaling = true
    min_count          = 3
    max_count          = 5
  }
}

# User node pools - over-provisioned for peak capacity
resource "azurerm_kubernetes_cluster_node_pool" "apps" {
  name                  = "apps"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.prod.id
  vm_size              = "Standard_D8s_v5"

  enable_auto_scaling = true
  min_count          = 12  # Provisioned for peak load
  max_count          = 20

  node_labels = {
    workload = "applications"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "batch" {
  name                  = "batch"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.prod.id
  vm_size              = "Standard_F16s_v2"  # Compute-optimized

  enable_auto_scaling = true
  min_count          = 8
  max_count          = 15

  node_labels = {
    workload = "batch-processing"
  }
}

Observed Metrics (Baseline - 30-day average):

Metric                     Value
Total nodes (avg)          48
Total nodes (peak)         62
CPU utilization (avg)      38%
Memory utilization (avg)   42%
Monthly compute cost       $72,000
Scale-up time (p95)        7.2 minutes
Wasted capacity            ~$27,000/month

Solution Architecture: Karpenter with Node Auto Provisioning

Karpenter Overview

If you haven’t worked with Karpenter before, think of it as a smarter cluster autoscaler. Instead of scaling pre-defined node pools, it provisions just-in-time compute resources based on what your pods actually need.

Key Capabilities:

  • Pod-driven provisioning: Launches nodes based on pending pod requirements
  • Bin-packing optimization: Efficiently places pods to minimize node count
  • Diverse instance selection: Chooses optimal VM sizes from allowed types
  • Consolidation: Automatically removes underutilized nodes
  • Fast provisioning: Directly calls Azure APIs (bypasses node pool scaling)

What surprised me most when I first used Karpenter was how much faster it provisions nodes compared to traditional cluster autoscaling. We’re talking 45-90 seconds instead of 5-8 minutes.
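A quick way to see this pod-driven flow for yourself is to scale a workload beyond what the current nodes can hold and watch Karpenter react. A minimal smoke test looks roughly like this (the deployment name is illustrative):

# Scale a workload beyond current capacity and watch Karpenter provision nodes
kubectl scale deployment/sample-api --replicas=50

# Pods the scheduler cannot place show up as Pending...
kubectl get pods --field-selector=status.phase=Pending

# ...and new Karpenter-managed nodes appear shortly afterwards
kubectl get nodes -l karpenter.sh/nodepool --watch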

Architecture Diagram

graph TD
    subgraph AKS["AKS Cluster"]
        Pods["Pending Pods<br/>(Unschedulable)"]
        Karpenter["Karpenter Controller<br/>(Watches scheduler)"]
        NodePool["NodePool CRD<br/>• VM families<br/>• Spot/On-demand<br/>• Constraints"]
        Pods -->|"1. Scheduling fails"| Karpenter
        Karpenter -->|"2. Evaluates requirements"| NodePool
    end
    NodePool -->|"3. Selects optimal instance type"| Azure["Azure APIs<br/>• VM provisioning<br/>• Spot allocation"]
    Azure -->|"4. Provisions"| Nodes["Provisioned Nodes<br/>• D4s_v5, D8s_v5<br/>• F8s_v2, F16s_v2<br/>• Spot instances"]
    Nodes -.->|"5. Pods scheduled"| Pods
    Karpenter -.->|"Consolidation"| Nodes
    Karpenter -.->|"Deprovisioning"| Nodes
    style AKS fill:#e1f5ff
    style Azure fill:#fff3cd
    style Nodes fill:#d4edda
    style Pods fill:#f8d7da

Implementation Approach

Phase 1: Karpenter Installation (Week 1)

Install Karpenter via Helm

# Add Karpenter Helm repository
helm repo add karpenter https://charts.karpenter.sh
helm repo update

# Create karpenter namespace
kubectl create namespace karpenter

# Install Karpenter
helm install karpenter karpenter/karpenter \
  --namespace karpenter \
  --set controller.clusterName=prod-aks-cluster \
  --set controller.clusterEndpoint=$(az aks show \
    --resource-group production-rg \
    --name prod-aks-cluster \
    --query fqdn -o tsv) \
  --set serviceAccount.create=true \
  --version 0.31.1
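Before going further, it's worth confirming the controller is healthy. Assuming the chart installs a deployment named karpenter (as it did in our environment), a quick check looks like:

# Verify the Karpenter controller pods are running
kubectl get pods -n karpenter

# Tail the controller logs to confirm it can reach the cluster and Azure APIs
kubectl logs -n karpenter deployment/karpenter --follow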

Terraform Implementation

# Karpenter installation via Helm
resource "helm_release" "karpenter" {
  name       = "karpenter"
  repository = "https://charts.karpenter.sh"
  chart      = "karpenter"
  namespace  = kubernetes_namespace.karpenter.metadata[0].name
  version    = "0.31.1"

  set {
    name  = "controller.clusterName"
    value = azurerm_kubernetes_cluster.prod.name
  }

  set {
    name  = "controller.clusterEndpoint"
    value = azurerm_kubernetes_cluster.prod.fqdn
  }

  set {
    name  = "serviceAccount.annotations.azure\\.workload\\.identity/client-id"
    value = azurerm_user_assigned_identity.karpenter.client_id
  }
}

# Managed identity for Karpenter
resource "azurerm_user_assigned_identity" "karpenter" {
  name                = "karpenter-identity"
  resource_group_name = azurerm_resource_group.aks.name
  location            = azurerm_resource_group.aks.location
}

# Grant permissions to manage VMs
resource "azurerm_role_assignment" "karpenter_vm_contributor" {
  scope                = azurerm_resource_group.aks.id
  role_definition_name = "Virtual Machine Contributor"
  principal_id         = azurerm_user_assigned_identity.karpenter.principal_id
}

Phase 2: NodePool Configuration (Week 2)

We created three Karpenter NodePool CRDs, each optimized for a different workload type:

General Purpose NodePool

# karpenter-nodepool-general.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose
spec:
  # Template for provisioned nodes
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]  # Start with on-demand for stability
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - Standard_D4s_v5
        - Standard_D8s_v5
        - Standard_D16s_v5
      taints: []
      labels:
        workload: general

  # Limits
  limits:
    cpu: "200"
    memory: 800Gi

  # Consolidation settings
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
    expireAfter: 720h  # 30 days

  # Weight for prioritization
  weight: 10

Compute-Optimized NodePool (Batch Workloads)

# karpenter-nodepool-compute.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: compute-optimized
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]  # Use spot for batch workloads
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - Standard_F8s_v2
        - Standard_F16s_v2
        - Standard_F32s_v2
      taints:
      - key: workload
        value: batch
        effect: NoSchedule
      labels:
        workload: batch

  limits:
    cpu: "500"

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
    expireAfter: 168h  # 7 days for short-lived batch jobs

  weight: 20

Memory-Optimized NodePool

# karpenter-nodepool-memory.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: memory-optimized
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - Standard_E4s_v5
        - Standard_E8s_v5
        - Standard_E16s_v5
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      taints:
      - key: workload
        value: memory-intensive
        effect: NoSchedule
      labels:
        workload: memory

  limits:
    memory: 1000Gi

  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 120s

  weight: 15

NodeClass Configuration

# karpenter-nodeclass.yaml
apiVersion: karpenter.azure.com/v1alpha2
kind: AKSNodeClass
metadata:
  name: default
spec:
  imageFamily: Ubuntu2204  # OS image
  osDiskSizeGB: 128

  # Subnet for node provisioning
  subnetID: /subscriptions/{sub-id}/resourceGroups/production-rg/providers/Microsoft.Network/virtualNetworks/aks-vnet/subnets/aks-nodes

  # Tags for cost allocation
  tags:
    ManagedBy: Karpenter
    Environment: Production
    CostCenter: Engineering
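The NodePools and the NodeClass are ordinary custom resources, so applying and inspecting them works like any other CRD. A rough sketch, using the filenames from the examples above (the aksnodeclasses plural is how the CRD registered in our cluster; adjust if yours differs):

# Apply the NodeClass first, then the NodePools that reference it
kubectl apply -f karpenter-nodeclass.yaml
kubectl apply -f karpenter-nodepool-general.yaml \
              -f karpenter-nodepool-compute.yaml \
              -f karpenter-nodepool-memory.yaml

# Confirm the resources were accepted
kubectl get nodepools
kubectl get aksnodeclasses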

Phase 3: Workload Migration (Weeks 3-4)

Here’s where patience pays off. We didn’t rush this—we migrated workloads gradually to validate Karpenter behavior and catch any issues early:

Step 1: Add Node Affinity to Deployments

# Example: Migrate batch processing workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  template:
    spec:
      # Tolerate Karpenter-provisioned nodes
      tolerations:
      - key: workload
        operator: Equal
        value: batch
        effect: NoSchedule

      # Prefer Karpenter nodes
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: karpenter.sh/nodepool
                operator: In
                values:
                - compute-optimized

      containers:
      - name: processor
        image: batch-processor:v2.1.0
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
          limits:
            cpu: "4"
            memory: 8Gi

Step 2: Cordon Static Node Pool

# Cordon the static node pool so no new pods are scheduled onto it
kubectl cordon -l agentpool=batch

# Drain existing pods off the static pool (respects PodDisruptionBudgets)
kubectl drain -l agentpool=batch --ignore-daemonsets --delete-emptydir-data

# Monitor pod migration to Karpenter nodes
watch kubectl get pods -o wide --field-selector=status.phase=Running

Step 3: Validate and Scale Down

# After successful migration, reduce static node pool min count
az aks nodepool update \
  --resource-group production-rg \
  --cluster-name prod-aks-cluster \
  --name batch \
  --min-count 0 \
  --max-count 5

# Eventually delete static pool
az aks nodepool delete \
  --resource-group production-rg \
  --cluster-name prod-aks-cluster \
  --name batch
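Before running that final delete, we confirmed the static pool was actually empty so nothing was still scheduled there. Something like the following (the aks-batch node-name prefix follows AKS's usual VMSS naming; adjust for your pool):

# List any remaining nodes in the static batch pool
kubectl get nodes -l agentpool=batch

# Double-check no application pods are still running on those nodes
kubectl get pods --all-namespaces -o wide | grep aks-batch || echo "pool is empty"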

Phase 4: Spot Instance Integration (Week 5)

Once we had Karpenter running smoothly, we took it a step further by enabling spot instances for fault-tolerant workloads:

# Update NodePool to allow spot
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: compute-optimized
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]  # Allow both

Spot Instance Adoption Strategy:

# Add toleration for spot interruption
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor-spot
spec:
  template:
    spec:
      tolerations:
      - key: karpenter.sh/capacity-type
        operator: Equal
        value: spot
        effect: NoSchedule

      # Handle spot interruptions gracefully
      terminationGracePeriodSeconds: 120

      containers:
      - name: processor
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "save-checkpoint.sh"]

Results and Impact

Cost Savings Breakdown

Let’s talk numbers. The results were better than we initially projected:

Before Karpenter:

  • Base node pools: 48 nodes × $0.192/hour (D8s_v5 avg) = $9.22/hour
  • Peak capacity: 62 nodes × $0.192/hour = $11.90/hour
  • Average monthly cost: $72,000

After Karpenter (On-Demand):

  • Average nodes: 35 (better bin-packing)
  • Average hourly cost: $6.72/hour (smaller instance mix)
  • Monthly cost: $54,000
  • Savings: $18,000/month (25%)

With Spot Instances (35% of workload):

  • Spot discount: ~70% on eligible workloads
  • Additional savings: $8,500/month
  • Total monthly cost: $45,500
  • Total savings: $26,500/month (37%)

That’s real money that went straight back into the business.

Performance Metrics

Metric                   Before Karpenter   After Karpenter   Improvement
Avg CPU utilization      38%                67%               +76% efficiency
Avg memory utilization   42%                71%               +69% efficiency
Node count (avg)         48                 35                -27% nodes
Scale-up time (p95)      7.2 min            1.5 min           79% faster
Monthly cost             $72,000            $54,000           -25% cost
Wasted capacity          $27,000/month      $8,000/month      -70% waste

Observability Improvements

We implemented comprehensive monitoring using Prometheus and Azure Monitor:

# Prometheus ServiceMonitor for Karpenter metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
  - port: http-metrics
    interval: 30s

Key Metrics Tracked:

# Node provisioning latency
histogram_quantile(0.95,
  rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])
)

# Consolidation savings
sum(karpenter_deprovisioning_actions_performed_total)

# Pending pod count
sum(karpenter_provisioner_pending_pods_total)

# Cost per workload (custom metric)
sum by (workload) (
  node_cpu_hourly_cost * on(node) group_left(workload)
  kube_node_labels{label_workload!=""}
)

Azure Monitor Dashboard:

# Query the AKS node-utilization metrics that feed the custom dashboard
az monitor metrics list \
  --resource /subscriptions/{sub}/resourceGroups/production-rg/providers/Microsoft.ContainerService/managedClusters/prod-aks-cluster \
  --metric "node_cpu_usage_percentage" "node_memory_usage_percentage"

Lessons Learned

Let me share some hard-won lessons from this implementation.

1. Gradual Migration is Critical

Learning: I can’t emphasize this enough—switching all workloads to Karpenter at once is asking for trouble.

Approach: We migrated 10% of workloads weekly, validated metrics, then moved on to the next batch. Slow and steady wins the race here.

2. Right-Size Resource Requests

Challenge: Here’s something we discovered early on—over-requested CPU/memory completely defeated Karpenter’s bin-packing optimization. Garbage in, garbage out.

Solution: We analyzed actual resource usage via Vertical Pod Autoscaler (VPA) recommendations:

# Install VPA in recommendation mode
kubectl apply -f https://github.com/kubernetes/autoscaler/releases/download/vertical-pod-autoscaler-0.14.0/vpa-v0.14.0.yaml

# Create VPA for analysis
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Off"  # Recommendation only

Reduced requests by 30% on average after VPA analysis.
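The recommendations themselves can be read straight off the VPA object once it has observed some traffic, for example:

# Inspect the VPA's target/lower/upper bound recommendations per container
kubectl describe vpa app-vpa

# Or pull just the recommendation block as JSON
kubectl get vpa app-vpa -o jsonpath='{.status.recommendation.containerRecommendations}'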

3. Spot Interruption Handling

Challenge: In our initial spot instance rollout, about 5% of batch jobs failed due to spot interruptions. Not ideal.

Solution: We implemented proper checkpointing and retry logic:

# Python batch job with checkpointing
import json
import signal
import sys

# In-memory job state, updated as the batch job progresses
current_state = {}

def save_checkpoint(state):
    # Persist progress to the checkpoint volume mounted at /checkpoint
    with open('/checkpoint/state.json', 'w') as f:
        json.dump(state, f)

def signal_handler(sig, frame):
    # SIGTERM arrives when Kubernetes evicts the pod from a reclaimed spot node
    print('Spot interruption detected, saving checkpoint...')
    save_checkpoint(current_state)
    sys.exit(0)

signal.signal(signal.SIGTERM, signal_handler)

4. NodePool Limits Prevent Runaway Costs

Learning: This one’s important—always set conservative limits on Karpenter NodePools. I’ve seen runaway scaling events that would give any CFO a heart attack.

Best Practice: Start with limits 20% above your peak observed capacity. You can always increase them later if needed.
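As a rough worked example: the baseline peaked at 62 nodes, which with the D8s_v5 (8 vCPU) and F16s_v2 (16 vCPU) mix works out to somewhere in the 500-700 vCPU range, so combined NodePool cpu limits of 700 (200 general-purpose + 500 compute-optimized) land close to peak plus headroom. The configured limit is easy to check at any time:

# Read the provisioning limit currently set on a NodePool
kubectl get nodepool general-purpose -o jsonpath='{.spec.limits.cpu}'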

5. Taints and Tolerations for Workload Isolation

Challenge: High-priority workloads competed for nodes with batch jobs.

Solution: Used taints/tolerations to isolate workload classes:

# Critical workload - dedicated nodes
tolerations:
- key: workload
  operator: Equal
  value: critical
  effect: NoSchedule

Future Optimization Roadmap

Q1 2025: Multi-Zone Provisioning

Distribute nodes across availability zones for resilience:

spec:
  template:
    spec:
      requirements:
      - key: topology.kubernetes.io/zone
        operator: In
        values: ["eastus-1", "eastus-2", "eastus-3"]

Q2 2025: GPU Node Provisioning

Extend Karpenter to ML workloads:

spec:
  template:
    spec:
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["Standard_NC6s_v3", "Standard_NC12s_v3"]

Q3 2025: FinOps Integration

Implement chargeback using Kubecost with Karpenter labels:

metadata:
  labels:
    cost-center: engineering
    team: platform
    project: trading-analytics

Conclusion

This project completely transformed how the client thinks about their AKS infrastructure. We reduced monthly compute spend by 25% while actually improving resource utilization and scaling performance—a rare win-win in cloud infrastructure.

The success really came down to a few key factors:

  • Taking a gradual, phased migration approach (no big bang deployments)
  • Comprehensive monitoring and validation at every step
  • Right-sizing resource requests via VPA analysis (this was huge)
  • Strategic use of spot instances for fault-tolerant workloads
  • Proper workload isolation using taints/tolerations

What I’m most proud of is that the client now has a foundation that scales cost-effectively as their platform grows. They’re already talking about extending this to multi-zone provisioning and GPU workload support.

If you’re running static node pools and paying for capacity you don’t need, Karpenter is worth a serious look. The initial migration takes some work, but the ongoing savings and operational improvements make it worth every hour invested.


About StriveNimbus

StriveNimbus specializes in Kubernetes cost optimization, cloud-native architecture, and platform engineering for Azure environments. We help organizations reduce cloud spend while improving reliability and performance.

Ready to optimize your AKS costs? Contact us for a free cost assessment and optimization roadmap.


Technical Appendix

Karpenter vs. Cluster Autoscaler Comparison

Feature              Cluster Autoscaler   Karpenter
Provisioning speed   5-8 minutes          45-90 seconds
Instance diversity   Pre-defined pools    Dynamic selection
Consolidation        Manual               Automatic
Spot integration     Limited              Native support
Bin-packing          Basic                Advanced

Cost Calculation Methodology

# Monthly cost calculation script
def calculate_monthly_cost(node_count, vm_size, hours_per_month=730):
    # Approximate pay-as-you-go hourly rates (USD), for illustration
    pricing = {
        'Standard_D4s_v5': 0.192,
        'Standard_D8s_v5': 0.384,
        'Standard_F8s_v2': 0.355,
        'Standard_F16s_v2': 0.710,
    }
    hourly_cost = node_count * pricing.get(vm_size, 0)
    return hourly_cost * hours_per_month

# Before Karpenter
baseline_cost = calculate_monthly_cost(48, 'Standard_D8s_v5')
print(f"Baseline: ${baseline_cost:,.2f}")

# After Karpenter
optimized_cost = calculate_monthly_cost(35, 'Standard_D8s_v5') * 0.85  # Mixed sizes
print(f"Optimized: ${optimized_cost:,.2f}")
print(f"Savings: ${baseline_cost - optimized_cost:,.2f}")