End-to-End Platform Engineering on Azure Using AKS and ArgoCD

How StriveNimbus modernized a SaaS company's delivery model with AKS and GitOps via ArgoCD—achieving zero environment drift, 85% faster deployments, and complete audit visibility.

Executive Summary

I’ve seen this pattern repeat itself across dozens of engineering organizations: talented teams slowed down by manual provisioning, environment drift, and complete lack of deployment visibility. When a mid-sized SaaS company reached out to us, they were experiencing exactly these symptoms—and their velocity was suffering for it.

We helped them build a modern platform engineering foundation using AKS (Azure Kubernetes Service) and ArgoCD for GitOps automation. The results speak for themselves: zero configuration drift, 85% faster deployments, and complete audit visibility that transformed how their teams ship software.

Key Outcomes:

  • Environment provisioning: 2-3 days → 15 minutes (automated)
  • Configuration drift incidents: 12/quarter → 0/quarter
  • Mean time to deployment: 4 hours → 35 minutes
  • Deployment rollback time: 45 minutes → 2 minutes
  • Failed deployments due to config errors: 18/month → 1/month
  • Infrastructure-related support tickets: -72%
  • Audit compliance time: 3 days → real-time visibility

Client Background

Industry: Enterprise SaaS (HR Tech)

Team Size: 85 engineers across 12 product teams

Infrastructure Scale:

  • 6 environments (dev, 3x staging, 2x production)
  • ~300 microservices
  • Azure-native stack
  • Monthly active deployments: 800+

Initial State:

When we started working with them, their infrastructure looked like that of most mid-stage companies I’ve seen:

  • Manual kubectl deployments: Engineers directly applying manifests to clusters
  • Snowflake environments: Each environment had subtle differences nobody could explain
  • YAML scattered everywhere: Kubernetes manifests in 15+ repositories
  • Tribal knowledge: Only 3 people understood the full deployment process
  • No visibility: Developers had no idea what was running where
  • Zero audit trail: Compliance team spent days reconstructing deployment history
  • Alert fatigue: 40% of on-call pages were environment config issues

The VP of Engineering told me something I’ll never forget: “We’re hiring great engineers, then wasting their talent on YAML archaeology and firefighting drift.”

The Challenge: Platform Fragmentation

Let me break down the specific problems they were facing.

Problem 1: Environment Sprawl and Drift

Each environment was a unique snowflake:

# What developers actually had to do
kubectl config use-context dev-cluster
kubectl apply -f service-a.yaml  # Wait, which version?
kubectl apply -f configmap-dev.yaml  # Or was it configmap-development.yaml?

# Different across every environment
# No source of truth
# No rollback strategy
# Manual reconciliation every week

Impact: Configuration drift caused 12 production incidents in Q3 2024 alone. Teams spent more time debugging environment differences than building features.

Specific incident: Production had a ConfigMap value pointing to a staging database. The issue existed for 3 weeks before being discovered. Root cause: manual kubectl apply with wrong context.

Problem 2: No Deployment Visibility

# Typical deployment "process"
$ kubectl apply -f deployment.yaml
deployment.apps/my-service configured

# Success? Failure? Who deployed it? When?
# Nobody knows until something breaks

Questions that took hours to answer:

  • What version of service X is running in staging?
  • Who deployed the change that broke production last night?
  • What’s the diff between dev and prod configurations?
  • Can we rollback to the version from last Tuesday?

Impact: Post-incident reviews took 3-4 hours just gathering deployment history. Rollbacks required tribal knowledge and prayer.

Problem 3: Compliance and Audit Nightmare

Their compliance team needed to answer questions like:

  • “Show me all production deployments in Q3”
  • “Who has permission to deploy to production?”
  • “What was the configuration on August 15th?”

Reality: Stitching together Git logs, Slack messages, and kubectl audit logs took 2-3 days per audit request.

Impact: SOC 2 audit cost $45,000 in engineering time. They were considering not pursuing SOC 2 Type II due to operational burden.

Problem 4: Broken Rollbacks

# How rollbacks actually worked
1. Search Slack for "what was running before?"
2. Find old YAML in Git history (maybe)
3. Hope it's the right version
4. kubectl apply and pray
5. Wait 10-15 minutes to see if it worked
6. Repeat if it didn't

# Average rollback time: 45 minutes
# With incident stress: 90+ minutes

Solution Architecture: GitOps with ArgoCD

We designed a GitOps-first platform architecture that made Git the single source of truth for everything running in Kubernetes.

Architecture Overview

graph TB
    subgraph DevWorkflow["Developer Workflow"]
        dev[Developer]
        pr[Pull Request]
        review[Code Review]
    end

    subgraph GitOps["GitOps Control Plane"]
        gitRepo["Git Repository<br/>Single Source of Truth"]
        main["Main Branch"]
        envOverlays["Environment Overlays<br/>dev / staging / prod"]
    end

    subgraph ArgoCD["ArgoCD Layer"]
        argoServer["ArgoCD Server<br/>UI Dashboard"]
        appController["Application Controller<br/>Reconciliation Loop"]
        repoServer["Repo Server<br/>Manifest Generation"]
        apps["ArgoCD Applications<br/>Per Service, Per Environment"]
    end

    subgraph AKS["Azure Kubernetes Service"]
        cluster1["AKS Dev<br/>East US"]
        cluster2["AKS Staging<br/>East US"]
        cluster3["AKS Prod<br/>Multi-Region"]
    end

    subgraph Observability["Observability Stack"]
        prometheus["Prometheus<br/>Metrics"]
        grafana["Grafana<br/>Dashboards"]
        azMonitor["Azure Monitor<br/>Logs"]
    end

    dev -->|1. Commit Changes| pr
    pr -->|2. Merge After Review| gitRepo
    gitRepo --> main
    main --> envOverlays
    envOverlays -->|3. ArgoCD Watches| argoServer
    argoServer --> appController
    appController --> repoServer
    repoServer -->|4. Generate Manifests| apps
    apps -->|5. Sync to Cluster| cluster1
    apps -->|5. Sync to Cluster| cluster2
    apps -->|5. Sync to Cluster| cluster3
    cluster1 & cluster2 & cluster3 -->|Metrics| prometheus
    prometheus --> grafana
    cluster1 & cluster2 & cluster3 -->|Logs| azMonitor
    argoServer -.->|Real-time Status| dev
    appController -.->|Health Checks| apps

Core Principles

1. Git as Single Source of Truth

  • Every deployment is a Git commit
  • Rollback = revert commit or point to previous version
  • Audit trail = Git history
  • Approval = PR approval

2. Declarative Configuration

  • Describe desired state, not imperative commands
  • ArgoCD continuously reconciles actual vs desired state
  • Self-healing: ArgoCD reverts manual changes automatically

3. Environment Promotion

  • Changes flow: dev → staging → production
  • Promotion = update Git ref or overlay
  • Same manifests across environments (with overlays)

4. Zero-Trust Cluster Access

  • Developers never kubectl directly to production
  • All changes via Git + ArgoCD
  • Least-privilege RBAC in clusters
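
As a concrete sketch of that last point (group and binding names are illustrative, not the client’s actual setup), developers get the built-in read-only ClusterRole while write permissions stay with ArgoCD’s service account:

# Hypothetical example: read-only cluster access for developers
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: developers-view
subjects:
- kind: Group
  name: developers                 # Azure AD group mapped into the cluster
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                       # Kubernetes built-in read-only role
  apiGroup: rbac.authorization.k8s.io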

Implementation: Building the Platform

Phase 1: GitOps Repository Structure (Week 1)

First, we established the repository structure that would become the source of truth.

platform-gitops/
├── apps/
│   ├── base/                          # Base manifests (DRY)
│   │   ├── service-a/
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   ├── configmap.yaml
│   │   │   └── kustomization.yaml
│   │   ├── service-b/
│   │   └── service-c/
│   └── overlays/                      # Environment-specific
│       ├── dev/
│       │   ├── service-a/
│       │   │   └── kustomization.yaml # Patches for dev
│       │   └── kustomization.yaml
│       ├── staging/
│       └── production/
├── infrastructure/
│   ├── base/
│   │   ├── ingress-nginx/
│   │   ├── cert-manager/
│   │   └── monitoring/
│   └── overlays/
│       ├── dev/
│       ├── staging/
│       └── production/
├── argocd/
│   ├── applications/                  # ArgoCD Application CRDs
│   │   ├── service-a-dev.yaml
│   │   ├── service-a-staging.yaml
│   │   ├── service-a-prod.yaml
│   │   └── app-of-apps.yaml           # Parent application
│   └── projects/                      # ArgoCD Projects (RBAC)
│       ├── team-platform.yaml
│       ├── team-backend.yaml
│       └── team-frontend.yaml
└── README.md

Key decisions:

  • Kustomize over Helm: More transparent, easier to review in PRs
  • Monorepo: Single repo for all environments (easier to promote changes)
  • App-of-Apps pattern: Parent ArgoCD app manages child apps (bootstrap automation)

Phase 2: AKS Cluster Setup with Terraform (Week 1-2)

# terraform/aks-clusters/main.tf
module "aks_dev" {
  source = "./modules/aks-cluster"

  environment         = "dev"
  location           = "eastus"
  resource_group_name = "rg-platform-dev"

  node_pools = {
    system = {
      vm_size    = "Standard_D4s_v5"
      node_count = 2
      min_count  = 2
      max_count  = 4
    }
    apps = {
      vm_size    = "Standard_D8s_v5"
      node_count = 3
      min_count  = 3
      max_count  = 10
    }
  }

  enable_oidc_issuer       = true
  enable_workload_identity = true

  # Network configuration
  vnet_subnet_id = azurerm_subnet.aks_dev.id
  network_plugin = "azure"
  network_policy = "azure"

  # Enable Azure Monitor
  oms_agent_enabled = true
  log_analytics_workspace_id = azurerm_log_analytics_workspace.aks.id

  tags = {
    Environment = "Development"
    ManagedBy   = "Terraform"
    GitOpsRepo  = "platform-gitops"
  }
}

# Staging cluster (similar config)
module "aks_staging" {
  source = "./modules/aks-cluster"
  # ... staging config
}

# Production cluster (multi-zone, larger)
module "aks_production" {
  source = "./modules/aks-cluster"

  environment = "production"
  location    = "eastus"

  node_pools = {
    system = {
      vm_size             = "Standard_D8s_v5"
      node_count          = 3
      min_count           = 3
      max_count           = 6
      availability_zones  = ["1", "2", "3"]
    }
    apps = {
      vm_size             = "Standard_D16s_v5"
      node_count          = 6
      min_count           = 6
      max_count           = 20
      availability_zones  = ["1", "2", "3"]
    }
  }

  # Production-specific settings
  enable_private_cluster = true
  api_server_authorized_ip_ranges = ["10.0.0.0/16"]

  # ... rest of prod config
}
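
The aks-cluster module itself isn’t shown in this write-up; as a rough sketch (resource names and variable wiring are illustrative, not the client’s actual module), its core is a single azurerm_kubernetes_cluster resource per environment:

# Illustrative sketch of what ./modules/aks-cluster might wrap (azurerm 3.x syntax).
# Additional node pools would be separate azurerm_kubernetes_cluster_node_pool resources.
resource "azurerm_kubernetes_cluster" "this" {
  name                = "aks-${var.environment}"
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = "aks-${var.environment}"

  oidc_issuer_enabled       = var.enable_oidc_issuer
  workload_identity_enabled = var.enable_workload_identity

  default_node_pool {
    name           = "system"
    vm_size        = var.node_pools.system.vm_size
    node_count     = var.node_pools.system.node_count
    vnet_subnet_id = var.vnet_subnet_id
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = var.network_plugin
    network_policy = var.network_policy
  }

  oms_agent {
    log_analytics_workspace_id = var.log_analytics_workspace_id
  }

  tags = var.tags
}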

Phase 3: ArgoCD Installation (Week 2)

Install ArgoCD via Helm

# Create namespace
kubectl create namespace argocd

# Add ArgoCD Helm repository
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update

# Install ArgoCD with custom values
helm install argocd argo/argo-cd \
  --namespace argocd \
  --version 6.0.0 \
  --values argocd-values.yaml

# Wait for rollout
kubectl rollout status -n argocd deployment/argocd-server
kubectl rollout status -n argocd deployment/argocd-repo-server
kubectl rollout status -n argocd statefulset/argocd-application-controller

ArgoCD Custom Values

# argocd-values.yaml
global:
  domain: argocd.company.com

server:
  replicas: 2

  ingress:
    enabled: true
    ingressClassName: nginx
    hostname: argocd.company.com
    tls: true

# Chart 5.x+ moved ArgoCD configuration out of `server.config` into `configs`
configs:
  cm:
    url: https://argocd.company.com

    # SSO with Azure AD (via Dex)
    dex.config: |
      connectors:
      - type: microsoft
        id: microsoft
        name: Microsoft
        config:
          clientID: $AZURE_AD_CLIENT_ID
          clientSecret: $AZURE_AD_CLIENT_SECRET
          redirectURI: https://argocd.company.com/api/dex/callback
          tenant: <tenant-id>

  rbac:
    policy.default: role:readonly
    policy.csv: |
      # Platform team = full access
      g, platform-team, role:admin

      # Backend team = deploy backend services
      p, role:backend-deployer, applications, *, backend/*, allow
      g, backend-team, role:backend-deployer

      # Frontend team = deploy frontend services
      p, role:frontend-deployer, applications, *, frontend/*, allow
      g, frontend-team, role:frontend-deployer

controller:
  replicas: 2

  metrics:
    enabled: true
    serviceMonitor:
      enabled: true

repoServer:
  replicas: 2

  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi

Access ArgoCD UI

# Get initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d

# Port-forward to access UI (if ingress not set up yet)
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Access at https://localhost:8080
# Username: admin
# Password: <from above command>
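
From there, the first thing we did was rotate the bootstrap credential. A typical sequence (standard argocd CLI commands):

# Log in with the CLI, set a real admin password, then remove the bootstrap secret
argocd login argocd.company.com --username admin
argocd account update-password
kubectl -n argocd delete secret argocd-initial-admin-secret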

Phase 4: ArgoCD Projects and Applications (Week 3)

Create ArgoCD Projects for Team Isolation

# argocd/projects/team-backend.yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: backend
  namespace: argocd
spec:
  description: Backend team services

  sourceRepos:
    - 'https://github.com/company/platform-gitops'

  destinations:
    - namespace: 'backend-*'
      server: '*'

  clusterResourceWhitelist:
    - group: ''
      kind: Namespace

  namespaceResourceWhitelist:
    - group: 'apps'
      kind: Deployment
    - group: ''
      kind: Service
    - group: ''
      kind: ConfigMap
    - group: ''
      kind: Secret

  roles:
    - name: deployer
      description: Backend team deployers
      policies:
        - p, proj:backend:deployer, applications, *, backend/*, allow
      groups:
        - backend-team

Create ArgoCD Applications

# argocd/applications/service-a-dev.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: service-a-dev
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: backend

  source:
    repoURL: https://github.com/company/platform-gitops
    targetRevision: HEAD
    path: apps/overlays/dev/service-a

  destination:
    server: https://aks-dev.eastus.cloudapp.azure.com
    namespace: backend-dev

  syncPolicy:
    automated:
      prune: true      # Delete resources not in Git
      selfHeal: true   # Revert manual changes
      allowEmpty: false

    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground

    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

  revisionHistoryLimit: 10

App-of-Apps Pattern (Bootstrap All Services)

# argocd/applications/app-of-apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-of-apps
  namespace: argocd
spec:
  project: default

  source:
    repoURL: https://github.com/company/platform-gitops
    targetRevision: HEAD
    path: argocd/applications

  destination:
    server: https://kubernetes.default.svc
    namespace: argocd

  syncPolicy:
    automated:
      prune: true
      selfHeal: true

How it works:

  1. Deploy app-of-apps application manually (once)
  2. app-of-apps reads argocd/applications/ directory
  3. Creates all child Application CRDs automatically
  4. Add new service → commit new Application YAML → ArgoCD picks it up
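
A minimal sketch of steps 1 and 4 (the new service name is hypothetical):

# One-time bootstrap: apply the parent application by hand
kubectl apply -n argocd -f argocd/applications/app-of-apps.yaml

# After that, onboarding a service is just a commit to the GitOps repo
git add argocd/applications/service-d-dev.yaml   # hypothetical new service
git commit -m "Onboard service-d to dev"
git push
# ArgoCD detects the new Application CRD and starts managing it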

Phase 5: Service Deployment with Kustomize (Week 3-4)

Base Manifests

# apps/base/service-a/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
spec:
  replicas: 2
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      containers:
      - name: app
        image: company.azurecr.io/service-a:v1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: LOG_LEVEL
          value: info
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
---
# apps/base/service-a/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: service-a
spec:
  selector:
    app: service-a
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
---
# apps/base/service-a/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - deployment.yaml
  - service.yaml

Environment Overlays

# apps/overlays/dev/service-a/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: backend-dev

resources:
  - ../../../base/service-a

replicas:
  - name: service-a
    count: 1

images:
  - name: company.azurecr.io/service-a
    newTag: latest

patches:
  - target:
      kind: Deployment
      name: service-a
    patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/env/0/value
        value: debug
---
# apps/overlays/production/service-a/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: backend-prod

resources:
  - ../../../base/service-a

replicas:
  - name: service-a
    count: 4

images:
  - name: company.azurecr.io/service-a
    newTag: v1.2.3  # Pinned version in prod

patches:
  - target:
      kind: Deployment
      name: service-a
    patch: |-
      - op: add
        path: /spec/template/spec/affinity
        value:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: service-a
              topologyKey: kubernetes.io/hostname
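
Overlays are easy to sanity-check locally before opening a PR. For example:

# Render exactly what ArgoCD will apply for an environment
kustomize build apps/overlays/production/service-a

# Compare two environments on rendered output rather than raw YAML
diff <(kustomize build apps/overlays/dev/service-a) \
     <(kustomize build apps/overlays/production/service-a)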

Phase 6: CI/CD Integration (Week 4)

Azure DevOps Pipeline

# azure-pipelines.yml (in service repositories)
trigger:
  branches:
    include:
      - main

pool:
  vmImage: 'ubuntu-latest'

variables:
  dockerRegistry: 'company.azurecr.io'
  imageName: 'service-a'
  gitopsRepo: 'platform-gitops'

stages:
- stage: Build
  jobs:
  - job: BuildAndPush
    steps:
    - task: Docker@2
      displayName: Build and Push Image
      inputs:
        containerRegistry: 'AzureContainerRegistry'
        repository: '$(imageName)'
        command: 'buildAndPush'
        Dockerfile: '**/Dockerfile'
        tags: |
          $(Build.BuildId)
          latest

- stage: UpdateGitOps
  dependsOn: Build
  jobs:
  - job: UpdateManifests
    steps:
    - checkout: none

    - bash: |
        git clone https://$(GITHUB_TOKEN)@github.com/company/$(gitopsRepo).git
        cd $(gitopsRepo)

        # Update image tag in dev overlay
        cd apps/overlays/dev/$(imageName)
        kustomize edit set image company.azurecr.io/$(imageName):$(Build.BuildId)

        # Commit and push
        git config user.email "ci@company.com"
        git config user.name "Azure Pipelines"
        git add .
        git commit -m "Update $(imageName) to build $(Build.BuildId)"
        git push origin main
      displayName: 'Update GitOps Repository'
      env:
        GITHUB_TOKEN: $(GITHUB_TOKEN)

Deployment flow:

  1. Developer merges PR to service repo
  2. Azure Pipeline builds Docker image
  3. Pipeline updates GitOps repo with new image tag
  4. ArgoCD detects change in Git
  5. ArgoCD syncs new image to cluster
  6. Health checks pass → deployment complete
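
If a pipeline stage needs to block until the change is actually live (for example, before running smoke tests), the ArgoCD CLI can wait on the sync. A minimal sketch:

# Wait until ArgoCD has synced the app and it reports Healthy (timeout in seconds)
argocd app wait service-a-dev --sync --health --timeout 300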

Phase 7: Observability Integration (Week 5)

Prometheus ServiceMonitor for ArgoCD

# monitoring/argocd-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: argocd
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-metrics
  endpoints:
  - port: metrics
    interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-alerts
  namespace: argocd
spec:
  groups:
  - name: argocd
    interval: 30s
    rules:
    - alert: ArgoAppOutOfSync
      expr: |
        argocd_app_info{sync_status="OutOfSync"} == 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "ArgoCD app {{ $labels.name }} out of sync"
        description: "Application has been out of sync for 15 minutes"

    - alert: ArgoAppUnhealthy
      expr: |
        argocd_app_info{health_status!="Healthy"} == 1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "ArgoCD app {{ $labels.name }} unhealthy"
        description: "Application health status: {{ $labels.health_status }}"

Grafana Dashboard

# Import ArgoCD dashboard
# Dashboard ID: 14584 (official ArgoCD dashboard)

Results and Impact

Quantitative Improvements

| Metric                    | Before     | After       | Improvement        |
|---------------------------|------------|-------------|--------------------|
| Environment provisioning  | 2-3 days   | 15 minutes  | 99% faster         |
| Config drift incidents    | 12/quarter | 0/quarter   | Zero drift         |
| Mean time to deployment   | 4 hours    | 35 minutes  | 85% faster         |
| Deployment rollback time  | 45 minutes | 2 minutes   | 95% faster         |
| Failed deployments        | 18/month   | 1/month     | 94% reduction      |
| Audit compliance time     | 3 days     | Real-time   | Instant visibility |
| Platform team tickets     | 250/month  | 70/month    | 72% reduction      |

Qualitative Wins

GitOps Benefits Realized:

Complete Audit Trail:

# Question: Who deployed what to production on Sept 15?
# Answer: git log --since="2025-09-15" --until="2025-09-16" -- apps/overlays/production/

# Shows: commits, authors, timestamps, diffs
# Compliance team went from 3 days → 5 minutes

Instant Rollbacks:

# Old way: 45 minutes of panic and tribal knowledge
# New way:
argocd app rollback service-a-prod

# Or via Git:
git revert HEAD
git push

# ArgoCD automatically syncs previous version
# Rollback complete in 2 minutes

Self-Healing Infrastructure:

One memorable incident: a junior engineer ran kubectl delete deployment service-b-prod by accident (wrong kubectl context).

Before ArgoCD: Outage until someone noticed and re-applied manifests manually.

After ArgoCD: ArgoCD detected drift, automatically recreated deployment in 30 seconds. Service restored before anyone noticed.

Developer Experience:

Developers told us:

  • “I can finally see what’s actually deployed without asking the platform team”
  • “Rollbacks went from terrifying to boring—exactly what you want”
  • “PR reviews now include deployment config changes—we catch issues before prod”

Platform Team:

  • “We went from 80% firefighting to 80% building new capabilities”
  • “On-call went from constant interruptions to actually quiet nights”
  • “We can onboard new engineers in days, not weeks”

Cost Optimization

While not the primary goal, we achieved cost savings:

  • Reduced cluster waste: Better resource utilization (35% reduction in over-provisioned capacity)
  • Platform team efficiency: 3 FTEs redeployed to product work (saved ~$450K/year)
  • Incident reduction: Fewer outages, less revenue impact
  • Compliance automation: $40K saved on SOC 2 audit prep

Key Lessons Learned

1. Git Becomes the Approval Gate

Insight: PR approvals became deployment approvals. This shifted security left.

Implementation:

  • Production overlays require 2 approvals
  • CODEOWNERS file enforces review by appropriate teams
  • GitHub branch protection prevents direct pushes
# .github/CODEOWNERS
apps/overlays/production/**  @platform-team @security-team

Result: Security team has visibility into every prod change before it happens.

2. Start with Dev, Perfect Before Prod

Mistake we almost made: Rolling out ArgoCD to production on day one.

What worked:

  • Week 1-2: Dev environment only
  • Week 3: Staging
  • Week 4: Production (with manual sync initially)
  • Week 5: Enable auto-sync in production

Lesson: Developers need time to internalize GitOps workflows. Staging is where you discover edge cases.
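
For the week-4 manual-sync phase, the production Applications simply omitted the automated block. A sketch (cluster URL is illustrative):

# Production Application during the manual-sync phase: nothing changes in the
# cluster until someone runs `argocd app sync service-a-prod` after reviewing the diff.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: service-a-prod
  namespace: argocd
spec:
  project: backend
  source:
    repoURL: https://github.com/company/platform-gitops
    targetRevision: HEAD
    path: apps/overlays/production/service-a
  destination:
    server: https://aks-prod.eastus.cloudapp.azure.com   # illustrative cluster URL
    namespace: backend-prod
  syncPolicy:
    syncOptions:
      - CreateNamespace=true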

3. Auto-Sync with Self-Heal is Powerful (and Scary)

We debated: should we enable syncPolicy.automated.selfHeal in production?

Concern: What if ArgoCD automatically reverts a legitimate manual hotfix during an incident?

Solution:

  • Enabled self-heal in prod
  • Documented incident procedure: “Suspend ArgoCD sync during active incidents”
# During incident, pause auto-sync
argocd app set service-a-prod --sync-policy=none

# Apply hotfix manually
kubectl apply -f hotfix.yaml

# After incident, commit fix to Git and re-enable sync
git commit -m "Hotfix: increased memory limit"
argocd app set service-a-prod --sync-policy=automated

Result: Self-heal caught 47 accidental manual changes in first 3 months. Zero issues during incidents.

4. ArgoCD Projects Provide RBAC Boundaries

We created separate ArgoCD Projects per team:

backend-project   → backend-team can only deploy to backend-* namespaces
frontend-project  → frontend-team can only deploy to frontend-* namespaces
platform-project  → platform-team has full access

Benefit: Teams have autonomy without risk of accidentally deploying to wrong namespace or cluster.

5. Image Tag Strategy Matters

Anti-pattern: Using :latest tag in production.

What we did:

  • Dev: :latest (fast iteration)
  • Staging: :build-12345 (commit SHA or build ID)
  • Production: :v1.2.3 (semantic version tags)

Promotion flow:

# After staging validation
cd apps/overlays/production/service-a
kustomize edit set image company.azurecr.io/service-a:v1.2.3
git commit -m "Promote service-a v1.2.3 to production"

6. Monitor ArgoCD Itself

ArgoCD is now critical infrastructure. We added:

  • Health checks: Prometheus alerts on controller/repo-server health
  • Sync lag alerts: Alert if ArgoCD hasn’t synced in 10 minutes
  • Webhook monitoring: Alert if Git webhook delivery fails
# Alert if the application controller has stopped reconciling apps
sum(rate(argocd_app_reconcile_count[10m])) == 0

7. Handling Secrets with Sealed Secrets

We don’t store secrets in Git (obviously). Options we evaluated:

  1. Azure Key Vault + External Secrets Operator (our choice)
  2. Sealed Secrets (Bitnami)
  3. SOPS with Age encryption
# Example: ExternalSecret referencing Azure Key Vault
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: service-a-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault
    kind: SecretStore
  target:
    name: service-a-secrets
  data:
  - secretKey: DB_PASSWORD
    remoteRef:
      key: service-a-db-password

ArgoCD syncs the ExternalSecret definition; the External Secrets Operator (ESO) then pulls the actual secret value from Key Vault at runtime.
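
For completeness, the SecretStore referenced above looks roughly like this (vault URL and service account are placeholders; assumes ESO is configured for AKS workload identity):

# Sketch: SecretStore pointing ESO at Azure Key Vault via workload identity
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: azure-keyvault
spec:
  provider:
    azurekv:
      authType: WorkloadIdentity
      vaultUrl: "https://<vault-name>.vault.azure.net"
      serviceAccountRef:
        name: external-secrets-sa   # SA federated with an Azure AD identity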

Future Roadmap

Q2 2025: Progressive Delivery

Integrate Argo Rollouts for canary and blue-green deployments:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: service-a
spec:
  strategy:
    canary:
      steps:
      - setWeight: 20
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100

Automated rollback based on Prometheus metrics.
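
The likely shape of that automation is an Argo Rollouts AnalysisTemplate backed by a Prometheus query; a hedged sketch (metric name, query, and threshold are illustrative):

# Sketch: abort and roll back the canary if success rate drops below 95%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service
  metrics:
    - name: success-rate
      interval: 1m
      failureLimit: 2
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service}}",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service}}"}[5m]))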

Q3 2025: Multi-Cluster Management

Deploy ArgoCD in hub-and-spoke model:

  • Central ArgoCD instance in management cluster
  • Manages applications across dev, staging, prod clusters
  • Single pane of glass for all environments
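
Spoke clusters get registered with the central instance once, after which Applications target them via destination.server (context names illustrative):

# Register each spoke cluster with the central ArgoCD (run once per cluster)
argocd cluster add aks-dev
argocd cluster add aks-staging
argocd cluster add aks-prod

# Verify registration
argocd cluster list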

Q4 2025: Application Sets

Use ApplicationSets to reduce YAML duplication:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: services
spec:
  generators:
  - list:
      elements:
      - name: service-a
        namespace: backend
      - name: service-b
        namespace: backend
  template:
    metadata:
      name: '{{name}}-prod'
    spec:
      source:
        path: apps/overlays/production/{{name}}
      # ... rest of template

One ApplicationSet generates Applications for all services.

Conclusion

GitOps with ArgoCD transformed this company’s delivery model from manual and brittle to automated and reliable. The numbers are impressive (85% faster deployments, zero drift, 2-minute rollbacks), but the real win is cultural: Git became the interface for production.

Deployments are no longer scary. They’re boring—in the best possible way. Developers commit to Git, ArgoCD handles the rest. Rollbacks are Git reverts. Audit history is Git log. Configuration drift is impossible because ArgoCD continuously reconciles.

If you’re still running kubectl apply commands against production, you’re one typo away from an outage. GitOps with ArgoCD gives you declarative infrastructure, complete auditability, and automated reconciliation—exactly what modern platforms need.


About StriveNimbus

StriveNimbus specializes in platform engineering, GitOps implementation, and cloud-native architecture for Azure environments. We help organizations build reliable, auditable deployment pipelines that scale.

Ready to implement GitOps? Contact us for a platform assessment and GitOps transformation roadmap.


Technical Appendix

ArgoCD CLI Cheat Sheet

# Login to ArgoCD
argocd login argocd.company.com

# List applications
argocd app list

# Get application details
argocd app get service-a-prod

# Sync application manually
argocd app sync service-a-prod

# Rollback to previous version
argocd app rollback service-a-prod

# View sync history
argocd app history service-a-prod

# View diff between Git and cluster
argocd app diff service-a-prod

# Delete application (remove from cluster)
argocd app delete service-a-prod

# Suspend auto-sync
argocd app set service-a-prod --sync-policy=none

# Re-enable auto-sync
argocd app set service-a-prod --sync-policy=automated

Troubleshooting Common Issues

Issue: Application stuck in “Progressing” state

# Check events
argocd app get service-a-prod

# View logs
kubectl logs -n argocd statefulset/argocd-application-controller | grep service-a-prod

# Check sync status
kubectl describe application service-a-prod -n argocd

Issue: ArgoCD not detecting Git changes

# Check webhook delivery in GitHub/Azure DevOps
# Manually refresh
argocd app get service-a-prod --refresh

# Force hard refresh
argocd app get service-a-prod --hard-refresh

Issue: Sync fails with “permission denied”

# Check ArgoCD Project RBAC
kubectl get appproject backend -n argocd -o yaml

# Verify destination cluster is registered
argocd cluster list