End-to-End Platform Engineering on Azure Using AKS and ArgoCD
How StriveNimbus modernized a SaaS company's delivery model with AKS and GitOps via ArgoCD—achieving zero environment drift, 85% faster deployments, and complete audit visibility.
Executive Summary
I’ve seen this pattern repeat itself across dozens of engineering organizations: talented teams slowed down by manual provisioning, environment drift, and complete lack of deployment visibility. When a mid-sized SaaS company reached out to us, they were experiencing exactly these symptoms—and their velocity was suffering for it.
We helped them build a modern platform engineering foundation using AKS (Azure Kubernetes Service) and ArgoCD for GitOps automation. The results speak for themselves: zero configuration drift, 85% faster deployments, and complete audit visibility that transformed how their teams ship software.
Key Outcomes:
- Environment provisioning: 2-3 days → 15 minutes (automated)
- Configuration drift incidents: 12/quarter → 0/quarter
- Mean time to deployment: 4 hours → 35 minutes
- Deployment rollback time: 45 minutes → 2 minutes
- Failed deployments due to config errors: 18/month → 1/month
- Infrastructure-related support tickets: -72%
- Audit compliance time: 3 days → real-time visibility
Client Background
Industry: Enterprise SaaS (HR Tech)
Team Size: 85 engineers across 12 product teams
Infrastructure Scale:
- 6 environments (dev, 3x staging, 2x production)
- ~300 microservices
- Azure-native stack
- Monthly active deployments: 800+
Initial State:
When we started working with them, their infrastructure looked like most mid-stage companies I’ve seen:
- Manual kubectl deployments: Engineers directly applying manifests to clusters
- Snowflake environments: Each environment had subtle differences nobody could explain
- YAML scattered everywhere: Kubernetes manifests in 15+ repositories
- Tribal knowledge: Only 3 people understood the full deployment process
- No visibility: Developers had no idea what was running where
- Zero audit trail: Compliance team spent days reconstructing deployment history
- Alert fatigue: 40% of on-call pages were environment config issues
The VP of Engineering told me something I’ll never forget: “We’re hiring great engineers, then wasting their talent on YAML archaeology and firefighting drift.”
The Challenge: Platform Fragmentation
Let me break down the specific problems they were facing.
Problem 1: Environment Sprawl and Drift
Each environment was a unique snowflake:
# What developers actually had to do
kubectl config use-context dev-cluster
kubectl apply -f service-a.yaml # Wait, which version?
kubectl apply -f configmap-dev.yaml # Or was it configmap-development.yaml?
# Different across every environment
# No source of truth
# No rollback strategy
# Manual reconciliation every week
Impact: Configuration drift caused 12 production incidents in Q3 2024 alone. Teams spent more time debugging environment differences than building features.
Specific incident: Production had a ConfigMap value pointing to a staging database. The issue existed for 3 weeks before being discovered. Root cause: manual kubectl apply with wrong context.
Problem 2: No Deployment Visibility
# Typical deployment "process"
$ kubectl apply -f deployment.yaml
deployment.apps/my-service configured
# Success? Failure? Who deployed it? When?
# Nobody knows until something breaks
Questions that took hours to answer:
- What version of service X is running in staging?
- Who deployed the change that broke production last night?
- What’s the diff between dev and prod configurations?
- Can we rollback to the version from last Tuesday?
Impact: Post-incident reviews took 3-4 hours just gathering deployment history. Rollbacks required tribal knowledge and prayer.
Problem 3: Compliance and Audit Nightmare
Their compliance team needed to answer questions like:
- “Show me all production deployments in Q3”
- “Who has permission to deploy to production?”
- “What was the configuration on August 15th?”
Reality: Stitching together Git logs, Slack messages, and kubectl audit logs took 2-3 days per audit request.
Impact: SOC 2 audit cost $45,000 in engineering time. They were considering not pursuing SOC 2 Type II due to operational burden.
Problem 4: Broken Rollbacks
# How rollbacks actually worked
1. Search Slack for "what was running before?"
2. Find old YAML in Git history (maybe)
3. Hope it's the right version
4. kubectl apply and pray
5. Wait 10-15 minutes to see if it worked
6. Repeat if it didn't
# Average rollback time: 45 minutes
# With incident stress: 90+ minutes
Solution Architecture: GitOps with ArgoCD
We designed a GitOps-first platform architecture that made Git the single source of truth for everything running in Kubernetes.
Architecture Overview
graph TB
  subgraph DevWorkflow["Developer Workflow"]
    dev[Developer]
    pr[Pull Request]
    review[Code Review]
  end
  subgraph GitOps["GitOps Control Plane"]
    gitRepo[Git Repository<br/>Single Source of Truth]
    main[Main Branch]
    envOverlays[Environment Overlays<br/>dev staging prod]
  end
  subgraph ArgoCD["ArgoCD Layer"]
    argoServer[ArgoCD Server<br/>UI Dashboard]
    appController[Application Controller<br/>Reconciliation Loop]
    repoServer[Repo Server<br/>Manifest Generation]
    apps[ArgoCD Applications<br/>Per Service Per Environment]
  end
  subgraph AKS["Azure Kubernetes Service"]
    cluster1[AKS Dev<br/>East US]
    cluster2[AKS Staging<br/>East US]
    cluster3[AKS Prod<br/>Multi-Region]
  end
  subgraph Observability["Observability Stack"]
    prometheus[Prometheus<br/>Metrics]
    grafana[Grafana<br/>Dashboards]
    azMonitor[Azure Monitor<br/>Logs]
  end
  dev -->|1. Commit Changes| pr
  pr -->|2. Merge After Review| gitRepo
  gitRepo --> main
  main --> envOverlays
  envOverlays -->|3. ArgoCD Watches| argoServer
  argoServer --> appController
  appController --> repoServer
  repoServer -->|4. Generate Manifests| apps
  apps -->|5. Sync to Cluster| cluster1
  apps -->|5. Sync to Cluster| cluster2
  apps -->|5. Sync to Cluster| cluster3
  cluster1 & cluster2 & cluster3 -->|Metrics| prometheus
  prometheus --> grafana
  cluster1 & cluster2 & cluster3 -->|Logs| azMonitor
  argoServer -.->|Real-time Status| dev
  appController -.->|Health Checks| apps
Core Principles
1. Git as Single Source of Truth
- Every deployment is a Git commit
- Rollback = revert commit or point to previous version
- Audit trail = Git history
- Approval = PR approval
2. Declarative Configuration
- Describe desired state, not imperative commands
- ArgoCD continuously reconciles actual vs desired state
- Self-healing: ArgoCD reverts manual changes automatically
3. Environment Promotion
- Changes flow: dev → staging → production
- Promotion = update Git ref or overlay
- Same manifests across environments (with overlays)
4. Zero-Trust Cluster Access
- Developers never kubectl directly to production
- All changes via Git + ArgoCD
- Least-privilege RBAC in clusters
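To make the last principle concrete, here's a minimal sketch of the read-only access developers keep once ArgoCD owns all writes to the cluster. The namespace, Role name, and Azure AD group ID are illustrative rather than taken from the client's setup:
# rbac/backend-readonly.yaml (illustrative)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: backend-readonly
  namespace: backend-prod
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/log", "services", "configmaps", "deployments", "replicasets"]
    verbs: ["get", "list", "watch"]   # no create/update/delete for humans
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: backend-readonly
  namespace: backend-prod
subjects:
  - kind: Group
    name: "<backend-team-azure-ad-group-object-id>"   # via AKS Azure AD integration
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: backend-readonly
  apiGroup: rbac.authorization.k8s.io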
Implementation: Building the Platform
Phase 1: GitOps Repository Structure (Week 1)
First, we established the repository structure that would become the source of truth.
platform-gitops/
├── apps/
│ ├── base/ # Base manifests (DRY)
│ │ ├── service-a/
│ │ │ ├── deployment.yaml
│ │ │ ├── service.yaml
│ │ │ ├── configmap.yaml
│ │ │ └── kustomization.yaml
│ │ ├── service-b/
│ │ └── service-c/
│ └── overlays/ # Environment-specific
│ ├── dev/
│ │ ├── service-a/
│ │ │ └── kustomization.yaml # Patches for dev
│ │ └── kustomization.yaml
│ ├── staging/
│ └── production/
├── infrastructure/
│ ├── base/
│ │ ├── ingress-nginx/
│ │ ├── cert-manager/
│ │ └── monitoring/
│ └── overlays/
│ ├── dev/
│ ├── staging/
│ └── production/
├── argocd/
│ ├── applications/ # ArgoCD Application CRDs
│ │ ├── service-a-dev.yaml
│ │ ├── service-a-staging.yaml
│ │ ├── service-a-prod.yaml
│ │ └── app-of-apps.yaml # Parent application
│ └── projects/ # ArgoCD Projects (RBAC)
│ ├── team-platform.yaml
│ ├── team-backend.yaml
│ └── team-frontend.yaml
└── README.md
Key decisions:
- Kustomize over Helm: More transparent, easier to review in PRs
- Monorepo: Single repo for all environments (easier to promote changes)
- App-of-Apps pattern: Parent ArgoCD app manages child apps (bootstrap automation)
Phase 2: AKS Cluster Setup with Terraform (Week 1-2)
# terraform/aks-clusters/main.tf
module "aks_dev" {
source = "./modules/aks-cluster"
environment = "dev"
location = "eastus"
resource_group_name = "rg-platform-dev"
node_pools = {
system = {
vm_size = "Standard_D4s_v5"
node_count = 2
min_count = 2
max_count = 4
}
apps = {
vm_size = "Standard_D8s_v5"
node_count = 3
min_count = 3
max_count = 10
}
}
enable_oidc_issuer = true
enable_workload_identity = true
# Network configuration
vnet_subnet_id = azurerm_subnet.aks_dev.id
network_plugin = "azure"
network_policy = "azure"
# Enable Azure Monitor
oms_agent_enabled = true
log_analytics_workspace_id = azurerm_log_analytics_workspace.aks.id
tags = {
Environment = "Development"
ManagedBy = "Terraform"
GitOpsRepo = "platform-gitops"
}
}
# Staging cluster (similar config)
module "aks_staging" {
source = "./modules/aks-cluster"
# ... staging config
}
# Production cluster (multi-zone, larger)
module "aks_production" {
source = "./modules/aks-cluster"
environment = "production"
location = "eastus"
node_pools = {
system = {
vm_size = "Standard_D8s_v5"
node_count = 3
min_count = 3
max_count = 6
availability_zones = ["1", "2", "3"]
}
apps = {
vm_size = "Standard_D16s_v5"
node_count = 6
min_count = 6
max_count = 20
availability_zones = ["1", "2", "3"]
}
}
# Production-specific settings
enable_private_cluster = true
api_server_authorized_ip_ranges = ["10.0.0.0/16"]
# ... rest of prod config
}
Phase 3: ArgoCD Installation (Week 2)
Install ArgoCD via Helm
# Create namespace
kubectl create namespace argocd
# Add ArgoCD Helm repository
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
# Install ArgoCD with custom values
helm install argocd argo/argo-cd \
--namespace argocd \
--version 6.0.0 \
--values argocd-values.yaml
# Wait for rollout
kubectl rollout status -n argocd deployment/argocd-server
kubectl rollout status -n argocd deployment/argocd-repo-server
kubectl rollout status -n argocd deployment/argocd-application-controller
ArgoCD Custom Values
# argocd-values.yaml
global:
domain: argocd.company.com
server:
replicas: 2
ingress:
enabled: true
ingressClassName: nginx
hosts:
- argocd.company.com
tls:
- secretName: argocd-tls
hosts:
- argocd.company.com
config:
url: https://argocd.company.com
# SSO with Azure AD
dex.config: |
connectors:
- type: microsoft
id: microsoft
name: Microsoft
config:
clientID: $AZURE_AD_CLIENT_ID
clientSecret: $AZURE_AD_CLIENT_SECRET
redirectURI: https://argocd.company.com/api/dex/callback
tenant: <tenant-id>
rbacConfig:
policy.default: role:readonly
policy.csv: |
# Platform team = full access
g, platform-team, role:admin
# Backend team = deploy backend services
p, role:backend-deployer, applications, *, backend/*, allow
g, backend-team, role:backend-deployer
# Frontend team = deploy frontend services
p, role:frontend-deployer, applications, *, frontend/*, allow
g, frontend-team, role:frontend-deployer
controller:
replicas: 2
metrics:
enabled: true
serviceMonitor:
enabled: true
repoServer:
replicas: 2
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
Access ArgoCD UI
# Get initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath="{.data.password}" | base64 -d
# Port-forward to access UI (if ingress not set up yet)
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Access at https://localhost:8080
# Username: admin
# Password: <from above command>
Phase 4: ArgoCD Projects and Applications (Week 3)
Create ArgoCD Projects for Team Isolation
# argocd/projects/team-backend.yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
name: backend
namespace: argocd
spec:
description: Backend team services
sourceRepos:
- 'https://github.com/company/platform-gitops'
destinations:
- namespace: 'backend-*'
server: '*'
clusterResourceWhitelist:
- group: ''
kind: Namespace
namespaceResourceWhitelist:
- group: 'apps'
kind: Deployment
- group: ''
kind: Service
- group: ''
kind: ConfigMap
- group: ''
kind: Secret
roles:
- name: deployer
description: Backend team deployers
policies:
- p, proj:backend:deployer, applications, *, backend/*, allow
groups:
- backend-team
Create ArgoCD Applications
# argocd/applications/service-a-dev.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: service-a-dev
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: backend
source:
repoURL: https://github.com/company/platform-gitops
targetRevision: HEAD
path: apps/overlays/dev/service-a
destination:
server: https://aks-dev.eastus.cloudapp.azure.com
namespace: backend-dev
syncPolicy:
automated:
prune: true # Delete resources not in Git
selfHeal: true # Revert manual changes
allowEmpty: false
syncOptions:
- CreateNamespace=true
- PrunePropagationPolicy=foreground
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
revisionHistoryLimit: 10
App-of-Apps Pattern (Bootstrap All Services)
# argocd/applications/app-of-apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: app-of-apps
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/company/platform-gitops
targetRevision: HEAD
path: argocd/applications
destination:
server: https://kubernetes.default.svc
namespace: argocd
syncPolicy:
automated:
prune: true
selfHeal: true
How it works:
- Deploy the app-of-apps Application manually (once)
- app-of-apps reads the argocd/applications/ directory
- ArgoCD creates all child Application CRDs automatically
- Add new service → commit new Application YAML → ArgoCD picks it up
Phase 5: Service Deployment with Kustomize (Week 3-4)
Base Manifests
# apps/base/service-a/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: service-a
spec:
replicas: 2
selector:
matchLabels:
app: service-a
template:
metadata:
labels:
app: service-a
spec:
containers:
- name: app
image: company.azurecr.io/service-a:v1.0.0
ports:
- containerPort: 8080
env:
- name: LOG_LEVEL
value: info
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
---
# apps/base/service-a/service.yaml
apiVersion: v1
kind: Service
metadata:
name: service-a
spec:
selector:
app: service-a
ports:
- port: 80
targetPort: 8080
type: ClusterIP
---
# apps/base/service-a/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
Environment Overlays
# apps/overlays/dev/service-a/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: backend-dev
resources:
- ../../../base/service-a
replicas:
- name: service-a
count: 1
images:
- name: company.azurecr.io/service-a
newTag: latest
patches:
- target:
kind: Deployment
name: service-a
patch: |-
- op: replace
path: /spec/template/spec/containers/0/env/0/value
value: debug
---
# apps/overlays/production/service-a/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: backend-prod
resources:
- ../../../base/service-a
replicas:
- name: service-a
count: 4
images:
- name: company.azurecr.io/service-a
newTag: v1.2.3 # Pinned version in prod
patches:
- target:
kind: Deployment
name: service-a
patch: |-
- op: add
path: /spec/template/spec/affinity
value:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: service-a
topologyKey: kubernetes.io/hostname
Phase 6: CI/CD Integration (Week 4)
Azure DevOps Pipeline
# azure-pipelines.yml (in service repositories)
trigger:
branches:
include:
- main
pool:
vmImage: 'ubuntu-latest'
variables:
dockerRegistry: 'company.azurecr.io'
imageName: 'service-a'
gitopsRepo: 'platform-gitops'
stages:
- stage: Build
jobs:
- job: BuildAndPush
steps:
- task: Docker@2
displayName: Build and Push Image
inputs:
containerRegistry: 'AzureContainerRegistry'
repository: '$(imageName)'
command: 'buildAndPush'
Dockerfile: '**/Dockerfile'
tags: |
$(Build.BuildId)
latest
- stage: UpdateGitOps
dependsOn: Build
jobs:
- job: UpdateManifests
steps:
- checkout: none
- bash: |
git clone https://$(GITHUB_TOKEN)@github.com/company/$(gitopsRepo).git
cd $(gitopsRepo)
# Update image tag in dev overlay
cd apps/overlays/dev/$(imageName)
kustomize edit set image company.azurecr.io/$(imageName):$(Build.BuildId)
# Commit and push
git config user.email "ci@company.com"
git config user.name "Azure Pipelines"
git add .
git commit -m "Update $(imageName) to build $(Build.BuildId)"
git push origin main
displayName: 'Update GitOps Repository'
env:
GITHUB_TOKEN: $(GITHUB_TOKEN)
Deployment flow:
- Developer merges PR to service repo
- Azure Pipeline builds Docker image
- Pipeline updates GitOps repo with new image tag
- ArgoCD detects change in Git
- ArgoCD syncs new image to cluster
- Health checks pass → deployment complete
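ArgoCD polls the Git repository roughly every three minutes by default, so step 4 can lag a bit behind the pipeline push. Registering a Git webhook that points at ArgoCD's /api/webhook endpoint makes detection near-instant. A minimal sketch of the ArgoCD side, assuming the GitOps repo lives on GitHub; the key below is merged into the existing argocd-secret and the secret value is a placeholder:
# Merged into argocd-secret so ArgoCD can validate GitHub webhook deliveries
# sent to https://argocd.company.com/api/webhook
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
type: Opaque
stringData:
  webhook.github.secret: <shared-webhook-secret>   # placeholder value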
Phase 7: Observability Integration (Week 5)
Prometheus ServiceMonitor for ArgoCD
# monitoring/argocd-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: argocd-metrics
namespace: argocd
spec:
selector:
matchLabels:
app.kubernetes.io/name: argocd-metrics
endpoints:
- port: metrics
interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: argocd-alerts
namespace: argocd
spec:
groups:
- name: argocd
interval: 30s
rules:
- alert: ArgoAppOutOfSync
expr: |
argocd_app_info{sync_status="OutOfSync"} == 1
for: 15m
labels:
severity: warning
annotations:
summary: "ArgoCD app {{ $labels.name }} out of sync"
description: "Application has been out of sync for 15 minutes"
- alert: ArgoAppUnhealthy
expr: |
argocd_app_info{health_status!="Healthy"} == 1
for: 10m
labels:
severity: critical
annotations:
summary: "ArgoCD app {{ $labels.name }} unhealthy"
description: "Application health status: {{ $labels.health_status }}"
Grafana Dashboard
# Import ArgoCD dashboard
# Dashboard ID: 14584 (official ArgoCD dashboard)
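If Grafana runs from its Helm chart (for example inside kube-prometheus-stack), the community dashboard can also be provisioned declaratively instead of imported by hand. A sketch assuming the chart's dashboards and dashboardProviders values and a datasource named Prometheus:
# grafana-values.yaml (excerpt; nest under grafana: for kube-prometheus-stack)
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: default
        orgId: 1
        type: file
        disableDeletion: false
        options:
          path: /var/lib/grafana/dashboards/default
dashboards:
  default:
    argocd:
      gnetId: 14584        # official ArgoCD dashboard on grafana.com
      revision: 1
      datasource: Prometheus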
Results and Impact
Quantitative Improvements
| Metric | Before | After | Improvement |
|---|---|---|---|
| Environment provisioning | 2-3 days | 15 minutes | 99% faster |
| Config drift incidents | 12/quarter | 0/quarter | Zero drift |
| Mean time to deployment | 4 hours | 35 minutes | 85% faster |
| Deployment rollback time | 45 minutes | 2 minutes | 95% faster |
| Failed deployments | 18/month | 1/month | 94% reduction |
| Audit compliance time | 3 days | Real-time | Instant visibility |
| Platform team tickets | 250/month | 70/month | 72% reduction |
Qualitative Wins
GitOps Benefits Realized:
Complete Audit Trail:
# Question: Who deployed what to production on Sept 15?
# Answer: git log --since="2025-09-15" --until="2025-09-16" -- apps/overlays/production/
# Shows: commits, authors, timestamps, diffs
# Compliance team went from 3 days → 5 minutes
Instant Rollbacks:
# Old way: 45 minutes of panic and tribal knowledge
# New way:
argocd app rollback service-a-prod
# Or via Git:
git revert HEAD
git push
# ArgoCD automatically syncs previous version
# Rollback complete in 2 minutes
Self-Healing Infrastructure:
One memorable incident: a junior engineer ran kubectl delete deployment service-b-prod by accident (wrong kubectl context).
Before ArgoCD: Outage until someone noticed and re-applied manifests manually.
After ArgoCD: ArgoCD detected drift, automatically recreated deployment in 30 seconds. Service restored before anyone noticed.
Developer Experience:
Developers told us:
- “I can finally see what’s actually deployed without asking the platform team”
- “Rollbacks went from terrifying to boring—exactly what you want”
- “PR reviews now include deployment config changes—we catch issues before prod”
Platform Team:
- “We went from 80% firefighting to 80% building new capabilities”
- “On-call went from constant interruptions to actually quiet nights”
- “We can onboard new engineers in days, not weeks”
Cost Optimization
While not the primary goal, we achieved cost savings:
- Reduced cluster waste: Better resource utilization (35% reduction in over-provisioned capacity)
- Platform team efficiency: 3 FTEs redeployed to product work (saved ~$450K/year)
- Incident reduction: Fewer outages, less revenue impact
- Compliance automation: $40K saved on SOC 2 audit prep
Key Lessons Learned
1. Git Becomes the Approval Gate
Insight: PR approvals became deployment approvals. This shifted security left.
Implementation:
- Production overlays require 2 approvals
- CODEOWNERS file enforces review by appropriate teams
- GitHub branch protection prevents direct pushes
# .github/CODEOWNERS
apps/overlays/production/** @platform-team @security-team
Result: Security team has visibility into every prod change before it happens.
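Branch protection itself can also live in Git. If repository settings are managed as code (for example with the Probot Settings app), a hedged sketch of the rule backing the two-approval requirement might look like this; the file path and schema come from that app, not from the client's setup:
# .github/settings.yml (illustrative, Probot Settings app schema)
branches:
  - name: main
    protection:
      required_pull_request_reviews:
        required_approving_review_count: 2
        require_code_owner_reviews: true
      required_status_checks:
        strict: true
        contexts: []
      enforce_admins: true
      restrictions: null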
2. Start with Dev, Perfect Before Prod
Mistake we almost made: Rolling out ArgoCD to production on day one.
What worked:
- Week 1-2: Dev environment only
- Week 3: Staging
- Week 4: Production (with manual sync initially)
- Week 5: Enable auto-sync in production
Lesson: Developers need time to internalize GitOps workflows. Staging is where you discover edge cases.
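Concretely, the staged rollout was just a change to each production Application's syncPolicy. An excerpt based on the service-a-prod Application from Phase 4:
# Week 4: no automated block, so the app syncs only when someone runs
# `argocd app sync service-a-prod`
syncPolicy:
  syncOptions:
    - CreateNamespace=true

# Week 5: auto-sync and self-heal enabled once teams trusted the workflow
syncPolicy:
  automated:
    prune: true
    selfHeal: true
  syncOptions:
    - CreateNamespace=true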
3. Auto-Sync with Self-Heal is Powerful (and Scary)
We debated: should we enable syncPolicy.automated.selfHeal in production?
Concern: What if ArgoCD automatically reverts a legitimate manual hotfix during an incident?
Solution:
- Enabled self-heal in prod
- Documented incident procedure: “Suspend ArgoCD sync during active incidents”
# During incident, pause auto-sync
argocd app set service-a-prod --sync-policy=none
# Apply hotfix manually
kubectl apply -f hotfix.yaml
# After incident, commit fix to Git and re-enable sync
git commit -m "Hotfix: increased memory limit"
argocd app set service-a-prod --sync-policy=automated
Result: Self-heal caught 47 accidental manual changes in first 3 months. Zero issues during incidents.
4. ArgoCD Projects Provide RBAC Boundaries
We created separate ArgoCD Projects per team:
- backend-project → backend-team can only deploy to backend-* namespaces
- frontend-project → frontend-team can only deploy to frontend-* namespaces
- platform-project → platform-team has full access
Benefit: Teams have autonomy without risk of accidentally deploying to wrong namespace or cluster.
5. Image Tag Strategy Matters
Anti-pattern: Using :latest tag in production.
What we did:
- Dev: :latest (fast iteration)
- Staging: :build-12345 (commit SHA or build ID)
- Production: :v1.2.3 (semantic version tags)
Promotion flow:
# After staging validation
cd apps/overlays/production/service-a
kustomize edit set image company.azurecr.io/service-a:v1.2.3
git add kustomization.yaml
git commit -m "Promote service-a v1.2.3 to production"
git push origin main
6. Monitor ArgoCD Itself
ArgoCD is now critical infrastructure. We added:
- Health checks: Prometheus alerts on controller/repo-server health
- Sync lag alerts: Alert if ArgoCD hasn’t synced in 10 minutes
- Webhook monitoring: Alert if Git webhook delivery fails
# Alert if ArgoCD sync is stuck
time() - argocd_app_reconcile_time > 600
7. Handling Secrets with Sealed Secrets
We don’t store secrets in Git (obviously). Options we evaluated:
- Azure Key Vault + External Secrets Operator (our choice)
- Sealed Secrets (Bitnami)
- SOPS with Age encryption
# Example: ExternalSecret referencing Azure Key Vault
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: service-a-secrets
spec:
refreshInterval: 1h
secretStoreRef:
name: azure-keyvault
kind: SecretStore
target:
name: service-a-secrets
data:
- secretKey: DB_PASSWORD
remoteRef:
key: service-a-db-password
ArgoCD syncs the ExternalSecret definition, and the External Secrets Operator (ESO) fetches the actual secret value from Key Vault.
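The ExternalSecret above points at a SecretStore named azure-keyvault, which holds the Key Vault connection details. A minimal sketch assuming AKS workload identity; the vault URL and service account name are placeholders:
# external-secrets/azure-keyvault-store.yaml (illustrative)
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: azure-keyvault
spec:
  provider:
    azurekv:
      authType: WorkloadIdentity
      vaultUrl: "https://<key-vault-name>.vault.azure.net"
      serviceAccountRef:
        name: external-secrets-sa   # federated with a user-assigned managed identity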
Future Roadmap
Q2 2025: Progressive Delivery
Integrate Argo Rollouts for canary and blue-green deployments:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: service-a
spec:
strategy:
canary:
steps:
- setWeight: 20
- pause: {duration: 5m}
- setWeight: 50
- pause: {duration: 10m}
- setWeight: 100
Automated rollback based on Prometheus metrics.
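That metric-driven rollback would pair the canary steps with an Argo Rollouts AnalysisTemplate. A sketch with an assumed in-cluster Prometheus address and a hypothetical http_requests_total success-rate query:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95   # abort and roll back below 95% success
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # assumed address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))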
Q3 2025: Multi-Cluster Management
Deploy ArgoCD in a hub-and-spoke model:
- Central ArgoCD instance in management cluster
- Manages applications across dev, staging, prod clusters
- Single pane of glass for all environments
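Each spoke cluster gets registered with the central ArgoCD instance, either with argocd cluster add or declaratively as a cluster Secret. A declarative sketch with placeholder server address and credentials:
# Declarative cluster registration (credentials elided)
apiVersion: v1
kind: Secret
metadata:
  name: aks-prod-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: aks-prod
  server: https://<aks-prod-api-server-fqdn>:443
  config: |
    {
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-ca-cert>"
      },
      "bearerToken": "<service-account-token>"
    }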
Q4 2025: Application Sets
Use ApplicationSets to reduce YAML duplication:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: services
spec:
generators:
- list:
elements:
- name: service-a
namespace: backend
- name: service-b
namespace: backend
template:
metadata:
name: '{{name}}-prod'
spec:
source:
path: apps/overlays/production/{{name}}
# ... rest of template
One ApplicationSet generates Applications for all services.
Conclusion
GitOps with ArgoCD transformed this company’s delivery model from manual and brittle to automated and reliable. The numbers are impressive (85% faster deployments, zero drift, 2-minute rollbacks), but the real win is cultural: Git became the interface for production.
Deployments are no longer scary. They’re boring—in the best possible way. Developers commit to Git, ArgoCD handles the rest. Rollbacks are Git reverts. Audit history is Git log. Configuration drift is impossible because ArgoCD continuously reconciles.
If you’re still running kubectl apply commands against production, you’re one typo away from an outage. GitOps with ArgoCD gives you declarative infrastructure, complete auditability, and automated reconciliation—exactly what modern platforms need.
About StriveNimbus
StriveNimbus specializes in platform engineering, GitOps implementation, and cloud-native architecture for Azure environments. We help organizations build reliable, auditable deployment pipelines that scale.
Ready to implement GitOps? Contact us for a platform assessment and GitOps transformation roadmap.
Technical Appendix
ArgoCD CLI Cheat Sheet
# Login to ArgoCD
argocd login argocd.company.com
# List applications
argocd app list
# Get application details
argocd app get service-a-prod
# Sync application manually
argocd app sync service-a-prod
# Rollback to previous version
argocd app rollback service-a-prod
# View sync history
argocd app history service-a-prod
# View diff between Git and cluster
argocd app diff service-a-prod
# Delete application (remove from cluster)
argocd app delete service-a-prod
# Suspend auto-sync
argocd app set service-a-prod --sync-policy=none
# Re-enable auto-sync
argocd app set service-a-prod --sync-policy=automated
Troubleshooting Common Issues
Issue: Application stuck in “Progressing” state
# Check events
argocd app get service-a-prod
# View logs
kubectl logs -n argocd deployment/argocd-application-controller | grep service-a-prod
# Check sync status
kubectl describe application service-a-prod -n argocd
Issue: ArgoCD not detecting Git changes
# Check webhook delivery in GitHub/Azure DevOps
# Manually refresh
argocd app get service-a-prod --refresh
# Force hard refresh
argocd app get service-a-prod --hard-refresh
Issue: Sync fails with “permission denied”
# Check ArgoCD Project RBAC
kubectl get appproject backend -n argocd -o yaml
# Verify destination cluster is registered
argocd cluster list