Breaking Kubernetes to Build Confidence: A Chaos Mesh POC on Minikube
A hands-on guide to implementing chaos engineering with Chaos Mesh on Minikube. Learn how to build resilient Kubernetes applications by intentionally breaking them—Pod kills, network delays, and real-world verification strategies.
I’ll never forget the first time a production Kubernetes cluster failed in a way we’d never tested. It was 2 AM, the on-call pager was screaming, and our application was down because a single pod failure cascaded through the entire system. We had unit tests, integration tests, even end-to-end tests—but we’d never tested what happens when Kubernetes itself misbehaves.
That’s when I discovered chaos engineering. Not as some academic exercise, but as a practical necessity for building systems that can handle the real world’s messiness.
Today, I’m going to walk you through setting up Chaos Mesh on Minikube—a zero-cost, local environment where you can break things safely and learn what resilience actually means.
Why Chaos Engineering Matters
Let’s get real for a second. Your Kubernetes cluster will fail. Pods will crash. Networks will slow down. Nodes will disappear. The question isn’t if, it’s when—and whether your system can handle it gracefully.
Traditional testing verifies that your code works when everything is perfect. Chaos engineering verifies that your system survives when everything goes wrong.
Here’s what chaos engineering helps you discover:
- Hidden dependencies: That “stateless” service that’s actually relying on a specific pod being alive
- Recovery blindspots: Your health checks pass, but your app is still broken
- Performance cliffs: Everything works fine until one network hop adds 200ms of latency
- Cascade failures: One component failing triggers a domino effect you never anticipated
I’ve seen chaos engineering catch production-breaking bugs that made it through code review, CI/CD, staging environments, and manual QA. It’s not a replacement for those practices—it’s the safety net that catches what they miss.
Enter Chaos Mesh
Chaos Mesh is a cloud-native chaos engineering platform built specifically for Kubernetes. Think of it as a controlled demolition toolkit for your cluster.
What makes Chaos Mesh compelling:
- Kubernetes-native: Uses CRDs (Custom Resource Definitions) so chaos experiments are just another Kubernetes resource
- Comprehensive failure scenarios: Network chaos, I/O chaos, stress testing, time chaos, kernel faults
- Safety mechanisms: Built-in safeguards to prevent accidentally nuking your entire cluster
- Observable: Integrates with Prometheus, Grafana, and standard Kubernetes observability tools
- Scheduler: Run chaos on schedules or as one-off experiments
Most importantly, it’s production-ready. I’ve used Chaos Mesh in multi-million dollar environments, but today we’re starting simpler—a local Minikube cluster where you can experiment freely.
Prerequisites: What You’ll Need
Before we dive in, make sure you have:
- Docker Desktop installed and running
- Minikube (version 1.30+ recommended)
- kubectl configured
- Helm 3 (optional but recommended for easier installation)
- At least 4GB of RAM allocated to Docker Desktop (Minikube + Chaos Mesh need breathing room)
You can verify your setup:
# Check Minikube
minikube version
# Check kubectl
kubectl version --client
# Check Helm
helm version
All good? Let’s break some things.
Setting Up Minikube
First, let’s spin up a local Kubernetes cluster. I’m using the Docker driver because it’s cross-platform and plays nicely with Docker Desktop:
# Start Minikube with enough resources
minikube start --cpus=4 --memory=4096 --driver=docker
# Verify it's running
kubectl get nodes
You should see something like:
NAME STATUS ROLES AGE VERSION
minikube Ready control-plane 45s v1.28.3
That’s your playground. A real Kubernetes cluster, running locally, ready to be chaotic.
Installing Chaos Mesh
There are multiple ways to install Chaos Mesh, but I prefer Helm for its simplicity and reproducibility:
# Add the Chaos Mesh Helm repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
# Create a namespace for Chaos Mesh
kubectl create namespace chaos-mesh
# Install Chaos Mesh
helm install chaos-mesh chaos-mesh/chaos-mesh \
--namespace=chaos-mesh \
--set chaosDaemon.runtime=containerd \
--set chaosDaemon.socketPath=/run/containerd/containerd.sock \
--version 2.6.3
Note: the settings above assume containerd as the container runtime. Depending on your Minikube version and start flags, the node may be running Docker instead; in that case, change the values to runtime=docker and socketPath=/var/run/docker.sock.
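Not sure which runtime your cluster is actually using? The node will tell you:
# The CONTAINER-RUNTIME column shows containerd://… or docker://…
kubectl get nodes -o wide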
Wait for everything to come up:
kubectl get pods -n chaos-mesh -w
You should see:
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-6c7b8d8c4b-xvt9m 1/1 Running 0 2m
chaos-daemon-vl7xn 1/1 Running 0 2m
chaos-dashboard-6d4b8f9d8c-kzp4x 1/1 Running 0 2m
Chaos Mesh is now installed. Let’s verify that its Custom Resource Definitions are registered in the cluster:
kubectl get crds | grep chaos-mesh.org
You should see a list of Custom Resource Definitions like podchaos.chaos-mesh.org, networkchaos.chaos-mesh.org, etc. These are the tools we’ll use to inject failures.
Accessing the Chaos Mesh Dashboard
Chaos Mesh includes a web UI that’s incredibly helpful for visualizing experiments:
# Port-forward the dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
Open your browser to http://localhost:2333. Depending on your version and settings, the dashboard may ask for an RBAC token first; the login screen links to instructions for generating one. Once you’re in, you’ll see the Chaos Mesh dashboard: a clean interface for creating, monitoring, and analyzing chaos experiments.
For now, let’s stick with YAML and kubectl (the DevOps way), but the dashboard is great for exploring what’s possible.
Setting Up a Target Application
We need something to break. Let’s deploy a simple web application:
# Create a namespace for our demo app
kubectl create namespace demo-app
# Deploy a simple nginx application
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: demo-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web-app-service
  namespace: demo-app
spec:
  selector:
    app: web-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: ClusterIP
EOF
Verify it’s running:
kubectl get pods -n demo-app
You should see three nginx pods, all running. This is our baseline—a healthy, functioning application. Let’s see how it handles adversity.
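Before breaking anything, grab one baseline number from inside the cluster so you have something to compare against later. (The curlimages/curl image is just a convenient throwaway client; any image with curl works.)
# One-off curl pod hitting the Service, printing status and timing
kubectl run baseline-check -n demo-app --rm -i --restart=Never \
  --image=curlimages/curl --command -- \
  curl -s -o /dev/null -w "Status: %{http_code} - Time: %{time_total}s\n" \
  http://web-app-service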
Experiment 1: Pod Kill Chaos
This is the most common failure scenario in Kubernetes: a pod suddenly dies. Maybe it’s an OOMKill, a node failure, or a bug that causes a crash. Your application should handle this gracefully—but does it?
The Experiment YAML
Create a file called pod-kill-chaos.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-experiment
  namespace: demo-app
spec:
  schedule: "@every 30s"
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - demo-app
      labelSelectors:
        app: web-app
Let’s break this down:
- schedule: "@every 30s": Re-runs the experiment every 30 seconds, so we see multiple failures. In Chaos Mesh 2.x, recurring chaos is expressed with a Schedule resource that wraps the chaos spec (the old inline scheduler field no longer exists).
- action: pod-kill: Randomly terminates a pod (simulates crashes)
- mode: one: Kills one pod per run (other options include all, fixed, and fixed-percent)
- selector: Targets pods in the demo-app namespace with the label app: web-app
Running the Experiment
Apply the chaos:
kubectl apply -f pod-kill-chaos.yaml
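Before watching the pods, it’s worth confirming that the Schedule was accepted and is spawning runs; each run shows up as a child PodChaos object, which you can describe to see exactly what was targeted.
# The Schedule itself, plus the child PodChaos objects it creates on each run
kubectl get schedule,podchaos -n demo-app
# Records and events for recent runs (what was selected, when it was killed)
kubectl describe podchaos -n demo-app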
Verification and Observation
Open a second terminal and watch your pods:
kubectl get pods -n demo-app -w
You’ll see something like this:
NAME READY STATUS RESTARTS AGE
web-app-7d4b8c9f6d-abc12 1/1 Running 0 5m
web-app-7d4b8c9f6d-def34 1/1 Running 0 5m
web-app-7d4b8c9f6d-ghi56 1/1 Running 0 5m
web-app-7d4b8c9f6d-abc12 1/1 Terminating 0 5m
web-app-7d4b8c9f6d-jkl78 0/1 Pending 0 0s
web-app-7d4b8c9f6d-jkl78 0/1 ContainerCreating 0 0s
web-app-7d4b8c9f6d-jkl78 1/1 Running 0 2s
What you’re observing:
- Chaos Mesh kills a pod (Terminating)
- Kubernetes immediately notices the replica count is off
- A new pod is scheduled and created
- After a few seconds, you’re back to 3 healthy replicas
This is Kubernetes doing what it does best—self-healing. But here’s the critical question: Did your application experience downtime?
Testing Application Availability
In a third terminal, continuously hit the service:
# Port-forward the service
kubectl port-forward -n demo-app svc/web-app-service 8080:80 &
# Continuously test availability
while true; do
curl -s -o /dev/null -w "Status: %{http_code} - Time: %{time_total}s\n" http://localhost:8080
sleep 1
done
With 3 replicas, requests handled by the surviving pods keep succeeding. (One local-setup caveat: kubectl port-forward pins itself to a single pod behind the Service, so if the chaos happens to kill that pod, the forward itself drops and needs to be restarted; inside the cluster, the Service would simply route around the dead pod.) But what if you scale down to 1 replica?
kubectl scale deployment web-app -n demo-app --replicas=1
Now watch those curl requests. You’ll likely see some failures during pod restarts. This is the insight: Your application needs either multiple replicas or graceful termination handling to survive pod kills.
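What might “graceful termination handling” look like here? A minimal sketch for this nginx Deployment (the probe path and the 5-second preStop sleep are illustrative values): a readiness probe so the Service only routes traffic to pods that are actually ready, plus a short preStop pause so in-flight requests can drain before the container stops.
# Additions to the web-app container in the Deployment (illustrative values)
containers:
  - name: nginx
    image: nginx:1.25
    ports:
      - containerPort: 80
    readinessProbe:        # only route traffic to pods that answer on /
      httpGet:
        path: /
        port: 80
      periodSeconds: 2
    lifecycle:
      preStop:             # brief pause so the endpoint is removed before shutdown
        exec:
          command: ["sleep", "5"]
With a single replica you’ll still see a gap while the replacement pod starts; the probe and the preStop hook mainly keep traffic away from pods that aren’t ready yet or are already shutting down.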
Cleaning Up
Stop the chaos:
kubectl delete -f pod-kill-chaos.yaml
Experiment 2: Network Delay Chaos
Pod failures are dramatic, but network latency is insidious. A sudden 500ms delay between your service and its database can turn a responsive application into a timeout-riddled nightmare.
Let’s simulate this.
The Experiment YAML
Create network-delay-chaos.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-experiment
  namespace: demo-app
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - demo-app
    labelSelectors:
      app: web-app
  delay:
    latency: "300ms"
    correlation: "50"
    jitter: "100ms"
  duration: "5m"
Key parameters:
- action: delay: Adds latency to network packets
- mode: all: Affects all matching pods
- latency: “300ms”: Base delay
- jitter: “100ms”: Random variance (so delays range from 200-400ms)
- correlation: “50”: Makes consecutive packets have similar delays (more realistic)
- duration: “5m”: The injection lasts 5 minutes, then lifts automatically (wrap the spec in a Schedule resource, as in Experiment 1, if you want it to recur)
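The spec above delays all traffic in and out of the matching pods. In real incidents the pain is usually on one specific hop, and NetworkChaos can narrow the blast radius with direction and a target selector. A sketch of just the spec section, assuming a hypothetical database workload labeled app: postgres:
# Only delay traffic from web-app pods toward the (hypothetical) postgres pods
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - demo-app
    labelSelectors:
      app: web-app
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - demo-app
      labelSelectors:
        app: postgres
  delay:
    latency: "300ms"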
Running the Experiment
Apply it:
kubectl apply -f network-delay-chaos.yaml
Verification and Observation
Test response times:
# Make sure you have 3 replicas again
kubectl scale deployment web-app -n demo-app --replicas=3
# Test response times
for i in {1..20}; do
curl -s -o /dev/null -w "Request $i - Time: %{time_total}s\n" http://localhost:8080
sleep 1
done
Without chaos, you’d see response times of a few milliseconds. With the delay active, responses should land in the 0.300s+ range. (If the numbers through the port-forward barely move, test from inside the cluster instead; port-forwarded traffic doesn’t always take the same network path as pod-to-pod traffic. The in-cluster check a bit further down works either way.)
Real-world implications:
- This might not break your application, but it could trigger timeouts in dependent services
- If your frontend has a 500ms timeout for backend calls, you just started seeing failures
- Database queries that used to take 50ms now take 350ms—multiplied across thousands of requests, this is a scalability disaster
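You can make the timeout scenario concrete by giving a client a time budget tighter than the injected delay. Measuring from inside the cluster keeps the port-forward hop out of the picture (the 0.25-second budget and the curlimages/curl image are arbitrary choices):
# A one-off in-cluster client with a 250ms budget; with the 300ms delay active it should fail
kubectl run timeout-check -n demo-app --rm -i --restart=Never \
  --image=curlimages/curl --command -- \
  sh -c 'curl -s -o /dev/null --max-time 0.25 -w "Status: %{http_code}\n" http://web-app-service \
    || echo "Timed out - this is what your downstream callers would see"'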
What You Should Look For
Check your application logs:
kubectl logs -n demo-app -l app=web-app --tail=50
Do you see timeout errors? Connection pool exhaustion? These are the symptoms of latency issues that only chaos engineering exposes.
Cleaning Up
Remove the chaos:
kubectl delete networkchaos network-delay-experiment -n demo-app
Putting It Together: A Realistic Chaos Scenario
In production, failures don’t happen in isolation. Let’s create a combined experiment that’s more realistic:
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: realistic-chaos-workflow
  namespace: demo-app
spec:
  entry: the-chaos
  templates:
    - name: the-chaos
      templateType: Parallel
      children:
        - pod-failure-stress
        - network-latency-stress
    - name: pod-failure-stress
      templateType: PodChaos
      deadline: 5m
      podChaos:
        action: pod-kill
        mode: fixed
        value: "1"
        selector:
          namespaces:
            - demo-app
          labelSelectors:
            app: web-app
    - name: network-latency-stress
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces:
            - demo-app
          labelSelectors:
            app: web-app
        delay:
          latency: "250ms"
          jitter: "50ms"
This workflow runs both pod kills and network delays simultaneously for 5 minutes. It’s closer to what actual production incidents look like—multiple failures compounding.
Apply it and watch your application struggle:
kubectl apply -f realistic-chaos-workflow.yaml
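You can follow the workflow’s progress with the same kubectl patterns as before; the Workflow spawns child chaos objects in the demo-app namespace, and deleting the Workflow lifts whatever is still injected:
# Watch the workflow and the chaos objects it spawns
kubectl get workflow -n demo-app
kubectl get podchaos,networkchaos -n demo-app
# Rerun the availability loop from Experiment 1 while this is active
# When you're done, delete the workflow to lift any remaining chaos
kubectl delete -f realistic-chaos-workflow.yaml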
If your application survives this with minimal impact, you’ve built something resilient.
Key Takeaways and Real-World Lessons
After running chaos experiments for years, here’s what I’ve learned:
1. Start Small, Measure Everything
Don’t start with “let’s kill random nodes in production.” Start in dev. Measure baseline performance, inject one type of failure, measure again. Build confidence incrementally.
2. Automate Chaos in CI/CD
The best teams run chaos experiments as part of their deployment pipeline. If your canary deployment can’t survive a pod kill, it doesn’t go to production.
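As a sketch of what that gate could look like (the manifest name, the localhost URL, and the 5% error budget are placeholders, not a prescription): inject chaos, hammer the service, and fail the step if the error rate crosses your threshold.
#!/usr/bin/env bash
# Hypothetical pipeline step: inject chaos, measure error rate, fail the build if it's too high
set -euo pipefail

kubectl apply -f pod-kill-chaos.yaml

FAILURES=0; TOTAL=60
for i in $(seq 1 $TOTAL); do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 http://localhost:8080 || echo "000")
  [ "$code" = "200" ] || FAILURES=$((FAILURES + 1))
  sleep 1
done

kubectl delete -f pod-kill-chaos.yaml

# Fail the step if more than 5% of requests failed during the chaos window
if [ $((FAILURES * 100)) -gt $((TOTAL * 5)) ]; then
  echo "Chaos gate failed: $FAILURES/$TOTAL requests failed"
  exit 1
fi
echo "Chaos gate passed: $FAILURES/$TOTAL requests failed"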
3. Chaos Reveals Hidden Dependencies
That service you thought was stateless? Chaos engineering will prove you wrong. I’ve seen applications mysteriously fail during pod kills because they were caching connection strings and never refreshing them.
4. Health Checks Aren’t Enough
Your /health endpoint returns 200, but your application is broken. Why? Because health checks test infrastructure, not functionality. Chaos engineering tests whether your app actually works under duress.
5. Recovery Time Matters More Than Uptime
A system that recovers in 2 seconds from a failure is better than one that never fails but takes 30 minutes to recover when it does. Chaos engineering optimizes for MTTR (Mean Time To Recovery), not just MTBF (Mean Time Between Failures).
Moving Beyond Minikube
This POC is a starting point. Here’s how to graduate to production chaos engineering:
- Run experiments in staging: Mirror your production environment, same traffic patterns
- Use GameDays: Schedule chaos experiments where the whole team participates and observes
- Integrate with monitoring: Tie Chaos Mesh to your Prometheus/Grafana setup so you correlate failures with metrics
- Automate verification: Don’t manually check if things broke—have automated assertions that verify expected behavior
- Document lessons learned: Every chaos experiment should result in either a code fix, a runbook update, or validation that you’re resilient
The Confidence Chaos Brings
Here’s the thing about chaos engineering: it feels uncomfortable at first. You’re intentionally breaking things. It goes against every instinct you have as an engineer.
But once you’ve run chaos in staging, seen your application survive pod kills and network delays, and deployed to production knowing you’ve stress-tested failure scenarios—that’s a different feeling. That’s confidence.
I sleep better knowing that the systems I’ve built have been battle-tested against realistic failures. Not just “will it work?” but “will it keep working when everything goes wrong?”
Chaos Mesh makes this accessible. You don’t need a massive budget or a dedicated SRE team. You need Minikube, 30 minutes, and the willingness to break things intentionally so they don’t break accidentally.
If you’re exploring chaos in your infrastructure, start small, measure recovery, and build confidence—one experiment at a time.
Now go break something. Intentionally.