Breaking Kubernetes to Build Confidence: A Chaos Mesh POC on Minikube
A hands-on guide to implementing chaos engineering with Chaos Mesh on Minikube. Learn how to build resilient Kubernetes applications by intentionally breaking them—Pod kills, network delays, and real-world verification strategies.
I’ll never forget the first time a production Kubernetes cluster failed in a way we’d never tested. It was 2 AM, the on-call pager was screaming, and our application was down because a single pod failure cascaded through the entire system. We had unit tests, integration tests, even end-to-end tests—but we’d never tested what happens when Kubernetes itself misbehaves.
That’s when I discovered chaos engineering. Not as some academic exercise, but as a practical necessity for building systems that can handle the real world’s messiness.
Today, I’m going to walk you through setting up Chaos Mesh on Minikube—a zero-cost, local environment where you can break things safely and learn what resilience actually means.
Why Chaos Engineering Matters
Let’s get real for a second. Your Kubernetes cluster will fail. Pods will crash. Networks will slow down. Nodes will disappear. The question isn’t if, it’s when—and whether your system can handle it gracefully.
Traditional testing verifies that your code works when everything is perfect. Chaos engineering verifies that your system survives when everything goes wrong.
Here’s what chaos engineering helps you discover:
- Hidden dependencies: That “stateless” service that’s actually relying on a specific pod being alive
- Recovery blindspots: Your health checks pass, but your app is still broken
- Performance cliffs: Everything works fine until one network hop adds 200ms of latency
- Cascade failures: One component failing triggers a domino effect you never anticipated
I’ve seen chaos engineering catch production-breaking bugs that made it through code review, CI/CD, staging environments, and manual QA. It’s not a replacement for those practices—it’s the safety net that catches what they miss.
Enter Chaos Mesh
Chaos Mesh is a cloud-native chaos engineering platform built specifically for Kubernetes. Think of it as a controlled demolition toolkit for your cluster.
What makes Chaos Mesh compelling:
- Kubernetes-native: Uses CRDs (Custom Resource Definitions) so chaos experiments are just another Kubernetes resource
- Comprehensive failure scenarios: Network chaos, I/O chaos, stress testing, time chaos, kernel faults
- Safety mechanisms: Built-in safeguards to prevent accidentally nuking your entire cluster
- Observable: Integrates with Prometheus, Grafana, and standard Kubernetes observability tools
- Scheduler: Run chaos on schedules or as one-off experiments
Most importantly, it’s production-ready. I’ve used Chaos Mesh in multi-million dollar environments, but today we’re starting simpler—a local Minikube cluster where you can experiment freely.
Prerequisites: What You’ll Need
Before we dive in, make sure you have:
- Docker Desktop installed and running
- Minikube (version 1.30+ recommended)
- kubectl configured
- Helm 3 (optional but recommended for easier installation)
- At least 4GB of RAM allocated to Docker Desktop (Minikube + Chaos Mesh need breathing room)
You can verify your setup:
# Check Minikube
minikube version
# Check kubectl
kubectl version --client
# Check Helm
helm version
All good? Let’s break some things.
Setting Up Minikube
First, let’s spin up a local Kubernetes cluster. I’m using the Docker driver because it’s cross-platform and plays nicely with Docker Desktop:
# Start Minikube with enough resources
minikube start --cpus=4 --memory=4096 --driver=docker
# Verify it's running
kubectl get nodes
You should see something like:
NAME STATUS ROLES AGE VERSION
minikube Ready control-plane 45s v1.28.3
That’s your playground. A real Kubernetes cluster, running locally, ready to be chaotic.
Installing Chaos Mesh
There are multiple ways to install Chaos Mesh, but I prefer Helm for its simplicity and reproducibility:
# Add the Chaos Mesh Helm repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
# Create a namespace for Chaos Mesh
kubectl create namespace chaos-mesh
# Install Chaos Mesh
helm install chaos-mesh chaos-mesh/chaos-mesh \
--namespace=chaos-mesh \
--set chaosDaemon.runtime=containerd \
--set chaosDaemon.socketPath=/run/containerd/containerd.sock \
--version 2.6.3
Note: the settings above assume containerd as the container runtime. Depending on your Minikube version and start flags, the node may be running Docker instead; in that case, change the values to runtime=docker and socketPath=/var/run/docker.sock.
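Not sure which runtime your cluster is actually using? The node will tell you:
# The CONTAINER-RUNTIME column shows containerd://… or docker://…
kubectl get nodes -o wide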
Wait for everything to come up:
kubectl get pods -n chaos-mesh -w
You should see:
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-6c7b8d8c4b-xvt9m 1/1 Running 0 2m
chaos-daemon-vl7xn 1/1 Running 0 2m
chaos-dashboard-6d4b8f9d8c-kzp4x 1/1 Running 0 2m
Chaos Mesh is now installed. Let’s verify that its Custom Resource Definitions are registered in the cluster:
kubectl get crds | grep chaos-mesh.org
You should see a list of Custom Resource Definitions like podchaos.chaos-mesh.org, networkchaos.chaos-mesh.org, etc. These are the tools we’ll use to inject failures.
Accessing the Chaos Mesh Dashboard
Chaos Mesh includes a web UI that’s incredibly helpful for visualizing experiments:
# Port-forward the dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
Open your browser to http://localhost:2333. Depending on your version and settings, the dashboard may ask for an RBAC token first; the login screen links to instructions for generating one. Once you’re in, you’ll see the Chaos Mesh dashboard: a clean interface for creating, monitoring, and analyzing chaos experiments.
For now, let’s stick with YAML and kubectl (the DevOps way), but the dashboard is great for exploring what’s possible.
Setting Up a Target Application
We need something to break. Let’s deploy a simple web application:
# Create a namespace for our demo app
kubectl create namespace demo-app
# Deploy a simple nginx application
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: demo-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web-app-service
  namespace: demo-app
spec:
  selector:
    app: web-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: ClusterIP
EOF
Verify it’s running:
kubectl get pods -n demo-app
You should see three nginx pods, all running. This is our baseline—a healthy, functioning application. Let’s see how it handles adversity.
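Before breaking anything, grab one baseline number from inside the cluster so you have something to compare against later. (The curlimages/curl image is just a convenient throwaway client; any image with curl works.)
# One-off curl pod hitting the Service, printing status and timing
kubectl run baseline-check -n demo-app --rm -i --restart=Never \
  --image=curlimages/curl --command -- \
  curl -s -o /dev/null -w "Status: %{http_code} - Time: %{time_total}s\n" \
  http://web-app-service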
Experiment 1: Pod Kill Chaos
This is the most common failure scenario in Kubernetes: a pod suddenly dies. Maybe it’s an OOMKill, a node failure, or a bug that causes a crash. Your application should handle this gracefully—but does it?
The Experiment YAML
Create a file called pod-kill-chaos.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-experiment
  namespace: demo-app
spec:
  schedule: "@every 30s"
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - demo-app
      labelSelectors:
        app: web-app
Let’s break this down:
- schedule: "@every 30s": Re-runs the experiment every 30 seconds, so we see multiple failures. In Chaos Mesh 2.x, recurring chaos is expressed with a Schedule resource that wraps the chaos spec (the old inline scheduler field no longer exists).
- action: pod-kill: Randomly terminates a pod (simulates crashes)
- mode: one: Kills one pod per run (other options include all, fixed, and fixed-percent)
- selector: Targets pods in the demo-app namespace with the label app: web-app
Running the Experiment
Apply the chaos:
kubectl apply -f pod-kill-chaos.yaml
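Before watching the pods, it’s worth confirming that the Schedule was accepted and is spawning runs; each run shows up as a child PodChaos object, which you can describe to see exactly what was targeted.
# The Schedule itself, plus the child PodChaos objects it creates on each run
kubectl get schedule,podchaos -n demo-app
# Records and events for recent runs (what was selected, when it was killed)
kubectl describe podchaos -n demo-app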
Verification and Observation
Open a second terminal and watch your pods:
kubectl get pods -n demo-app -w
You’ll see something like this:
NAME READY STATUS RESTARTS AGE
web-app-7d4b8c9f6d-abc12 1/1 Running 0 5m
web-app-7d4b8c9f6d-def34 1/1 Running 0 5m
web-app-7d4b8c9f6d-ghi56 1/1 Running 0 5m
web-app-7d4b8c9f6d-abc12 1/1 Terminating 0 5m
web-app-7d4b8c9f6d-jkl78 0/1 Pending 0 0s
web-app-7d4b8c9f6d-jkl78 0/1 ContainerCreating 0 0s
web-app-7d4b8c9f6d-jkl78 1/1 Running 0 2s
What you’re observing:
- Chaos Mesh kills a pod (Terminating)
- Kubernetes immediately notices the replica count is off
- A new pod is scheduled and created
- After a few seconds, you’re back to 3 healthy replicas
This is Kubernetes doing what it does best—self-healing. But here’s the critical question: Did your application experience downtime?
Testing Application Availability
In a third terminal, continuously hit the service:
# Port-forward the service
kubectl port-forward -n demo-app svc/web-app-service 8080:80 &
# Continuously test availability
while true; do
curl -s -o /dev/null -w "Status: %{http_code} - Time: %{time_total}s\n" http://localhost:8080
sleep 1
done
With 3 replicas, requests handled by the surviving pods keep succeeding. (One local-setup caveat: kubectl port-forward pins itself to a single pod behind the Service, so if the chaos happens to kill that pod, the forward itself drops and needs to be restarted; inside the cluster, the Service would simply route around the dead pod.) But what if you scale down to 1 replica?
kubectl scale deployment web-app -n demo-app --replicas=1
Now watch those curl requests. You’ll likely see some failures during pod restarts. This is the insight: Your application needs either multiple replicas or graceful termination handling to survive pod kills.
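What might “graceful termination handling” look like here? A minimal sketch for this nginx Deployment (the probe path and the 5-second preStop sleep are illustrative values): a readiness probe so the Service only routes traffic to pods that are actually ready, plus a short preStop pause so in-flight requests can drain before the container stops.
# Additions to the web-app container in the Deployment (illustrative values)
containers:
  - name: nginx
    image: nginx:1.25
    ports:
      - containerPort: 80
    readinessProbe:        # only route traffic to pods that answer on /
      httpGet:
        path: /
        port: 80
      periodSeconds: 2
    lifecycle:
      preStop:             # brief pause so the endpoint is removed before shutdown
        exec:
          command: ["sleep", "5"]
With a single replica you’ll still see a gap while the replacement pod starts; the probe and the preStop hook mainly keep traffic away from pods that aren’t ready yet or are already shutting down.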
Cleaning Up
Stop the chaos:
kubectl delete -f pod-kill-chaos.yaml
Experiment 2: Network Delay Chaos
Pod failures are dramatic, but network latency is insidious. A sudden 500ms delay between your service and its database can turn a responsive application into a timeout-riddled nightmare.
Let’s simulate this.
The Experiment YAML
Create network-delay-chaos.yaml:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-experiment
  namespace: demo-app
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - demo-app
    labelSelectors:
      app: web-app
  delay:
    latency: "300ms"
    correlation: "50"
    jitter: "100ms"
  duration: "5m"
Key parameters:
- action: delay: Adds latency to network packets
- mode: all: Affects all matching pods
- latency: “300ms”: Base delay
- jitter: “100ms”: Random variance (so delays range from 200-400ms)
- correlation: “50”: Makes consecutive packets have similar delays (more realistic)
- duration: “5m”: The injection lasts 5 minutes, then lifts automatically (wrap the spec in a Schedule resource, as in Experiment 1, if you want it to recur)
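The spec above delays all traffic in and out of the matching pods. In real incidents the pain is usually on one specific hop, and NetworkChaos can narrow the blast radius with direction and a target selector. A sketch of just the spec section, assuming a hypothetical database workload labeled app: postgres:
# Only delay traffic from web-app pods toward the (hypothetical) postgres pods
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - demo-app
    labelSelectors:
      app: web-app
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - demo-app
      labelSelectors:
        app: postgres
  delay:
    latency: "300ms"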
Running the Experiment
Apply it:
kubectl apply -f network-delay-chaos.yaml
Verification and Observation
Test response times:
# Make sure you have 3 replicas again
kubectl scale deployment web-app -n demo-app --replicas=3
# Test response times
for i in {1..20}; do
curl -s -o /dev/null -w "Request $i - Time: %{time_total}s\n" http://localhost:8080
sleep 1
done
Without chaos, you’d see response times of a few milliseconds. With the delay active, responses should land in the 0.300s+ range. (If the numbers through the port-forward barely move, test from inside the cluster instead; port-forwarded traffic doesn’t always take the same network path as pod-to-pod traffic. The in-cluster check a bit further down works either way.)
Real-world implications:
- This might not break your application, but it could trigger timeouts in dependent services
- If your frontend has a 500ms timeout for backend calls, you just started seeing failures
- Database queries that used to take 50ms now take 350ms—multiplied across thousands of requests, this is a scalability disaster
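You can make the timeout scenario concrete by giving a client a time budget tighter than the injected delay. Measuring from inside the cluster keeps the port-forward hop out of the picture (the 0.25-second budget and the curlimages/curl image are arbitrary choices):
# A one-off in-cluster client with a 250ms budget; with the 300ms delay active it should fail
kubectl run timeout-check -n demo-app --rm -i --restart=Never \
  --image=curlimages/curl --command -- \
  sh -c 'curl -s -o /dev/null --max-time 0.25 -w "Status: %{http_code}\n" http://web-app-service \
    || echo "Timed out - this is what your downstream callers would see"'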
What You Should Look For
Check your application logs:
kubectl logs -n demo-app -l app=web-app --tail=50
Do you see timeout errors? Connection pool exhaustion? These are the symptoms of latency issues that only chaos engineering exposes.
Cleaning Up
Remove the chaos:
kubectl delete networkchaos network-delay-experiment -n demo-app
Putting It Together: A Realistic Chaos Scenario
In production, failures don’t happen in isolation. Let’s create a combined experiment that’s more realistic:
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: realistic-chaos-workflow
  namespace: demo-app
spec:
  entry: the-chaos
  templates:
    - name: the-chaos
      templateType: Parallel
      children:
        - pod-failure-stress
        - network-latency-stress
    - name: pod-failure-stress
      templateType: PodChaos
      deadline: 5m
      podChaos:
        action: pod-kill
        mode: fixed
        value: "1"
        selector:
          namespaces:
            - demo-app
          labelSelectors:
            app: web-app
    - name: network-latency-stress
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces:
            - demo-app
          labelSelectors:
            app: web-app
        delay:
          latency: "250ms"
          jitter: "50ms"
This workflow runs both pod kills and network delays simultaneously for 5 minutes. It’s closer to what actual production incidents look like—multiple failures compounding.
Apply it and watch your application struggle:
kubectl apply -f realistic-chaos-workflow.yaml
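You can follow the workflow’s progress with the same kubectl patterns as before; the Workflow spawns child chaos objects in the demo-app namespace, and deleting the Workflow lifts whatever is still injected:
# Watch the workflow and the chaos objects it spawns
kubectl get workflow -n demo-app
kubectl get podchaos,networkchaos -n demo-app
# Rerun the availability loop from Experiment 1 while this is active
# When you're done, delete the workflow to lift any remaining chaos
kubectl delete -f realistic-chaos-workflow.yaml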
If your application survives this with minimal impact, you’ve built something resilient.
Key Takeaways and Real-World Lessons
After running chaos experiments for years, here’s what I’ve learned:
1. Start Small, Measure Everything
Don’t start with “let’s kill random nodes in production.” Start in dev. Measure baseline performance, inject one type of failure, measure again. Build confidence incrementally.
2. Automate Chaos in CI/CD
The best teams run chaos experiments as part of their deployment pipeline. If your canary deployment can’t survive a pod kill, it doesn’t go to production.
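As a sketch of what that gate could look like (the manifest name, the localhost URL, and the 5% error budget are placeholders, not a prescription): inject chaos, hammer the service, and fail the step if the error rate crosses your threshold.
#!/usr/bin/env bash
# Hypothetical pipeline step: inject chaos, measure error rate, fail the build if it's too high
set -euo pipefail

kubectl apply -f pod-kill-chaos.yaml

FAILURES=0; TOTAL=60
for i in $(seq 1 $TOTAL); do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 http://localhost:8080 || echo "000")
  [ "$code" = "200" ] || FAILURES=$((FAILURES + 1))
  sleep 1
done

kubectl delete -f pod-kill-chaos.yaml

# Fail the step if more than 5% of requests failed during the chaos window
if [ $((FAILURES * 100)) -gt $((TOTAL * 5)) ]; then
  echo "Chaos gate failed: $FAILURES/$TOTAL requests failed"
  exit 1
fi
echo "Chaos gate passed: $FAILURES/$TOTAL requests failed"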
3. Chaos Reveals Hidden Dependencies
That service you thought was stateless? Chaos engineering will prove you wrong. I’ve seen applications mysteriously fail during pod kills because they were caching connection strings and never refreshing them.
4. Health Checks Aren’t Enough
Your /health endpoint returns 200, but your application is broken. Why? Because health checks test infrastructure, not functionality. Chaos engineering tests whether your app actually works under duress.
5. Recovery Time Matters More Than Uptime
A system that recovers in 2 seconds from a failure is better than one that never fails but takes 30 minutes to recover when it does. Chaos engineering optimizes for MTTR (Mean Time To Recovery), not just MTBF (Mean Time Between Failures).
Moving Beyond Minikube
This POC is a starting point. Here’s how to graduate to production chaos engineering:
- Run experiments in staging: Mirror your production environment, same traffic patterns
- Use GameDays: Schedule chaos experiments where the whole team participates and observes
- Integrate with monitoring: Tie Chaos Mesh to your Prometheus/Grafana setup so you correlate failures with metrics
- Automate verification: Don’t manually check if things broke—have automated assertions that verify expected behavior
- Document lessons learned: Every chaos experiment should result in either a code fix, a runbook update, or validation that you’re resilient
The Confidence Chaos Brings
Here’s the thing about chaos engineering: it feels uncomfortable at first. You’re intentionally breaking things. It goes against every instinct you have as an engineer.
But once you’ve run chaos in staging, seen your application survive pod kills and network delays, and deployed to production knowing you’ve stress-tested failure scenarios—that’s a different feeling. That’s confidence.
I sleep better knowing that the systems I’ve built have been battle-tested against realistic failures. Not just “will it work?” but “will it keep working when everything goes wrong?”
Chaos Mesh makes this accessible. You don’t need a massive budget or a dedicated SRE team. You need Minikube, 30 minutes, and the willingness to break things intentionally so they don’t break accidentally.
If you’re exploring chaos in your infrastructure, start small, measure recovery, and build confidence—one experiment at a time.
Now go break something. Intentionally.