Chaos Mesh Cheat Sheet

Overview

Chaos Mesh is a CNCF incubating chaos engineering platform built specifically for Kubernetes environments. It provides a comprehensive set of fault injection capabilities including pod failures, network chaos, I/O chaos, time skew, JVM faults, and kernel-level disruptions. Chaos Mesh uses Kubernetes CRDs to define experiments declaratively, making chaos testing a natural part of GitOps and infrastructure-as-code workflows.

The platform includes a web dashboard (Chaos Dashboard) for designing and monitoring experiments, namespace-scoped RBAC for controlling who can run chaos where, and a workflow engine for orchestrating multi-step scenarios. Chaos Mesh needs no sidecars: a controller manager reconciles the experiment CRDs, and a privileged chaos-daemon DaemonSet on each node injects faults into the target pods' Linux namespaces, so application containers and images stay unmodified.

Installation

Install via Helm

# Add Chaos Mesh Helm repo
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Create namespace
kubectl create namespace chaos-mesh

# Install Chaos Mesh
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set dashboard.securityMode=true

# Alternative for clusters on the Docker runtime (run this instead of the containerd command above)
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --set chaosDaemon.runtime=docker \
  --set chaosDaemon.socketPath=/var/run/docker.sock

# Verify installation
kubectl get pods -n chaos-mesh
kubectl get crds | grep chaos-mesh
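
The `--set` flags above can also live in a values file, which is easier to keep under version control. A minimal sketch, assuming a containerd cluster; the filename `values-containerd.yaml` is hypothetical, and the keys simply mirror the flags already shown:

```yaml
# values-containerd.yaml (hypothetical filename)
# Mirrors the --set flags from the containerd install command.
chaosDaemon:
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
dashboard:
  securityMode: true
```

It would be passed with `helm install chaos-mesh chaos-mesh/chaos-mesh --namespace chaos-mesh -f values-containerd.yaml`.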

Install via Script

# Quick install into a local kind cluster named chaos-testing (intended for test environments)
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash -s -- \
  --local kind \
  --name chaos-testing

# Install specific version
curl -sSL https://mirrors.chaos-mesh.org/v2.7.0/install.sh | bash

Install chaosctl CLI

# Download chaosctl
curl -sSL https://github.com/chaos-mesh/chaos-mesh/releases/latest/download/chaosctl-linux-amd64 \
  -o /usr/local/bin/chaosctl
chmod +x /usr/local/bin/chaosctl

# Verify installation
chaosctl version

# Debug experiments of a given type (for example networkchaos) in a namespace
chaosctl debug networkchaos --namespace default

Core Commands — Pod Chaos

Pod Kill and Failure

# pod-kill.yaml — Kill specific pods
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: default
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-server
  # pod-kill is a one-shot action, so duration has no effect and is omitted
  gracePeriod: 0

---
# pod-failure.yaml — Make pods unavailable
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
  namespace: default
spec:
  action: pod-failure
  mode: fixed-percent
  value: "50"
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-server
  duration: "120s"

---
# container-kill.yaml — Kill specific container in pod
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: container-kill-example
  namespace: default
spec:
  action: container-kill
  mode: one
  containerNames:
    - sidecar
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-server
  # container-kill is a one-shot action, so duration has no effect and is omitted

# Apply chaos experiment
kubectl apply -f pod-kill.yaml

# Check experiment status (Chaos Mesh 2.x exposes desiredPhase and conditions)
kubectl get podchaos pod-kill-example -o jsonpath='{.status.experiment.desiredPhase}'

# List chaos experiments in all namespaces
kubectl get podchaos,networkchaos,stresschaos,iochaos -A

# Delete experiment (stops chaos)
kubectl delete podchaos pod-kill-example
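
An experiment can also be paused rather than deleted by setting the `experiment.chaos-mesh.org/pause` annotation. The sketch below reuses the pod-failure example; how quickly injected faults are recovered after pausing is worth confirming on your Chaos Mesh version:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
  namespace: default
  annotations:
    # "true" pauses the experiment; remove the annotation to resume
    experiment.chaos-mesh.org/pause: "true"
spec:
  action: pod-failure
  mode: fixed-percent
  value: "50"
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-server
  duration: "120s"
```

On a live object the same annotation can be set with `kubectl annotate podchaos pod-failure-example experiment.chaos-mesh.org/pause=true` and removed with `kubectl annotate podchaos pod-failure-example experiment.chaos-mesh.org/pause-`.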

Core Commands — Network Chaos

# network-delay.yaml — Inject network latency
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: frontend
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "25"
  direction: to
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        app: backend
    mode: all
  duration: "120s"

---
# network-loss.yaml — Inject packet loss
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss
  namespace: default
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: api-server
  loss:
    loss: "30"
    correlation: "25"
  duration: "60s"

---
# network-partition.yaml — Network partition between services
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
  namespace: default
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: service-a
  direction: both
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        app: service-b
    mode: all
  duration: "60s"

---
# network-bandwidth.yaml — Limit bandwidth
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bandwidth-limit
  namespace: default
spec:
  action: bandwidth
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: data-service
  bandwidth:
    rate: "1mbps"
    limit: 20971520
    buffer: 10000
  duration: "120s"
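
NetworkChaos can also target endpoints outside the cluster through `externalTargets`, which applies when `direction` is `to`. A hedged sketch; the experiment name and the hostname `api.example.com` are placeholders, not values from this cheat sheet:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: external-delay        # hypothetical name
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: frontend
  direction: to
  externalTargets:
    - api.example.com         # placeholder; an IPv4 address or a domain
  delay:
    latency: "200ms"
  duration: "60s"
```

This is useful for testing how a service copes with a slow third-party dependency without touching any in-cluster peer.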

Core Commands — Stress Chaos

# cpu-stress.yaml — CPU stress testing
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
  namespace: default
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: compute-service
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: "120s"

---
# memory-stress.yaml — Memory stress testing
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
  namespace: default
spec:
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: memory-service
  stressors:
    memory:
      workers: 1
      size: "512MB"
  duration: "60s"

Core Commands — I/O Chaos

# io-latency.yaml — Inject filesystem I/O latency
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency
  namespace: default
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: database
  volumePath: /var/lib/data
  path: "/var/lib/data/**/*"   # glob on the full file path inside volumePath
  delay: "100ms"
  percent: 50
  duration: "60s"

---
# io-fault.yaml — Inject I/O errors
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-fault
  namespace: default
spec:
  action: fault
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: database
  volumePath: /var/lib/data
  path: "/var/lib/data/**/*.log"   # glob on the full file path inside volumePath
  errno: 5                          # EIO (input/output error)
  percent: 100
  duration: "30s"

Configuration

RBAC Configuration

# chaos-rbac.yaml — Restrict chaos to specific namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-engineer
  namespace: staging
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-engineer-binding
  namespace: staging
subjects:
  - kind: User
    name: sre-engineer@company.com
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: chaos-engineer
  apiGroup: rbac.authorization.k8s.io
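
With `dashboard.securityMode=true`, Chaos Dashboard asks for a token at login. One common pattern is a namespace-scoped ServiceAccount whose Role also grants read access to pods and namespaces, so the dashboard can list candidate targets. A sketch under those assumptions; the account, role, and binding names are hypothetical:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-dashboard-user           # hypothetical name
  namespace: staging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-dashboard-user-role      # hypothetical name
  namespace: staging
rules:
  # Read access so the dashboard can list candidate target pods
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
  # Full control over chaos experiments in this namespace only
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-dashboard-user-binding   # hypothetical name
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: chaos-dashboard-user
    namespace: staging
roleRef:
  kind: Role
  name: chaos-dashboard-user-role
  apiGroup: rbac.authorization.k8s.io
```

On Kubernetes 1.24 and later, `kubectl create token chaos-dashboard-user -n staging` prints a short-lived token for this account.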

Dashboard Configuration

# dashboard-values.yaml for Helm
dashboard:
  securityMode: true
  service:
    type: ClusterIP
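  ingress:
    enabled: true
    hosts:
      - name: chaos.internal.company.com
    paths:
      - /
    annotations:
      nginx.ingress.kubernetes.io/auth-type: basic
      # Basic auth needs an existing htpasswd Secret; this Secret name is a placeholder
      nginx.ingress.kubernetes.io/auth-secret: chaos-dashboard-basic-auth
      nginx.ingress.kubernetes.io/auth-realm: "Chaos Mesh"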

Advanced Usage

Chaos Workflows

# workflow.yaml — Multi-step chaos scenario
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: resilience-test-workflow
  namespace: default
spec:
  entry: the-entry
  templates:
    - name: the-entry
      templateType: Serial
      deadline: "600s"
      children:
        - network-delay-step
        - pod-stress-step
        - recovery-validation

    - name: network-delay-step
      templateType: NetworkChaos
      deadline: "120s"
      networkChaos:
        action: delay
        mode: all
        selector:
          labelSelectors:
            app: api-server
        delay:
          latency: "300ms"
        duration: "90s"

    - name: pod-stress-step
      templateType: StressChaos
      deadline: "120s"
      stressChaos:
        mode: one
        selector:
          labelSelectors:
            app: api-server
        stressors:
          cpu:
            workers: 2
            load: 90
        duration: "90s"

    - name: recovery-validation
      templateType: Suspend
      deadline: "60s"

Time Chaos (Clock Skew)

# time-chaos.yaml — Inject clock skew
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: time-skew
  namespace: default
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: scheduler-service
  timeOffset: "-2h"
  clockIds:
    - CLOCK_REALTIME
  containerNames:
    - scheduler
  duration: "60s"

Schedule Chaos (Recurring)

# scheduled-chaos.yaml — Run chaos on a schedule
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: daily-pod-kill
  namespace: default
spec:
  schedule: "0 10 * * 1-5"
  startingDeadlineSeconds: 60
  concurrencyPolicy: Forbid
  historyLimit: 5
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        app: web-server
    # pod-kill is a one-shot action, so duration has no effect and is omitted

Troubleshooting

Issue	Cause	Solution
Chaos daemon not running	DaemonSet scheduling failed	Check node taints/tolerations on chaos-daemon
Network chaos has no effect	Wrong container runtime socket	Set correct `chaosDaemon.socketPath` in Helm values
I/O chaos not injecting	`volumePath` is not a volume mount point in the target container	Set `volumePath` to the container's mount path and keep `path` inside it
Permission denied on experiment	RBAC not configured	Create Role/RoleBinding for the user in target namespace
Time chaos not working	Clock IDs not available	Ensure kernel supports the specified clock ID
Experiment stays in `Running`	Duration not set	Always set `duration` to auto-cleanup experiments
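Dashboard not accessible	ClusterIP service is not exposed outside the cluster	Use `kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333`, switch to LoadBalancer, or configure Ingress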
Chaos not targeting correct pods	Label selector mismatch	Verify labels with `kubectl get pods --show-labels`
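
When a label selector alone matches too many or the wrong pods, the selector can be narrowed with additional fields. The sketch below combines `labelSelectors` with `expressionSelectors` and `podPhaseSelectors`, and caps the blast radius with `mode: fixed`. The experiment name and the `tier` label are assumptions for illustration:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: narrow-selector-example   # hypothetical name
  namespace: default
spec:
  action: pod-failure
  mode: fixed
  value: "2"                      # affect at most two of the matched pods
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-server
    expressionSelectors:
      - key: tier                 # hypothetical label key
        operator: In
        values:
          - frontend
    podPhaseSelectors:
      - Running                   # skip Pending or Succeeded pods
  duration: "60s"
```

Running `kubectl get pods -n default -l app=web-server,tier=frontend --field-selector status.phase=Running` first shows which pods such a selector would match.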

# Debug chaos daemon
kubectl logs -n chaos-mesh -l app.kubernetes.io/component=chaos-daemon --tail=50

# Check experiment status
kubectl describe podchaos <name>
kubectl get events --field-selector involvedObject.name=<chaos-name>

# List all chaos resources
kubectl get podchaos,networkchaos,stresschaos,iochaos,timechaos -A

# Clean up all chaos experiments
kubectl delete podchaos,networkchaos,stresschaos,iochaos,timechaos --all -A

# Collect logs from the Chaos Mesh components (controller manager, daemon, dashboard)
chaosctl logs