Chaos Mesh Cheat Sheet
Overview
Chaos Mesh is a CNCF incubating chaos engineering platform built specifically for Kubernetes environments. It provides a comprehensive set of fault injection capabilities including pod failures, network chaos, I/O chaos, time skew, JVM faults, and kernel-level disruptions. Chaos Mesh uses Kubernetes CRDs to define experiments declaratively, making chaos testing a natural part of GitOps and infrastructure-as-code workflows.
The platform includes a web dashboard for designing and monitoring experiments, RBAC-based control over who can run chaos in which namespaces, and a workflow engine for orchestrating multi-step chaos scenarios. Chaos Mesh does not use sidecars: a privileged DaemonSet (chaos-daemon) injects faults from the node at the kernel level, so application containers do not need to be modified.
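Most experiment CRDs follow the same outline, sketched here as an annotated PodChaos manifest. The name `example-experiment` and the label `app: my-app` are placeholders rather than values from a real cluster, and the list of `mode` values reflects the selection modes Chaos Mesh documents.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos                 # experiment type (NetworkChaos, StressChaos, IOChaos, ...)
metadata:
  name: example-experiment     # placeholder name
  namespace: default           # namespace the experiment object lives in
spec:
  action: pod-failure          # fault to inject; valid actions depend on the kind
  mode: fixed-percent          # one | all | fixed | fixed-percent | random-max-percent
  value: "50"                  # only needed by fixed, fixed-percent, random-max-percent
  selector:                    # which pods are eligible targets
    namespaces:
      - default
    labelSelectors:
      app: my-app              # placeholder label
  duration: "60s"              # how long the fault lasts before it is reverted
```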
Installation
Install via Helm
# Add Chaos Mesh Helm repo
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
# Create namespace
kubectl create namespace chaos-mesh
# Install Chaos Mesh
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set dashboard.securityMode=true
# For Docker runtime (use instead of the command above, not in addition to it)
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --set chaosDaemon.runtime=docker \
  --set chaosDaemon.socketPath=/var/run/docker.sock
# Verify installation
kubectl get pods -n chaos-mesh
kubectl get crds | grep chaos-mesh
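The same settings can be kept in a values file and passed with `helm install ... -f values.yaml`, which is easier to review in Git than a chain of `--set` flags. This is a sketch assuming a containerd cluster; the file name `values.yaml` is arbitrary, and the commented k3s and CRI-O socket paths are the commonly documented defaults, so check them against your nodes.

```yaml
# values.yaml (hypothetical file name)
chaosDaemon:
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
  # k3s:   runtime: containerd, socketPath: /run/k3s/containerd/containerd.sock
  # CRI-O: runtime: crio,       socketPath: /var/run/crio/crio.sock
dashboard:
  securityMode: true
```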
Install via Script
# Quick install script
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash -s -- \
  --local kind \
  --name chaos-testing
# Install specific version
curl -sSL https://mirrors.chaos-mesh.org/v2.7.0/install.sh | bash
Install chaosctl CLI
# Download chaosctl
curl -sSL https://github.com/chaos-mesh/chaos-mesh/releases/latest/download/chaosctl-linux-amd64 \
  -o /usr/local/bin/chaosctl
chmod +x /usr/local/bin/chaosctl
# Verify installation
chaosctl version
# Debug experiments of one chaos type (networkchaos, stresschaos, iochaos, httpchaos)
chaosctl debug networkchaos --namespace default
Core Commands — Pod Chaos
Pod Kill and Failure
# pod-kill.yaml: kill specific pods
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: default
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-server
  duration: "60s"
  gracePeriod: 0
---
# pod-failure.yaml: make pods unavailable
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
  namespace: default
spec:
  action: pod-failure
  mode: fixed-percent
  value: "50"
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-server
  duration: "120s"
---
# container-kill.yaml: kill a specific container in a pod
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: container-kill-example
  namespace: default
spec:
  action: container-kill
  mode: one
  containerNames:
    - sidecar
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-server
  duration: "30s"
# Apply chaos experiment
kubectl apply -f pod-kill.yaml
# Check experiment status (Chaos Mesh 2.x reports desiredPhase; see also .status.conditions)
kubectl get podchaos pod-kill-example -o jsonpath='{.status.experiment.desiredPhase}'
# List all running chaos experiments
kubectl get podchaos,networkchaos,stresschaos,iochaos -A
# Delete experiment (stops chaos)
kubectl delete podchaos pod-kill-example
Core Commands — Network Chaos
# network-delay.yaml: inject network latency
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: frontend
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "25"
  direction: to
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        app: backend
    mode: all
  duration: "120s"
---
# network-loss.yaml: inject packet loss
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss
  namespace: default
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: api-server
  loss:
    loss: "30"
    correlation: "25"
  duration: "60s"
---
# network-partition.yaml: network partition between services
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
  namespace: default
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: service-a
  direction: both
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        app: service-b
    mode: all
  duration: "60s"
---
# network-bandwidth.yaml: limit bandwidth
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bandwidth-limit
  namespace: default
spec:
  action: bandwidth
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: data-service
  bandwidth:
    rate: "1mbps"
    limit: 20971520
    buffer: 10000
  duration: "120s"
Core Commands — Stress Chaos
# cpu-stress.yaml: CPU stress testing
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
  namespace: default
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: compute-service
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: "120s"
---
# memory-stress.yaml: memory stress testing
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
  namespace: default
spec:
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: memory-service
  stressors:
    memory:
      workers: 1
      size: "512MB"
  duration: "60s"
Core Commands — I/O Chaos
# io-latency.yaml: inject filesystem I/O latency
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency
  namespace: default
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: database
  volumePath: /var/lib/data        # must be the volume's mount point in the container
  path: "/var/lib/data/**/*"       # glob on the full file path; omit to affect all files
  delay: "100ms"
  percent: 50
  duration: "60s"
---
# io-fault.yaml: inject I/O errors
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-fault
  namespace: default
spec:
  action: fault
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: database
  volumePath: /var/lib/data
  path: "/var/lib/data/**/*.log"   # glob on the full file path
  errno: 5                         # EIO (input/output error)
  percent: 100
  duration: "30s"
Configuration
RBAC Configuration
# chaos-rbac.yaml: restrict chaos to specific namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-engineer
  namespace: staging
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-engineer-binding
  namespace: staging
subjects:
  - kind: User
    name: sre-engineer@company.com
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: chaos-engineer
  apiGroup: rbac.authorization.k8s.io
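With `dashboard.securityMode=true` the dashboard asks for a token, so it can be convenient to bind the same Role to a ServiceAccount. The manifest below is a sketch under that assumption; `chaos-engineer-sa` and `chaos-engineer-sa-binding` are hypothetical names.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-engineer-sa            # hypothetical account name
  namespace: staging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-engineer-sa-binding    # hypothetical binding name
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: chaos-engineer-sa
    namespace: staging
roleRef:
  kind: Role
  name: chaos-engineer               # the Role defined above
  apiGroup: rbac.authorization.k8s.io
```

On Kubernetes 1.24 or later, `kubectl create token chaos-engineer-sa -n staging` should print a short-lived token for this account. The dashboard may also need read access to pods and namespaces to preview selector matches, which the Role above does not grant.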
Dashboard Configuration
# dashboard-values.yaml for Helm
dashboard:
  securityMode: true
  service:
    type: ClusterIP
  ingress:
    enabled: true
    hosts:
      - name: chaos.internal.company.com
    paths:
      - /
    annotations:
      nginx.ingress.kubernetes.io/auth-type: basic
      nginx.ingress.kubernetes.io/auth-realm: "Chaos Mesh"
Advanced Usage
Chaos Workflows
# workflow.yaml: multi-step chaos scenario
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: resilience-test-workflow
  namespace: default
spec:
  entry: the-entry
  templates:
    - name: the-entry
      templateType: Serial
      deadline: "600s"
      children:
        - network-delay-step
        - pod-stress-step
        - recovery-validation
    - name: network-delay-step
      templateType: NetworkChaos
      deadline: "120s"
      networkChaos:
        action: delay
        mode: all
        selector:
          labelSelectors:
            app: api-server
        delay:
          latency: "300ms"
        duration: "90s"
    - name: pod-stress-step
      templateType: StressChaos
      deadline: "120s"
      stressChaos:
        mode: one
        selector:
          labelSelectors:
            app: api-server
        stressors:
          cpu:
            workers: 2
            load: 90
        duration: "90s"
    - name: recovery-validation
      templateType: Suspend
      deadline: "60s"
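Steps can also run at the same time. The fragment below is a sketch that assumes the `Parallel` template type and reuses the step names from the workflow above to apply network delay and CPU stress together. Only the entry template changes; the child templates stay as defined in the serial example, and the `300s` deadline is an arbitrary illustrative value.

```yaml
# Fragment of spec.templates: a Parallel entry in place of the Serial one
- name: the-entry
  templateType: Parallel       # children start together instead of one after another
  deadline: "300s"             # illustrative upper bound for the whole group
  children:
    - network-delay-step
    - pod-stress-step
```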
Time Chaos (Clock Skew)
# time-chaos.yaml: inject clock skew
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: time-skew
  namespace: default
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: scheduler-service
  timeOffset: "-2h"
  clockIds:
    - CLOCK_REALTIME
  containerNames:
    - scheduler
  duration: "60s"
Schedule Chaos (Recurring)
# scheduled-chaos.yaml: run chaos on a schedule
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: daily-pod-kill
  namespace: default
spec:
  schedule: "0 10 * * 1-5"       # cron: 10:00, Monday through Friday
  startingDeadlineSeconds: 60
  concurrencyPolicy: Forbid      # skip a run if the previous one is still active
  historyLimit: 5
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        app: web-server
    duration: "30s"
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Chaos daemon not running | DaemonSet scheduling failed | Check node taints/tolerations on chaos-daemon |
| Network chaos has no effect | Wrong container runtime socket | Set correct chaosDaemon.socketPath in Helm values |
| I/O chaos not injecting | Volume not mounted with fuse | Verify volumePath matches container mount path |
| Permission denied on experiment | RBAC not configured | Create Role/RoleBinding for the user in target namespace |
| Time chaos not working | Clock IDs not available | Ensure kernel supports the specified clock ID |
| Experiment stays in Running | Duration not set | Always set duration to auto-cleanup experiments |
| Dashboard not accessible | Service type mismatch | Switch to LoadBalancer or configure Ingress |
| Chaos not targeting correct pods | Label selector mismatch | Verify labels with kubectl get pods --show-labels |
# Debug chaos daemon
kubectl logs -n chaos-mesh -l app.kubernetes.io/component=chaos-daemon --tail=50
# Check experiment status
kubectl describe podchaos <name>
kubectl get events --field-selector involvedObject.name=<chaos-name>
# List all chaos resources
kubectl get podchaos,networkchaos,stresschaos,iochaos,timechaos -A
# Clean up all chaos experiments
kubectl delete podchaos,networkchaos,stresschaos,iochaos,timechaos --all -A
# Print recent logs from all Chaos Mesh components
chaosctl logs -t 50
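For the first row of the troubleshooting table: if chaos-daemon pods are missing on tainted nodes, a toleration can be added through Helm values. This sketch assumes the chart exposes `chaosDaemon.tolerations` and that the taint in question is the standard control-plane taint; substitute whatever taints `kubectl describe node` reports.

```yaml
# Helm values fragment (assumed key: chaosDaemon.tolerations)
chaosDaemon:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
```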