Gremlin Cheat Sheet
Overview
Gremlin is a commercial chaos engineering platform that lets teams test system resilience by injecting controlled failures into infrastructure, applications, and networks. In contrast to self-hosted open-source tools, it is delivered as a managed SaaS product with a web UI and safety controls such as automatic halt conditions, so teams can start with simple single-target experiments and work up to complex multi-target scenarios.
The platform supports chaos attacks across Kubernetes, cloud VMs, containers, bare metal servers, and serverless environments. Gremlin organizes experiments into three categories: resource attacks (CPU, memory, disk, I/O), network attacks (latency, packet loss, DNS, blackhole), and state attacks (process kill, time travel, shutdown). Its Scenarios feature allows teams to build reusable multi-step attack sequences that model real-world failure modes, while Status Checks validate system health during experiments.
Installation
Gremlin Agent on Linux
# Add Gremlin APT repository (Debian/Ubuntu)
echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 05AE748A3A9EC419
sudo apt-get update
sudo apt-get install -y gremlin gremlind
# Add Gremlin YUM repository (RHEL/CentOS)
sudo rpm --import https://rpm.gremlin.com/gpg-key
cat <<EOF | sudo tee /etc/yum.repos.d/gremlin.repo
[gremlin]
name=Gremlin
baseurl=https://rpm.gremlin.com/
gpgcheck=1
gpgkey=https://rpm.gremlin.com/gpg-key
EOF
sudo yum install -y gremlin gremlind
# Initialize Gremlin agent with credentials
sudo gremlin init --team-id "YOUR_TEAM_ID" \
--team-secret "YOUR_TEAM_SECRET"
# Start the agent
sudo systemctl start gremlind
sudo systemctl enable gremlind
# Verify agent is connected
gremlin check auth
Kubernetes Installation
# Add Gremlin Helm repo
helm repo add gremlin https://helm.gremlin.com
helm repo update
# Create namespace and secret
kubectl create namespace gremlin
kubectl create secret generic gremlin-team-secret \
--namespace gremlin \
--from-literal=GREMLIN_TEAM_ID="YOUR_TEAM_ID" \
--from-literal=GREMLIN_TEAM_SECRET="YOUR_TEAM_SECRET"
# Install Gremlin agent (Kubernetes)
helm install gremlin gremlin/gremlin \
--namespace gremlin \
--set gremlin.secret.name=gremlin-team-secret \
--set gremlin.secret.type=secret \
--set gremlin.collect.processes=true \
--set gremlin.collect.dns=true
# Verify pods are running
kubectl get pods -n gremlin
Gremlin CLI
# Install the Gremlin Python API client (package name: gremlinapi)
pip install gremlinapi
# Or via npm
npm install -g gremlin-cli
# Configure API credentials
export GREMLIN_API_KEY="your-api-key"
export GREMLIN_TEAM_ID="your-team-id"
# Verify CLI access
gremlin version
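Every API example in this sheet reads $GREMLIN_API_KEY, so it can help to fail early when a credential variable is missing. The helper below is a minimal sketch; `require_env` is a hypothetical name introduced here for illustration and is not part of any Gremlin tooling.

```shell
# Hypothetical preflight helper: returns non-zero if any named variable
# is unset or empty, and reports each missing name on stderr.
# The body runs in a subshell, so its working variables do not leak.
require_env() (
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf 'missing environment variable: %s\n' "$name" >&2
      missing=1
    fi
  done
  return "$missing"
)

# Usage: stop a script before any attack is submitted.
# require_env GREMLIN_API_KEY GREMLIN_TEAM_ID || exit 1
```

Only pass trusted variable names to the helper, since it relies on `eval` for the lookup.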
Core Commands — Resource Attacks
CPU Attack
# CPU attack via CLI
gremlin attack cpu \
--length 120 \
--cores 2 \
--percent 80 \
--target-tags "service=api,env=staging"
# CPU attack via API
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
-H "Authorization: Key $GREMLIN_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"command": {
"type": "cpu",
"args": ["-c", "2", "-p", "80", "-l", "120"]
},
"target": {
"type": "Exact",
"exact": ["host-identifier"]
}
}'
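The JSON body in the API call above can be generated from shell arguments instead of being edited by hand. This is a minimal sketch that reproduces the same payload shape shown above; `build_cpu_attack` is a hypothetical helper name, and the sketch assumes the host identifier contains no characters that need JSON escaping.

```shell
# Hypothetical helper: prints the JSON body for a CPU attack on one exact host.
# Arguments: cores, percent, length in seconds, host identifier.
build_cpu_attack() (
  cores=$1; percent=$2; length=$3; host=$4
  printf '{"command":{"type":"cpu","args":["-c","%s","-p","%s","-l","%s"]},' \
    "$cores" "$percent" "$length"
  printf '"target":{"type":"Exact","exact":["%s"]}}\n' "$host"
)

# Usage (assumes GREMLIN_API_KEY is exported as in the examples above):
# build_cpu_attack 2 80 120 host-identifier |
#   curl -X POST "https://api.gremlin.com/v1/attacks/new" \
#     -H "Authorization: Key $GREMLIN_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d @-
```

Piping the body to `curl -d @-` keeps the payload out of the command line, which makes it easier to print and inspect before sending.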
Memory Attack
# Memory attack — consume 512MB for 60 seconds
gremlin attack memory \
--length 60 \
--amount 512 \
--unit MB \
--target-tags "service=cache,env=staging"
# Memory attack via API
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
-H "Authorization: Key $GREMLIN_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"command": {
"type": "memory",
"args": ["-m", "512", "-l", "60"]
},
"target": {
"type": "Random",
"tags": {"service": "cache", "env": "staging"},
"percent": 50
}
}'
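The CLI examples pass tags as a comma-separated string, while the API body above expects a JSON object under "tags". The sketch below converts one form to the other; `tags_to_json` is a hypothetical helper, and it assumes keys and values contain no commas, equals signs, quotes, or glob characters.

```shell
# Hypothetical helper: convert "k1=v1,k2=v2" into {"k1":"v1","k2":"v2"}.
# Runs in a subshell so the IFS change stays local to the function.
tags_to_json() (
  out=""
  IFS=,
  for pair in $1; do
    key=${pair%%=*}      # text before the first "="
    value=${pair#*=}     # text after the first "="
    out="${out:+$out,}\"$key\":\"$value\""
  done
  printf '{%s}\n' "$out"
)

# Usage:
# tags_to_json "service=cache,env=staging"
# prints {"service":"cache","env":"staging"}
```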
Disk and I/O Attacks
# Disk space attack — fill 90% of disk
gremlin attack disk \
--length 60 \
--dir /tmp \
--workers 1 \
--block-size 4 \
--percent 90 \
--target-tags "role=worker"
# I/O attack — high IOPS
gremlin attack io \
--length 120 \
--dir /tmp \
--workers 4 \
--mode rw \
--block-size 64 \
--block-count 1000 \
--target-tags "service=database"
Core Commands — Network Attacks
# Network latency attack
gremlin attack latency \
--length 120 \
--ms 200 \
--jitter 50 \
--target-tags "service=frontend,env=staging"
# Packet loss attack
gremlin attack packet-loss \
--length 60 \
--percent 30 \
--corrupt 5 \
--target-tags "service=api"
# DNS blackhole — block DNS resolution
gremlin attack dns \
--length 60 \
--target-tags "service=api" \
-- -d "external-api.example.com"
# Blackhole: drop traffic to 10.0.1.50 on every port except 5432 (a leading ^ excludes the port)
gremlin attack blackhole \
--length 60 \
--target-tags "service=api" \
-- -h "10.0.1.50" -p "^5432"
# Bandwidth limit
gremlin attack bandwidth \
--length 120 \
--amount 1 \
--unit mbps \
--target-tags "service=data-sync"
Core Commands — State Attacks
# Process kill attack
gremlin attack process-killer \
--length 60 \
--interval 15 \
--target-tags "service=worker" \
-- -p "worker-process" --exact
# Shutdown attack (graceful)
gremlin attack shutdown \
--delay 30 \
--reboot \
--target-tags "role=canary,env=staging"
# Time travel attack (clock skew)
gremlin attack time-travel \
--length 60 \
--offset 3600 \
--target-tags "service=scheduler"
Core Commands — Kubernetes Attacks
# Target Kubernetes pods
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
-H "Authorization: Key $GREMLIN_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"command": {
"type": "cpu",
"args": ["-c", "1", "-p", "90", "-l", "60"]
},
"target": {
"type": "Exact",
"strategy": {
"type": "Kubernetes",
"k8sObjects": [
{
"cluster": "production",
"namespace": "default",
"kind": "Deployment",
"name": "api-server",
"containerSelection": {
"selectionType": "ANY"
}
}
]
}
}
}'
# Run the process killer against a random 33% of a Deployment's pods
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
-H "Authorization: Key $GREMLIN_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"command": {
"type": "process_killer",
"args": ["-l", "30", "--kill-children"]
},
"target": {
"type": "Random",
"strategy": {
"type": "Kubernetes",
"k8sObjects": [
{
"cluster": "production",
"namespace": "default",
"kind": "Deployment",
"name": "web-frontend"
}
]
},
"percent": 33
}
}'
Configuration
Gremlin Agent Configuration
# /etc/gremlin/config.yaml
teamId: "your-team-id"
teamSecret: "your-team-secret"
# Or use certificate-based auth instead of teamSecret
# (kept commented out so the file does not define both auth methods at once)
# teamCertificate: "/etc/gremlin/certs/team.cert"
# teamPrivateKey: "/etc/gremlin/certs/team.key"
# Agent settings
identifier: "custom-host-name"
tags:
service: api-server
environment: production
region: us-east-1
tier: critical
# Collection settings
collectProcesses: true
collectDns: true
collectServices: true
# Proxy settings (if behind corporate proxy)
httpsProxy: "http://proxy.internal:3128"
Helm Values for Kubernetes
# gremlin-values.yaml
gremlin:
secret:
name: gremlin-team-secret
type: secret
collect:
processes: true
dns: true
hostPID: true
hostNetwork: false
client:
tags:
- cluster=production
- region=us-east-1
image:
repository: gremlin/gremlin
tag: latest
pullPolicy: IfNotPresent
resources:
limits:
cpu: 200m
memory: 256Mi
requests:
cpu: 100m
memory: 128Mi
tolerations:
- key: "node-role.kubernetes.io/master"
effect: "NoSchedule"
Advanced Usage
Scenarios (Multi-Step Attacks)
# Create a scenario via API
curl -X POST "https://api.gremlin.com/v1/scenarios" \
-H "Authorization: Key $GREMLIN_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Cascading Failure Test",
"description": "Simulate cascading failure from database to API",
"hypothesis": "Circuit breakers prevent cascade",
"steps": [
{
"attacks": [
{
"command": {"type": "latency", "args": ["-m", "500", "-l", "120"]},
"target": {
"type": "Exact",
"strategy": {"type": "Kubernetes", "k8sObjects": [
{"cluster": "prod", "namespace": "default", "kind": "Deployment", "name": "database-proxy"}
]}
}
}
],
"delay": 30
},
{
"attacks": [
{
"command": {"type": "cpu", "args": ["-c", "2", "-p", "95", "-l", "60"]},
"target": {
"type": "Random",
"tags": {"service": "api-server"},
"percent": 50
}
}
],
"delay": 0
}
]
}'
# Run a scenario
curl -X POST "https://api.gremlin.com/v1/scenarios/SCENARIO_ID/runs" \
-H "Authorization: Key $GREMLIN_API_KEY"
# List scenario runs
curl -s "https://api.gremlin.com/v1/scenarios/SCENARIO_ID/runs" \
-H "Authorization: Key $GREMLIN_API_KEY" | jq '.[0] | {guid, state, startTime}'
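Before running a scenario it can be useful to estimate how long it will hold its targets. The sketch below adds up attack lengths and step delays. It assumes that steps run one after another, that each delay is extra waiting time, and that both values are in seconds; `scenario_duration` is a hypothetical helper, not a Gremlin command.

```shell
# Hypothetical helper: sum "length:delay" pairs, one pair per scenario step.
# Assumption: steps run sequentially and every value is in seconds.
scenario_duration() (
  total=0
  for step in "$@"; do
    length=${step%%:*}   # attack length for this step
    delay=${step#*:}     # delay configured on this step
    total=$((total + length + delay))
  done
  printf '%s\n' "$total"
)

# The scenario above: 120 s of latency with a 30 s delay, then 60 s of CPU load.
# scenario_duration 120:30 60:0
# prints 210
```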
Status Checks (Safety Validation)
# Create a status check (halt condition)
curl -X POST "https://api.gremlin.com/v1/status-checks" \
-H "Authorization: Key $GREMLIN_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "API Health Check",
"description": "Verify API returns 200",
"evaluationPeriodSec": 30,
"checks": [
{
"source": "http",
"httpCheck": {
"url": "https://api.example.com/health",
"method": "GET",
"expectedStatusCode": 200,
"timeoutMs": 5000
}
}
]
}'
# Attach status check to attack as halt condition
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
-H "Authorization: Key $GREMLIN_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"command": {"type": "cpu", "args": ["-c", "2", "-p", "90", "-l", "120"]},
"target": {"type": "Exact", "exact": ["host-id"]},
"statusCheck": {"id": "status-check-id"}
}'
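A status check is only useful as a halt condition if the endpoint is healthy before the attack starts. The sketch below probes the URL locally with the same expectations as the status check above (HTTP 200 within 5 seconds). `probe_health` is a hypothetical helper that runs on your machine and does not call the Gremlin API.

```shell
# Hypothetical pre-check: succeed only if URL answers with the expected
# HTTP status code (default 200) within 5 seconds.
probe_health() (
  url=$1
  expected=${2:-200}
  # -w prints only the status code; the response body is discarded.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
  [ "$code" = "$expected" ]
)

# Usage: refuse to start an experiment against an already unhealthy service.
# probe_health "https://api.example.com/health" || { echo "unhealthy, aborting" >&2; exit 1; }
```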
Halt and Manage Attacks
# List all active attacks
curl -s "https://api.gremlin.com/v1/attacks/active" \
-H "Authorization: Key $GREMLIN_API_KEY" | jq '.[] | {guid, type: .command.type, state}'
# Halt a specific attack
curl -X DELETE "https://api.gremlin.com/v1/attacks/ATTACK_GUID" \
-H "Authorization: Key $GREMLIN_API_KEY"
# Halt ALL active attacks (emergency)
curl -X POST "https://api.gremlin.com/v1/attacks/halt" \
-H "Authorization: Key $GREMLIN_API_KEY"
# Get attack details and results
curl -s "https://api.gremlin.com/v1/attacks/ATTACK_GUID" \
-H "Authorization: Key $GREMLIN_API_KEY" | jq '{type: .command.type, state, startTime, endTime}'
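Scripts that launch an attack often need to wait until it has finished before collecting results. The sketch below polls any command that prints a state and stops once the state matches a pattern. `wait_for_state` and `fetch_state` are hypothetical names, and DONE_STATE and HALTED_STATE are placeholders: substitute the terminal state names that the attack details call above actually returns.

```shell
# Hypothetical polling helper.
# Arguments: extended regex of terminal states, max tries, seconds between
# tries, then the command (with its arguments) that prints the current state.
wait_for_state() (
  pattern=$1; tries=$2; interval=$3
  shift 3
  i=0
  while [ "$i" -lt "$tries" ]; do
    state=$("$@")
    # -x requires the whole line to match one of the alternatives.
    if printf '%s\n' "$state" | grep -Eqx -e "$pattern"; then
      printf '%s\n' "$state"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
)

# Usage (placeholder state names; assumes jq is installed):
# fetch_state() {
#   curl -s "https://api.gremlin.com/v1/attacks/$1" \
#     -H "Authorization: Key $GREMLIN_API_KEY" | jq -r '.state'
# }
# wait_for_state 'DONE_STATE|HALTED_STATE' 30 10 fetch_state ATTACK_GUID
```

Taking the fetch command as an argument keeps the loop independent of the API, so the same helper can poll scenario runs or be exercised with a stub.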
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Agent not registering | Invalid team credentials | Verify team ID and secret with gremlin check auth |
| Attack not reaching target | Agent offline or tags mismatch | Check agent status in dashboard; verify host tags |
| Network attacks no effect | Agent lacks NET_ADMIN capability | Ensure agent runs with --cap-add=NET_ADMIN |
| Kubernetes pods not targeted | Agent DaemonSet not on target nodes | Check tolerations and node selectors on DaemonSet |
| Attack auto-halted | Status check failed | Review status check logs; adjust thresholds |
| Permission denied | Insufficient API key scope | Regenerate API key with required permissions |
| Agent high memory usage | Process collection enabled on busy host | Disable collectProcesses or increase resource limits |
| Scenario stuck | One step’s attack target unreachable | Check each step’s target availability; halt and retry |
# Check agent status
sudo systemctl status gremlind
sudo journalctl -u gremlind -n 50
# Verify agent connectivity
gremlin check auth
gremlin check network
# List registered clients
curl -s "https://api.gremlin.com/v1/clients" \
-H "Authorization: Key $GREMLIN_API_KEY" | jq '.[] | {id: .identifier, tags, online: .active}'
# View attack history
curl -s "https://api.gremlin.com/v1/attacks?pageSize=10" \
-H "Authorization: Key $GREMLIN_API_KEY" | jq '.[] | {guid, type: .command.type, state}'