Gremlin Cheat Sheet

Overview

Gremlin is a commercial chaos engineering platform for testing system resilience by injecting controlled failures into infrastructure, applications, and networks. Unlike open-source alternatives, it is delivered as a managed SaaS product with a web UI and an API, and it includes safety controls such as automatic halt conditions. Teams typically start with simple single-target experiments and work up to multi-target scenarios.

The platform supports chaos attacks across Kubernetes, cloud VMs, containers, bare metal servers, and serverless environments. Gremlin organizes experiments into three categories: resource attacks (CPU, memory, disk, I/O), network attacks (latency, packet loss, DNS, blackhole), and state attacks (process kill, time travel, shutdown). Its Scenarios feature allows teams to build reusable multi-step attack sequences that model real-world failure modes, while Status Checks validate system health during experiments.

Installation

Gremlin Agent on Linux

# Add Gremlin APT repository (Debian/Ubuntu)
echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 05AE748A3A9EC419
sudo apt-get update
sudo apt-get install -y gremlin gremlind

# Add Gremlin YUM repository (RHEL/CentOS)
sudo rpm --import https://rpm.gremlin.com/gpg-key
cat <<EOF | sudo tee /etc/yum.repos.d/gremlin.repo
[gremlin]
name=Gremlin
baseurl=https://rpm.gremlin.com/
gpgcheck=1
gpgkey=https://rpm.gremlin.com/gpg-key
EOF
sudo yum install -y gremlin gremlind

# Initialize Gremlin agent with credentials
sudo gremlin init --team-id "YOUR_TEAM_ID" \
  --team-secret "YOUR_TEAM_SECRET"

# Start the agent
sudo systemctl start gremlind
sudo systemctl enable gremlind

# Verify agent is connected
gremlin check auth

Kubernetes Installation

# Add Gremlin Helm repo
helm repo add gremlin https://helm.gremlin.com
helm repo update

# Create namespace and secret
kubectl create namespace gremlin
kubectl create secret generic gremlin-team-secret \
  --namespace gremlin \
  --from-literal=GREMLIN_TEAM_ID="YOUR_TEAM_ID" \
  --from-literal=GREMLIN_TEAM_SECRET="YOUR_TEAM_SECRET"

# Install Gremlin agent (Kubernetes)
helm install gremlin gremlin/gremlin \
  --namespace gremlin \
  --set gremlin.secret.name=gremlin-team-secret \
  --set gremlin.secret.type=secret \
  --set gremlin.collect.processes=true \
  --set gremlin.collect.dns=true

# Verify pods are running
kubectl get pods -n gremlin

Gremlin CLI

# Install the Gremlin Python SDK (PyPI package: gremlinapi)
pip install gremlinapi

# Or via npm
npm install -g gremlin-cli

# Configure API credentials
export GREMLIN_API_KEY="your-api-key"
export GREMLIN_TEAM_ID="your-team-id"

# Verify CLI access
gremlin version

Core Commands — Resource Attacks

CPU Attack

# CPU attack via CLI
gremlin attack cpu \
  --length 120 \
  --cores 2 \
  --percent 80 \
  --target-tags "service=api,env=staging"

# CPU attack via API
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "command": {
      "type": "cpu",
      "args": ["-c", "2", "-p", "80", "-l", "120"]
    },
    "target": {
      "type": "Exact",
      "exact": ["host-identifier"]
    }
  }'
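
The request body above can be generated from shell variables so that the same call is reusable across hosts. The following is a minimal sketch, assuming the payload shape shown in this sheet. The name cpu_attack_payload is a hypothetical helper (not part of Gremlin's tooling); it only prints JSON, sends nothing, and assumes the host identifier contains no quotes or backslashes.

```shell
# cpu_attack_payload CORES PERCENT LENGTH HOST
# Prints a JSON body with the same shape as the CPU example in this sheet.
# Hypothetical helper; it does not contact the Gremlin API.
cpu_attack_payload() (
  cores=$1 percent=$2 length=$3 host=$4
  # Reject empty or non-numeric values before they reach the API.
  for n in "$cores" "$percent" "$length"; do
    case $n in
      ''|*[!0-9]*) echo "not a number: $n" >&2; exit 1 ;;
    esac
  done
  printf '{"command":{"type":"cpu","args":["-c","%s","-p","%s","-l","%s"]},' \
    "$cores" "$percent" "$length"
  printf '"target":{"type":"Exact","exact":["%s"]}}\n' "$host"
)
```

One possible use is to pass -d "$(cpu_attack_payload 2 80 120 host-identifier)" to the curl command above in place of the inline JSON.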

Memory Attack

# Memory attack — consume 512MB for 60 seconds
gremlin attack memory \
  --length 60 \
  --amount 512 \
  --unit MB \
  --target-tags "service=cache,env=staging"

# Memory attack via API
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "command": {
      "type": "memory",
      "args": ["-m", "512", "-l", "60"]
    },
    "target": {
      "type": "Random",
      "tags": {"service": "cache", "env": "staging"},
      "percent": 50
    }
  }'
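
The CLI examples in this sheet pass tags as a comma-separated list (service=cache,env=staging), while the API body expects a JSON object under "tags". A small conversion sketch follows. The name tags_to_json is hypothetical, and the helper assumes every item has the key=value form with no quotes or commas inside keys or values.

```shell
# tags_to_json "k1=v1,k2=v2"
# Converts comma-separated key=value pairs into a JSON object string.
# Hypothetical helper; performs no JSON escaping.
tags_to_json() (
  set -f            # disable globbing while the list is split
  out=
  IFS=,
  for pair in $1; do
    key=${pair%%=*}     # text before the first "="
    value=${pair#*=}    # text after the first "="
    out="${out:+$out,}\"$key\":\"$value\""
  done
  printf '{%s}\n' "$out"
)
```

For example, tags_to_json "service=cache,env=staging" prints {"service":"cache","env":"staging"}, which matches the "tags" value in the request above.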

Disk and I/O Attacks

# Disk space attack — fill 90% of disk
gremlin attack disk \
  --length 60 \
  --dir /tmp \
  --workers 1 \
  --block-size 4 \
  --percent 90 \
  --target-tags "role=worker"

# I/O attack — high IOPS
gremlin attack io \
  --length 120 \
  --dir /tmp \
  --workers 4 \
  --mode rw \
  --block-size 64 \
  --block-count 1000 \
  --target-tags "service=database"

Core Commands — Network Attacks

# Network latency attack
gremlin attack latency \
  --length 120 \
  --ms 200 \
  --jitter 50 \
  --target-tags "service=frontend,env=staging"

# Packet loss attack
gremlin attack packet_loss \
  --length 60 \
  --percent 30 \
  --corrupt 5 \
  --target-tags "service=api"

# DNS blackhole — block DNS resolution
gremlin attack dns \
  --length 60 \
  --target-tags "service=api" \
  -- -d "external-api.example.com"

# Blackhole: drop traffic to 10.0.1.50 on every port except 5432 (a leading ^ excludes)
gremlin attack blackhole \
  --length 60 \
  --target-tags "service=api" \
  -- -h "10.0.1.50" -p "^5432"

# Bandwidth limit
gremlin attack bandwidth \
  --length 120 \
  --amount 1 \
  --unit mbps \
  --target-tags "service=data-sync"
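
When choosing --length for a bandwidth limit, it helps to know how much data can pass during the attack window. The arithmetic can be sketched as below, assuming that the mbps unit means megabits per second and ignoring protocol overhead. The name max_transfer_mb is a hypothetical helper.

```shell
# max_transfer_mb LENGTH_SEC RATE_MBPS
# Upper bound, in whole megabytes, on the data that fits through a link
# capped at RATE_MBPS megabits per second for LENGTH_SEC seconds.
# Integer arithmetic: seconds * megabits-per-second / 8 bits per byte.
max_transfer_mb() {
  echo $(( $1 * $2 / 8 ))
}
```

With the values above, max_transfer_mb 120 1 prints 15, so under these assumptions any sync larger than about 15 MB stays throttled for the full attack window.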

Core Commands — State Attacks

# Process kill attack
gremlin attack process_killer \
  --length 60 \
  --interval 15 \
  --target-tags "service=worker" \
  -- -p "worker-process" --exact

# Shutdown attack (reboots the host after the delay instead of powering it off)
gremlin attack shutdown \
  --delay 30 \
  --reboot \
  --target-tags "role=canary,env=staging"

# Time travel attack (clock skew)
gremlin attack time_travel \
  --length 60 \
  --offset 3600 \
  --target-tags "service=scheduler"

Core Commands — Kubernetes Attacks

# Target Kubernetes pods
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "command": {
      "type": "cpu",
      "args": ["-c", "1", "-p", "90", "-l", "60"]
    },
    "target": {
      "type": "Exact",
      "strategy": {
        "type": "Kubernetes",
        "k8sObjects": [
          {
            "cluster": "production",
            "namespace": "default",
            "kind": "Deployment",
            "name": "api-server",
            "containerSelection": {
              "selectionType": "ANY"
            }
          }
        ]
      }
    }
  }'

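# Kill processes in a random 33% of the pods in a Deployment
# (the args below name no process; add "-p", "PROCESS_NAME" as in the process kill CLI example)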
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "command": {
      "type": "process_killer",
      "args": ["-l", "30", "--kill-children"]
    },
    "target": {
      "type": "Random",
      "strategy": {
        "type": "Kubernetes",
        "k8sObjects": [
          {
            "cluster": "production",
            "namespace": "default",
            "kind": "Deployment",
            "name": "web-frontend"
          }
        ]
      },
      "percent": 33
    }
  }'
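
Both requests above differ mainly in the "target" object. A sketch of a helper that prints that object is shown below, assuming the field names used in this sheet. The name k8s_target is hypothetical; it emits an Exact target when no percentage is given and a Random target otherwise, and it omits containerSelection.

```shell
# k8s_target CLUSTER NAMESPACE KIND NAME [PERCENT]
# Prints a "target" JSON object for a single Kubernetes object.
# Hypothetical helper; performs no JSON escaping of its arguments.
k8s_target() (
  obj=$(printf '{"cluster":"%s","namespace":"%s","kind":"%s","name":"%s"}' \
    "$1" "$2" "$3" "$4")
  strategy="\"strategy\":{\"type\":\"Kubernetes\",\"k8sObjects\":[$obj]}"
  if [ -n "${5:-}" ]; then
    # A percentage was supplied: target a random share of matching pods.
    printf '{"type":"Random",%s,"percent":%s}\n' "$strategy" "$5"
  else
    printf '{"type":"Exact",%s}\n' "$strategy"
  fi
)
```

For example, k8s_target production default Deployment web-frontend 33 prints the same target as the second request above, without whitespace.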

Configuration

Gremlin Agent Configuration

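# /etc/gremlin/config.yaml
# Use ONE authentication method. Defining teamId twice is a duplicate YAML key.

# Option 1: secret-based auth
teamId: "your-team-id"
teamSecret: "your-team-secret"

# Option 2: certificate-based auth (comment out teamSecret above, then uncomment)
# teamCertificate: "/etc/gremlin/certs/team.cert"
# teamPrivateKey: "/etc/gremlin/certs/team.key"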

# Agent settings
identifier: "custom-host-name"
tags:
  service: api-server
  environment: production
  region: us-east-1
  tier: critical

# Collection settings
collectProcesses: true
collectDns: true
collectServices: true

# Proxy settings (if behind corporate proxy)
httpsProxy: "http://proxy.internal:3128"
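
When switching between secret-based and certificate-based auth, it is easy to leave a top-level key defined twice, and YAML parsers differ in how they treat duplicates. A quick check can be sketched with awk. The name find_duplicate_keys is hypothetical, and the check looks only at unindented, uncommented "key:" lines, so it is not a YAML validator.

```shell
# find_duplicate_keys FILE
# Prints each top-level key that appears more than once in FILE.
# Returns 1 if any duplicate was found, 0 otherwise.
find_duplicate_keys() {
  awk '
    BEGIN { dup = 0 }
    /^[A-Za-z_][A-Za-z0-9_]*:/ {
      key = $0
      sub(/:.*/, "", key)              # keep only the text before the colon
      if (++seen[key] == 2) {          # report a key once, on its second use
        print key
        dup = 1
      }
    }
    END { exit dup }
  ' "$1"
}
```

A possible use is find_duplicate_keys /etc/gremlin/config.yaml before restarting the agent.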

Helm Values for Kubernetes

# gremlin-values.yaml
gremlin:
  secret:
    name: gremlin-team-secret
    type: secret
  collect:
    processes: true
    dns: true
  hostPID: true
  hostNetwork: false
  client:
    tags:
      - cluster=production
      - region=us-east-1
  
image:
  repository: gremlin/gremlin
  tag: latest
  pullPolicy: IfNotPresent

resources:
  limits:
    cpu: 200m
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 128Mi

tolerations:
  - key: "node-role.kubernetes.io/master"
    effect: "NoSchedule"

Advanced Usage

Scenarios (Multi-Step Attacks)

# Create a scenario via API
curl -X POST "https://api.gremlin.com/v1/scenarios" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Cascading Failure Test",
    "description": "Simulate cascading failure from database to API",
    "hypothesis": "Circuit breakers prevent cascade",
    "steps": [
      {
        "attacks": [
          {
            "command": {"type": "latency", "args": ["-m", "500", "-l", "120"]},
            "target": {
              "type": "Exact",
              "strategy": {"type": "Kubernetes", "k8sObjects": [
                {"cluster": "prod", "namespace": "default", "kind": "Deployment", "name": "database-proxy"}
              ]}
            }
          }
        ],
        "delay": 30
      },
      {
        "attacks": [
          {
            "command": {"type": "cpu", "args": ["-c", "2", "-p", "95", "-l", "60"]},
            "target": {
              "type": "Random",
              "tags": {"service": "api-server"},
              "percent": 50
            }
          }
        ],
        "delay": 0
      }
    ]
  }'

# Run a scenario
curl -X POST "https://api.gremlin.com/v1/scenarios/SCENARIO_ID/runs" \
  -H "Authorization: Key $GREMLIN_API_KEY"

# List scenario runs
curl -s "https://api.gremlin.com/v1/scenarios/SCENARIO_ID/runs" \
  -H "Authorization: Key $GREMLIN_API_KEY" | jq '.[0] | {guid, state, startTime}'
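
In scripts it is often useful to block until a run reaches a given state. A polling loop can be sketched as follows. The name wait_for_state is hypothetical, the command that fetches the state is supplied by the caller, and the state strings are whatever that command prints (the actual values returned by the API are not listed in this sheet).

```shell
# wait_for_state FETCH_CMD WANTED [MAX_TRIES] [SLEEP_SEC]
# Runs FETCH_CMD repeatedly until it prints WANTED.
# Exit status: 0 = state reached, 1 = gave up, 2 = FETCH_CMD failed.
wait_for_state() (
  fetch=$1 wanted=$2 tries=${3:-30} pause=${4:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    state=$($fetch) || exit 2      # FETCH_CMD is word-split, so it may carry arguments
    [ "$state" = "$wanted" ] && exit 0
    i=$((i + 1))
    sleep "$pause"
  done
  exit 1
)
```

FETCH_CMD could be a shell function that wraps the "List scenario runs" request above and ends in jq -r '.[0].state'.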

Status Checks (Safety Validation)

# Create a status check (halt condition)
curl -X POST "https://api.gremlin.com/v1/status-checks" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "API Health Check",
    "description": "Verify API returns 200",
    "evaluationPeriodSec": 30,
    "checks": [
      {
        "source": "http",
        "httpCheck": {
          "url": "https://api.example.com/health",
          "method": "GET",
          "expectedStatusCode": 200,
          "timeoutMs": 5000
        }
      }
    ]
  }'

# Attach status check to attack as halt condition
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "command": {"type": "cpu", "args": ["-c", "2", "-p", "90", "-l", "120"]},
    "target": {"type": "Exact", "exact": ["host-id"]},
    "statusCheck": {"id": "status-check-id"}
  }'

Halt and Manage Attacks

# List all active attacks
curl -s "https://api.gremlin.com/v1/attacks/active" \
  -H "Authorization: Key $GREMLIN_API_KEY" | jq '.[] | {guid, type: .command.type, state}'

# Halt a specific attack
curl -X DELETE "https://api.gremlin.com/v1/attacks/ATTACK_GUID" \
  -H "Authorization: Key $GREMLIN_API_KEY"

# Halt ALL active attacks (emergency)
curl -X POST "https://api.gremlin.com/v1/attacks/halt" \
  -H "Authorization: Key $GREMLIN_API_KEY"

# Get attack details and results
curl -s "https://api.gremlin.com/v1/attacks/ATTACK_GUID" \
  -H "Authorization: Key $GREMLIN_API_KEY" | jq '{type: .command.type, state, startTime, endTime}'
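
When an experiment is driven from a script, the halt request can be tied to the script's own failure path. The wrapper below is a sketch of that idea. The name run_with_halt is hypothetical, and the halt command is supplied by the caller, for example a function wrapping the halt-all request above.

```shell
# run_with_halt HALT_CMD COMMAND [ARGS...]
# Runs COMMAND. If COMMAND fails, or the shell receives INT or TERM,
# HALT_CMD is invoked before exiting. COMMAND's exit status is preserved.
run_with_halt() (
  halt=$1
  shift
  trap '$halt; exit 130' INT TERM
  "$@" || { rc=$?; $halt; exit "$rc"; }
)
```

A typical call would look like run_with_halt halt_all ./load-test.sh, where halt_all and load-test.sh are placeholders for your own halt function and workload.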

Troubleshooting

Issue	Cause	Solution
Agent not registering	Invalid team credentials	Verify team ID and secret with `gremlin check auth`
Attack not reaching target	Agent offline or tags mismatch	Check agent status in dashboard; verify host tags
Network attacks have no effect	Agent lacks NET_ADMIN capability	For a containerized agent, ensure it runs with `--cap-add=NET_ADMIN`
Kubernetes pods not targeted	Agent DaemonSet not on target nodes	Check tolerations and node selectors on DaemonSet
Attack auto-halted	Status check failed	Review status check logs; adjust thresholds
Permission denied	Insufficient API key scope	Regenerate API key with required permissions
Agent high memory usage	Process collection enabled on busy host	Disable `collectProcesses` or increase resource limits
Scenario stuck	One step’s attack target unreachable	Check each step’s target availability; halt and retry

# Check agent status
sudo systemctl status gremlind
sudo journalctl -u gremlind -n 50

# Verify agent connectivity
gremlin check auth
gremlin check network

# List registered clients
curl -s "https://api.gremlin.com/v1/clients" \
  -H "Authorization: Key $GREMLIN_API_KEY" | jq '.[] | {id: .identifier, tags, online: .active}'

# View attack history
curl -s "https://api.gremlin.com/v1/attacks?pageSize=10" \
  -H "Authorization: Key $GREMLIN_API_KEY" | jq '.[] | {guid, type: .command.type, state}'
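
For the "tags mismatch" row in the table above, the tags reported for a client can be compared with the tags an attack asks for. The following is a sketch under the assumption that a host must carry every requested key=value pair to be selected. The name missing_tags is hypothetical, and both arguments use the comma-separated form from this sheet.

```shell
# missing_tags AGENT_TAGS WANTED_TAGS
# Prints every key=value pair from WANTED_TAGS that is absent from AGENT_TAGS,
# one per line. Prints nothing when all wanted pairs are present.
missing_tags() (
  set -f            # disable globbing while the list is split
  IFS=,
  for want in $2; do
    case ",$1," in
      *",$want,"*) ;;                     # pair is present on the agent
      *) printf '%s\n' "$want" ;;
    esac
  done
)
```

For example, missing_tags "service=api,env=prod" "service=api,env=staging" prints env=staging, which points at the tag to correct on the host or in the attack.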