
Gremlin Cheat Sheet

Overview

Gremlin is an enterprise chaos engineering platform for testing system resilience by injecting controlled failures into infrastructure, applications, and networks. Unlike open-source alternatives, it is a fully managed SaaS product with a web UI, safety controls such as automatic halt conditions, and built-in guidance that takes teams from simple experiments to multi-target scenarios.

The platform supports chaos attacks across Kubernetes, cloud VMs, containers, bare metal servers, and serverless environments. Gremlin organizes experiments into three categories: resource attacks (CPU, memory, disk, I/O), network attacks (latency, packet loss, DNS, blackhole), and state attacks (process kill, time travel, shutdown). Its Scenarios feature allows teams to build reusable multi-step attack sequences that model real-world failure modes, while Status Checks validate system health during experiments.

Installation

Gremlin Agent on Linux

# Add Gremlin APT repository (Debian/Ubuntu)
echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 05AE748A3A9EC419
sudo apt-get update
sudo apt-get install -y gremlin gremlind

# Add Gremlin YUM repository (RHEL/CentOS)
sudo rpm --import https://rpm.gremlin.com/gpg-key
cat <<EOF | sudo tee /etc/yum.repos.d/gremlin.repo
[gremlin]
name=Gremlin
baseurl=https://rpm.gremlin.com/
gpgcheck=1
gpgkey=https://rpm.gremlin.com/gpg-key
EOF
sudo yum install -y gremlin gremlind

# Initialize Gremlin agent with credentials
sudo gremlin init --team-id "YOUR_TEAM_ID" \
  --team-secret "YOUR_TEAM_SECRET"

# Start the agent
sudo systemctl start gremlind
sudo systemctl enable gremlind

# Verify agent is connected
gremlin check auth
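
A common slip with the init step above is pasting the placeholder values unchanged. The following is a minimal sketch of a guard, assuming placeholders follow the YOUR_* pattern used in this cheat sheet; `is_real_credential` and `safe_gremlin_init` are hypothetical helper names, not part of the Gremlin CLI.

```shell
# Return 0 only if the value is non-empty and not an unreplaced placeholder.
# The YOUR_* pattern matches the placeholders used in this cheat sheet.
is_real_credential() {
  case "$1" in
    ""|YOUR_*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical wrapper: refuse to call `gremlin init` with placeholder values.
# Usage: safe_gremlin_init TEAM_ID TEAM_SECRET
safe_gremlin_init() {
  if ! is_real_credential "$1" || ! is_real_credential "$2"; then
    echo "refusing to init: team ID or secret is empty or a placeholder" >&2
    return 1
  fi
  sudo gremlin init --team-id "$1" --team-secret "$2"
}
```

The wrapper only checks the shape of the values; `gremlin check auth` is still what confirms that the credentials are accepted.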

Kubernetes Installation

# Add Gremlin Helm repo
helm repo add gremlin https://helm.gremlin.com
helm repo update

# Create namespace and secret
kubectl create namespace gremlin
kubectl create secret generic gremlin-team-secret \
  --namespace gremlin \
  --from-literal=GREMLIN_TEAM_ID="YOUR_TEAM_ID" \
  --from-literal=GREMLIN_TEAM_SECRET="YOUR_TEAM_SECRET"

# Install Gremlin agent (Kubernetes)
helm install gremlin gremlin/gremlin \
  --namespace gremlin \
  --set gremlin.secret.name=gremlin-team-secret \
  --set gremlin.secret.type=secret \
  --set gremlin.collect.processes=true \
  --set gremlin.collect.dns=true

# Verify pods are running
kubectl get pods -n gremlin

Gremlin CLI

# Install the Gremlin API client for Python (the gremlinapi package on PyPI)
pip install gremlinapi

# Or via npm
npm install -g gremlin-cli

# Configure API credentials
export GREMLIN_API_KEY="your-api-key"
export GREMLIN_TEAM_ID="your-team-id"

# Verify CLI access
gremlin version
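
Every API call in this sheet depends on the exported variables above, so it can help to fail early when one is missing. This is a small sketch of such a check; `require_env` is a hypothetical helper, not a Gremlin command.

```shell
# Fail early if any named environment variable is unset or empty.
# Usage: require_env NAME [NAME...]
require_env() {
  for _name in "$@"; do
    # Indirect lookup: read the variable whose name is stored in _name.
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      echo "missing required environment variable: $_name" >&2
      return 1
    fi
  done
}
```

Calling `require_env GREMLIN_API_KEY GREMLIN_TEAM_ID` at the top of a script stops it before any request is sent with an empty Authorization header.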

Core Commands — Resource Attacks

CPU Attack

# CPU attack via CLI
gremlin attack cpu \
  --length 120 \
  --cores 2 \
  --percent 80 \
  --target-tags "service=api,env=staging"

# CPU attack via API
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "command": {
      "type": "cpu",
      "args": ["-c", "2", "-p", "80", "-l", "120"]
    },
    "target": {
      "type": "Exact",
      "exact": ["host-identifier"]
    }
  }'
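
Hand-editing the JSON body makes it easy to send an out-of-range value. The sketch below builds the same payload shape as the API call above and validates the numbers first. `cpu_attack_payload` is a hypothetical helper, and it assumes the host identifier contains no quotes or backslashes, since it does no JSON escaping.

```shell
# Build the JSON body for a CPU attack on one exact host.
# Usage: cpu_attack_payload CORES PERCENT LENGTH_SECONDS HOST_ID
cpu_attack_payload() {
  # Reject empty or non-numeric values before they reach the API.
  for _n in "$1" "$2" "$3"; do
    case "$_n" in
      ""|*[!0-9]*)
        echo "cores, percent and length must be non-negative integers" >&2
        return 1 ;;
    esac
  done
  if [ "$2" -lt 1 ] || [ "$2" -gt 100 ]; then
    echo "percent must be between 1 and 100" >&2
    return 1
  fi
  if [ -z "$4" ]; then
    echo "host identifier is required" >&2
    return 1
  fi
  printf '{"command":{"type":"cpu","args":["-c","%s","-p","%s","-l","%s"]},"target":{"type":"Exact","exact":["%s"]}}\n' \
    "$1" "$2" "$3" "$4"
}
```

The output can be piped to the request with `-d @-`, which tells curl to read the body from standard input.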

Memory Attack

# Memory attack — consume 512MB for 60 seconds
gremlin attack memory \
  --length 60 \
  --amount 512 \
  --unit MB \
  --target-tags "service=cache,env=staging"

# Memory attack via API
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "command": {
      "type": "memory",
      "args": ["-m", "512", "-l", "60"]
    },
    "target": {
      "type": "Random",
      "tags": {"service": "cache", "env": "staging"},
      "percent": 50
    }
  }'
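
The CLI examples write tags as a comma-separated string, while the API target uses a JSON object. A minimal sketch of the conversion follows; `tags_to_json` is a hypothetical helper and assumes keys and values contain no commas, quotes, or backslashes.

```shell
# Convert "k1=v1,k2=v2" into {"k1":"v1","k2":"v2"}.
# Assumes keys and values contain no commas, quotes, or backslashes.
tags_to_json() {
  if [ -z "$1" ]; then
    echo "no tags given" >&2
    return 1
  fi
  printf '%s' "$1" | awk -F, '{
    out = ""
    for (i = 1; i <= NF; i++) {
      eq = index($i, "=")
      # Each tag needs a non-empty key followed by "=".
      if (eq < 2) { print "bad tag: " $i > "/dev/stderr"; exit 1 }
      key = substr($i, 1, eq - 1)
      val = substr($i, eq + 1)
      out = out (i > 1 ? "," : "") "\"" key "\":\"" val "\""
    }
    print "{" out "}"
  }'
}
```

For example, `tags_to_json "service=cache,env=staging"` prints the object used in the "tags" field of the memory attack above.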

Disk and I/O Attacks

# Disk space attack — fill 90% of disk
gremlin attack disk \
  --length 60 \
  --dir /tmp \
  --workers 1 \
  --block-size 4 \
  --percent 90 \
  --target-tags "role=worker"

# I/O attack — high IOPS
gremlin attack io \
  --length 120 \
  --dir /tmp \
  --workers 4 \
  --mode rw \
  --block-size 64 \
  --block-count 1000 \
  --target-tags "service=database"

Core Commands — Network Attacks

# Network latency attack
gremlin attack latency \
  --length 120 \
  --ms 200 \
  --jitter 50 \
  --target-tags "service=frontend,env=staging"

# Packet loss attack
gremlin attack packet-loss \
  --length 60 \
  --percent 30 \
  --corrupt 5 \
  --target-tags "service=api"

# DNS blackhole — block DNS resolution
gremlin attack dns \
  --length 60 \
  --target-tags "service=api" \
  -- -d "external-api.example.com"

# Blackhole: drop traffic to a specific host (the ^ prefix excludes port 5432 from the attack)
gremlin attack blackhole \
  --length 60 \
  --target-tags "service=api" \
  -- -h "10.0.1.50" -p "^5432"

# Bandwidth limit
gremlin attack bandwidth \
  --length 120 \
  --amount 1 \
  --unit mbps \
  --target-tags "service=data-sync"

Core Commands — State Attacks

# Process kill attack
gremlin attack process-killer \
  --length 60 \
  --interval 15 \
  --target-tags "service=worker" \
  -- -p "worker-process" --exact

# Shutdown attack (graceful)
gremlin attack shutdown \
  --delay 30 \
  --reboot \
  --target-tags "role=canary,env=staging"

# Time travel attack (clock skew)
gremlin attack time-travel \
  --length 60 \
  --offset 3600 \
  --target-tags "service=scheduler"

Core Commands — Kubernetes Attacks

# Target Kubernetes pods
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "command": {
      "type": "cpu",
      "args": ["-c", "1", "-p", "90", "-l", "60"]
    },
    "target": {
      "type": "Exact",
      "strategy": {
        "type": "Kubernetes",
        "k8sObjects": [
          {
            "cluster": "production",
            "namespace": "default",
            "kind": "Deployment",
            "name": "api-server",
            "containerSelection": {
              "selectionType": "ANY"
            }
          }
        ]
      }
    }
  }'

# Run the process killer against a random 33% of a Deployment's pods
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "command": {
      "type": "process_killer",
      "args": ["-l", "30", "--kill-children"]
    },
    "target": {
      "type": "Random",
      "strategy": {
        "type": "Kubernetes",
        "k8sObjects": [
          {
            "cluster": "production",
            "namespace": "default",
            "kind": "Deployment",
            "name": "web-frontend"
          }
        ]
      },
      "percent": 33
    }
  }'
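
Both Kubernetes examples above target a cluster named "production". One possible safeguard is a check that runs before the request is sent. This is only a sketch: `confirm_cluster` and the `ALLOW_PROD_ATTACKS` variable are hypothetical names, and the name patterns are an assumption about how production clusters are labelled.

```shell
# Hypothetical guard: block attacks on production-looking clusters unless
# ALLOW_PROD_ATTACKS=1 is set explicitly.
# Usage: confirm_cluster CLUSTER_NAME
confirm_cluster() {
  case "$1" in
    prod|prod-*|production*)
      if [ "${ALLOW_PROD_ATTACKS:-0}" != "1" ]; then
        echo "refusing to target cluster '$1' without ALLOW_PROD_ATTACKS=1" >&2
        return 1
      fi ;;
  esac
  return 0
}
```

A script would call `confirm_cluster production || exit 1` before the curl command, so an accidental run against production needs a deliberate opt-in.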

Configuration

Gremlin Agent Configuration

# /etc/gremlin/config.yaml
teamId: "your-team-id"
teamSecret: "your-team-secret"

# Or use certificate-based auth instead of teamSecret
# (uncomment these and remove teamSecret above; do not repeat the teamId key)
# teamCertificate: "/etc/gremlin/certs/team.cert"
# teamPrivateKey: "/etc/gremlin/certs/team.key"

# Agent settings
identifier: "custom-host-name"
tags:
  service: api-server
  environment: production
  region: us-east-1
  tier: critical

# Collection settings
collectProcesses: true
collectDns: true
collectServices: true

# Proxy settings (if behind corporate proxy)
httpsProxy: "http://proxy.internal:3128"

Helm Values for Kubernetes

# gremlin-values.yaml
gremlin:
  secret:
    name: gremlin-team-secret
    type: secret
  collect:
    processes: true
    dns: true
  hostPID: true
  hostNetwork: false
  client:
    tags:
      - cluster=production
      - region=us-east-1
  
image:
  repository: gremlin/gremlin
  tag: latest
  pullPolicy: IfNotPresent

resources:
  limits:
    cpu: 200m
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 128Mi

tolerations:
  - key: "node-role.kubernetes.io/master"
    effect: "NoSchedule"

Advanced Usage

Scenarios (Multi-Step Attacks)

# Create a scenario via API
curl -X POST "https://api.gremlin.com/v1/scenarios" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Cascading Failure Test",
    "description": "Simulate cascading failure from database to API",
    "hypothesis": "Circuit breakers prevent cascade",
    "steps": [
      {
        "attacks": [
          {
            "command": {"type": "latency", "args": ["-m", "500", "-l", "120"]},
            "target": {
              "type": "Exact",
              "strategy": {"type": "Kubernetes", "k8sObjects": [
                {"cluster": "prod", "namespace": "default", "kind": "Deployment", "name": "database-proxy"}
              ]}
            }
          }
        ],
        "delay": 30
      },
      {
        "attacks": [
          {
            "command": {"type": "cpu", "args": ["-c", "2", "-p", "95", "-l", "60"]},
            "target": {
              "type": "Random",
              "tags": {"service": "api-server"},
              "percent": 50
            }
          }
        ],
        "delay": 0
      }
    ]
  }'

# Run a scenario
curl -X POST "https://api.gremlin.com/v1/scenarios/SCENARIO_ID/runs" \
  -H "Authorization: Key $GREMLIN_API_KEY"

# List scenario runs
curl -s "https://api.gremlin.com/v1/scenarios/SCENARIO_ID/runs" \
  -H "Authorization: Key $GREMLIN_API_KEY" | jq '.[0] | {guid, state, startTime}'
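
After starting a run, a script usually needs to wait until it finishes. The sketch below polls any command that prints the current state. `wait_for_state` is a hypothetical helper, and the terminal state names in it are assumptions to be adjusted to what the API actually returns.

```shell
# Poll a command until it prints a terminal state or the attempts run out.
# Usage: wait_for_state MAX_TRIES INTERVAL_SECONDS COMMAND [ARGS...]
# COMMAND must print the current state on stdout. The terminal state names
# below are assumptions; adjust them to the values the API returns.
wait_for_state() {
  tries=$1
  interval=$2
  shift 2
  state=""
  while [ "$tries" -gt 0 ]; do
    # A failing fetch command aborts the wait with status 2.
    state=$("$@") || return 2
    case "$state" in
      Successful|Failed|Halted) printf '%s\n' "$state"; return 0 ;;
    esac
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] && sleep "$interval"
  done
  echo "timed out waiting for a terminal state (last: $state)" >&2
  return 1
}
```

The fetch command could be a small function that wraps the "List scenario runs" request and extracts the state of the first run with `jq -r '.[0].state'`.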

Status Checks (Safety Validation)

# Create a status check (halt condition)
curl -X POST "https://api.gremlin.com/v1/status-checks" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "API Health Check",
    "description": "Verify API returns 200",
    "evaluationPeriodSec": 30,
    "checks": [
      {
        "source": "http",
        "httpCheck": {
          "url": "https://api.example.com/health",
          "method": "GET",
          "expectedStatusCode": 200,
          "timeoutMs": 5000
        }
      }
    ]
  }'

# Attach status check to attack as halt condition
curl -X POST "https://api.gremlin.com/v1/attacks/new" \
  -H "Authorization: Key $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "command": {"type": "cpu", "args": ["-c", "2", "-p", "90", "-l", "120"]},
    "target": {"type": "Exact", "exact": ["host-id"]},
    "statusCheck": {"id": "status-check-id"}
  }'
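
Before attaching a status check, it can be useful to confirm locally that the health endpoint currently returns the expected code. The sketch below mirrors the URL, status code, and timeout from the definition above. `http_status_matches` is a hypothetical helper and is not how Gremlin evaluates status checks.

```shell
# Return 0 if URL answers with the expected HTTP status within the timeout.
# Usage: http_status_matches URL EXPECTED_CODE TIMEOUT_SECONDS
http_status_matches() {
  # -w '%{http_code}' prints only the status code; the body is discarded.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time "$3" "$1") || code=000
  [ "$code" = "$2" ]
}
```

For example, `http_status_matches "https://api.example.com/health" 200 5 || echo "endpoint unhealthy, skip the attack"` avoids starting an experiment against a service that is already failing.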

Halt and Manage Attacks

# List all active attacks
curl -s "https://api.gremlin.com/v1/attacks/active" \
  -H "Authorization: Key $GREMLIN_API_KEY" | jq '.[] | {guid, type: .command.type, state}'

# Halt a specific attack
curl -X DELETE "https://api.gremlin.com/v1/attacks/ATTACK_GUID" \
  -H "Authorization: Key $GREMLIN_API_KEY"

# Halt ALL active attacks (emergency)
curl -X POST "https://api.gremlin.com/v1/attacks/halt" \
  -H "Authorization: Key $GREMLIN_API_KEY"

# Get attack details and results
curl -s "https://api.gremlin.com/v1/attacks/ATTACK_GUID" \
  -H "Authorization: Key $GREMLIN_API_KEY" | jq '{type: .command.type, state, startTime, endTime}'
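
The calls in this sheet all share the same base URL and Authorization header, so they can be wrapped in one function. This is a sketch under assumptions: `gremlin_api` and the `GREMLIN_DRY_RUN` switch are hypothetical names, and the dry-run mode only prints the method and URL so that a destructive call can be reviewed first.

```shell
GREMLIN_API_BASE="https://api.gremlin.com/v1"

# Usage: gremlin_api METHOD PATH [JSON_BODY]
# With GREMLIN_DRY_RUN=1 (a hypothetical switch) the request is printed, not sent.
gremlin_api() {
  if [ "${GREMLIN_DRY_RUN:-0}" = "1" ]; then
    printf '%s %s%s\n' "$1" "$GREMLIN_API_BASE" "$2"
    return 0
  fi
  if [ -n "${3:-}" ]; then
    # Request with a JSON body.
    curl -s -X "$1" "$GREMLIN_API_BASE$2" \
      -H "Authorization: Key $GREMLIN_API_KEY" \
      -H "Content-Type: application/json" \
      -d "$3"
  else
    # Request without a body (GET, DELETE, or an empty POST).
    curl -s -X "$1" "$GREMLIN_API_BASE$2" \
      -H "Authorization: Key $GREMLIN_API_KEY"
  fi
}
```

With this in place, halting one attack becomes `gremlin_api DELETE /attacks/ATTACK_GUID`, and setting `GREMLIN_DRY_RUN=1` shows the target URL without sending anything.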

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Agent not registering | Invalid team credentials | Verify team ID and secret with gremlin check auth |
| Attack not reaching target | Agent offline or tags mismatch | Check agent status in dashboard; verify host tags |
| Network attacks no effect | Agent lacks NET_ADMIN capability | Ensure agent runs with --cap-add=NET_ADMIN |
| Kubernetes pods not targeted | Agent DaemonSet not on target nodes | Check tolerations and node selectors on DaemonSet |
| Attack auto-halted | Status check failed | Review status check logs; adjust thresholds |
| Permission denied | Insufficient API key scope | Regenerate API key with required permissions |
| Agent high memory usage | Process collection enabled on busy host | Disable collectProcesses or increase resource limits |
| Scenario stuck | One step’s attack target unreachable | Check each step’s target availability; halt and retry |

# Check agent status
sudo systemctl status gremlind
sudo journalctl -u gremlind -n 50

# Verify agent connectivity
gremlin check auth
gremlin check network

# List registered clients
curl -s "https://api.gremlin.com/v1/clients" \
  -H "Authorization: Key $GREMLIN_API_KEY" | jq '.[] | {id: .identifier, tags, online: .active}'

# View attack history
curl -s "https://api.gremlin.com/v1/attacks?pageSize=10" \
  -H "Authorization: Key $GREMLIN_API_KEY" | jq '.[] | {guid, type: .command.type, state}'
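
Right after a restart the agent may need a few seconds before connectivity checks pass. The sketch below retries a command with a doubling delay; `retry` is a hypothetical helper, not part of the Gremlin CLI.

```shell
# Retry a command up to MAX_TRIES times, doubling the wait after each failure.
# Usage: retry MAX_TRIES INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  while :; do
    "$@" && return 0
    attempts=$((attempts - 1))
    # Give up once every attempt has been used.
    [ "$attempts" -le 0 ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
  done
}
```

For example, `retry 5 2 gremlin check auth` tries up to five times, waiting 2, 4, 8, and 16 seconds between attempts.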