
Runbook Cheat Sheet

Overview

Runbooks are structured documents that codify operational procedures for routine tasks, incident response, and troubleshooting workflows. They transform tribal knowledge into repeatable, executable processes that any team member can follow regardless of experience level. Modern runbooks go beyond static documentation by incorporating automated steps, decision trees, and integration with monitoring and alerting systems.

Runbook automation bridges the gap between manual operations and full automation by providing semi-automated workflows where human judgment is still required for critical decisions. Tools like Rundeck, AWS Systems Manager, and custom scripts enable teams to execute runbook steps programmatically, reducing mean time to recovery (MTTR) and minimizing human error during high-stress incidents. Well-maintained runbooks are a cornerstone of SRE practices and operational excellence.
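
To illustrate the semi-automated idea, here is a minimal sketch of a step that runs only after the operator approves it. The helper name `confirm_and_run` is hypothetical and is not part of Rundeck or AWS Systems Manager; it only shows one way to keep a human decision in front of a risky command.

```shell
#!/usr/bin/env bash
# Sketch: run a command only after the operator approves it.
# confirm_and_run is a hypothetical helper, not part of any tool named above.

confirm_and_run() {
  local description="$1"
  shift
  local answer=""

  # Prompt on stderr so stdout carries only the command's own output.
  printf '%s\nProceed? [y/N]: ' "$description" >&2
  read -r answer || answer=""

  if [ "$answer" = "y" ] || [ "$answer" = "Y" ]; then
    "$@"
  else
    echo "SKIPPED: $description" >&2
    return 2
  fi
}
```

A call such as `confirm_and_run "Restart the worker service" systemctl restart worker` would then wait for a `y` before restarting anything; any other answer skips the step and returns exit code 2.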

Installation

Rundeck (Self-Hosted Runbook Automation)

# Install Rundeck via Docker
docker run -d \
  --name rundeck \
  -p 4440:4440 \
  -e RUNDECK_GRAILS_URL=http://localhost:4440 \
  -v rundeck-data:/home/rundeck/server/data \
  rundeck/rundeck:4.17.0

# Install via apt (Debian/Ubuntu)
curl https://raw.githubusercontent.com/rundeck/packaging/main/scripts/deb-setup.sh | sudo bash
sudo apt-get install rundeck

# Install via yum (RHEL/CentOS)
curl https://raw.githubusercontent.com/rundeck/packaging/main/scripts/rpm-setup.sh | sudo bash
sudo yum install rundeck

# Start Rundeck
sudo systemctl start rundeckd
sudo systemctl enable rundeckd

AWS Systems Manager (SSM) Runbooks

# Install AWS CLI (if not already installed)
pip install awscli

# Verify SSM access
aws ssm describe-document --name "AWS-RunShellScript"

# List available SSM automation documents
aws ssm list-documents --document-filter-list \
  "key=DocumentType,value=Automation" --max-results 20

Jupyter Runbooks

# Install Jupyter for interactive runbooks
pip install jupyterlab ipywidgets papermill

# Install runbook-specific extensions
pip install jupyter-runbook nbformat

# Start Jupyter for runbook authoring
jupyter lab --port 8888

Core Concepts — Runbook Structure

Standard Runbook Template (Markdown)

# Runbook: Service Recovery — Payment API
# Owner: Platform Engineering
# Last Updated: 2026-05-18
# Severity: P1

## Metadata
- **Service**: payment-api
- **Escalation Contact**: oncall-payments@company.com
- **Dashboards**: [Grafana](https://grafana.internal/d/payments)
- **Runbook ID**: RB-PAY-001

## Symptoms
- Payment API error rate > 5%
- Latency P99 > 2000ms
- Alert: `PaymentAPIHighErrorRate`

## Diagnosis Steps
1. Check service health endpoint
2. Review recent deployments
3. Verify database connectivity
4. Check downstream dependencies

## Resolution Steps
### Step 1: Verify the Issue
### Step 2: Identify Root Cause
### Step 3: Apply Fix
### Step 4: Verify Resolution

## Rollback Procedure
## Post-Incident Actions
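
A template is only useful if runbooks actually follow it. The sketch below checks that a Markdown runbook contains the level-2 sections from the template above. The function name `check_runbook_sections` is an assumption made for illustration, and the check is deliberately strict: it expects each heading on its own line, spelled exactly as in the template.

```shell
#!/usr/bin/env bash
# Sketch: verify that a Markdown runbook contains the template's sections.
# check_runbook_sections is a hypothetical helper name.

check_runbook_sections() {
  local file="$1"
  local missing=0
  local section

  for section in "Metadata" "Symptoms" "Diagnosis Steps" \
                 "Resolution Steps" "Rollback Procedure" "Post-Incident Actions"; do
    # -x matches the whole line, -F treats the heading as a literal string.
    if ! grep -qxF "## ${section}" "$file"; then
      echo "MISSING: ## ${section}"
      missing=$((missing + 1))
    fi
  done

  # Succeed only when every section is present.
  [ "$missing" -eq 0 ]
}
```

Running it over every runbook in a repository (for example from a CI job) gives an early warning when a new runbook omits its rollback or post-incident section.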

Executable Runbook (Bash)

#!/bin/bash
# Runbook: Database Connection Pool Recovery
# RB-DB-001 | Owner: SRE Team
set -euo pipefail

LOG_FILE="/var/log/runbooks/db-pool-recovery-$(date +%Y%m%d-%H%M%S).log"
exec > >(tee -a "$LOG_FILE") 2>&1

echo "=== Database Connection Pool Recovery Runbook ==="
echo "Started: $(date -u)"
echo "Operator: $(whoami)"

# Step 1: Diagnose
echo -e "\n--- Step 1: Check connection pool status ---"
ACTIVE=$(psql -h prod-db.internal -U monitor -t -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';")
IDLE=$(psql -h prod-db.internal -U monitor -t -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'idle';")
TOTAL=$(psql -h prod-db.internal -U monitor -t -c \
  "SELECT count(*) FROM pg_stat_activity;")

echo "Active: $ACTIVE | Idle: $IDLE | Total: $TOTAL"

# Step 2: Decision point
if [ "$TOTAL" -gt 90 ]; then
  echo "WARNING: Connection pool near capacity (${TOTAL}/100)"
  echo "--- Step 2: Terminate idle connections ---"
  
  read -r -p "Terminate connections idle for more than 5 minutes? [y/N]: " confirm
  if [ "$confirm" = "y" ]; then
    # state_change records when the session became idle;
    # query_start only records when its last query began.
    psql -h prod-db.internal -U admin -c \
      "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
       WHERE state = 'idle' AND state_change < now() - interval '5 minutes';"
    echo "Idle connections terminated."
  fi
fi

# Step 3: Verify
echo -e "\n--- Step 3: Verify recovery ---"
sleep 10
NEW_TOTAL=$(psql -h prod-db.internal -U monitor -t -c \
  "SELECT count(*) FROM pg_stat_activity;")
echo "Connections after recovery: $NEW_TOTAL"

if [ "$NEW_TOTAL" -lt 80 ]; then
  echo "SUCCESS: Connection pool recovered"
else
  echo "ESCALATE: Pool still near capacity — escalate to DBA team"
fi

echo -e "\nCompleted: $(date -u)"
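
The thresholds in the runbook above (more than 90 of 100 connections triggers cleanup, fewer than 80 counts as recovered) can be pulled into a small function that is testable without a database. This is a sketch under assumptions: the names `to_int` and `pool_state` are hypothetical, and the default maximum of 100 mirrors the hard-coded value in the echo line above rather than anything read from PostgreSQL.

```shell
#!/usr/bin/env bash
# Sketch: classify connection pool usage from a raw psql count.
# to_int and pool_state are hypothetical helper names.

to_int() {
  # psql -t pads its output with whitespace; strip it and insist on digits.
  local trimmed
  trimmed=$(printf '%s' "$1" | tr -d '[:space:]')
  case "$trimmed" in
    ''|*[!0-9]*) echo "not an integer: '$1'" >&2; return 1 ;;
    *) echo "$trimmed" ;;
  esac
}

pool_state() {
  local total
  local max="${2:-100}"
  local pct
  total=$(to_int "$1") || return 1
  pct=$(( total * 100 / max ))

  if [ "$pct" -gt 90 ]; then
    echo "CRITICAL"   # matches the cleanup branch above
  elif [ "$pct" -ge 80 ]; then
    echo "WARN"       # not yet recovered: the runbook above would escalate
  else
    echo "OK"
  fi
}
```

Validating the count first also means a failed or empty psql result stops the runbook with a clear message instead of an "integer expression expected" error in the comparison.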

Core Patterns — Incident Response Runbooks

Health Check Pattern

#!/bin/bash
# Generic service health check runbook
SERVICE_NAME="${1:?Usage: $0 <service-name>}"
NAMESPACE="${2:-production}"

echo "=== Health Check: $SERVICE_NAME ==="

# Check Kubernetes pods
echo "--- Pod Status ---"
kubectl get pods -n "$NAMESPACE" -l "app=$SERVICE_NAME" \
  -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount,AGE:.metadata.creationTimestamp"

# Check recent events
echo -e "\n--- Recent Events ---"
kubectl get events -n "$NAMESPACE" --field-selector "involvedObject.name=$SERVICE_NAME" \
  --sort-by='.lastTimestamp' | tail -10

# Check resource usage
echo -e "\n--- Resource Usage ---"
kubectl top pods -n "$NAMESPACE" -l "app=$SERVICE_NAME"

# Check endpoints
echo -e "\n--- Endpoint Health ---"
CLUSTER_IP=$(kubectl get svc "$SERVICE_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.clusterIP}')
PORT=$(kubectl get svc "$SERVICE_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.ports[0].port}')
kubectl run health-check --rm -i --restart=Never --image=curlimages/curl -- \
  curl -sf "http://${CLUSTER_IP}:${PORT}/health" || echo "Health check FAILED"

# Check recent logs for errors
echo -e "\n--- Recent Errors ---"
kubectl logs -n "$NAMESPACE" -l "app=$SERVICE_NAME" --tail=50 --since=15m \
  | grep -iE "error|exception|fatal|panic" | tail -10
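
One caveat if this health check is ever run under `set -euo pipefail`: `grep` exits with status 1 when it finds no matching lines, so a clean log would make the final pipeline fail. A minimal sketch of a counting helper that tolerates that case follows; the name `count_errors` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count error-like lines on stdin without failing on a clean log.
# count_errors is a hypothetical helper name.

count_errors() {
  # grep -c prints the number of matching lines but exits 1 when it is 0,
  # so the exit status is masked to keep strict-mode scripts running.
  grep -ciE 'error|exception|fatal|panic' || true
}
```

It could be fed from the same log command as above, for example `kubectl logs ... | count_errors`, and the resulting number compared against a threshold.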

Rollback Pattern

#!/bin/bash
# Deployment rollback runbook
SERVICE_NAME="${1:?Usage: $0 <service-name> [revision]}"
REVISION="${2:-}"
NAMESPACE="production"

echo "=== Rollback Runbook: $SERVICE_NAME ==="

# Step 1: Document current state
echo "--- Current Deployment ---"
kubectl get deployment "$SERVICE_NAME" -n "$NAMESPACE" \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
echo ""

# Step 2: Check rollout history
echo -e "\n--- Rollout History ---"
kubectl rollout history deployment "$SERVICE_NAME" -n "$NAMESPACE"

# Step 3: Perform rollback
if [ -n "$REVISION" ]; then
  echo -e "\nRolling back to revision $REVISION..."
  kubectl rollout undo deployment "$SERVICE_NAME" -n "$NAMESPACE" \
    --to-revision="$REVISION"
else
  echo -e "\nRolling back to previous revision..."
  kubectl rollout undo deployment "$SERVICE_NAME" -n "$NAMESPACE"
fi

# Step 4: Monitor rollback
echo -e "\n--- Monitoring Rollback ---"
kubectl rollout status deployment "$SERVICE_NAME" -n "$NAMESPACE" --timeout=300s

# Step 5: Verify
echo -e "\n--- Post-Rollback Verification ---"
NEW_IMAGE=$(kubectl get deployment "$SERVICE_NAME" -n "$NAMESPACE" \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "Current image: $NEW_IMAGE"

kubectl get pods -n "$NAMESPACE" -l "app=$SERVICE_NAME" \
  -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,READY:.status.conditions[?(@.type=='Ready')].status"
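
Before running the undo step without an explicit revision, it can help to show the operator which revision is the likely target. The sketch below reads the text printed by `kubectl rollout history` on stdin and prints the second-highest revision number. The function name `previous_revision` is hypothetical, and the parsing assumes the usual two-column `REVISION  CHANGE-CAUSE` layout, so treat it as illustrative rather than a stable interface.

```shell
#!/usr/bin/env bash
# Sketch: find the revision that a plain "rollout undo" would most likely target.
# previous_revision is a hypothetical helper; it parses history text from stdin.

previous_revision() {
  local revisions
  local count

  # Keep only lines whose first column is a revision number.
  revisions=$(awk '$1 ~ /^[0-9]+$/ {print $1}' | sort -n)
  count=$(printf '%s\n' "$revisions" | grep -c '[0-9]' || true)

  if [ "$count" -lt 2 ]; then
    echo "No earlier revision to roll back to" >&2
    return 1
  fi

  # Second-highest revision is the one just before the current rollout.
  printf '%s\n' "$revisions" | tail -n 2 | head -n 1
}
```

Usage would look like `kubectl rollout history deployment "$SERVICE_NAME" -n "$NAMESPACE" | previous_revision`, with the result echoed to the operator before the rollback starts.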

Configuration

Rundeck Job Definition

# rundeck-job.yaml
- defaultTab: nodes
  description: "Restart application service with health verification"
  executionEnabled: true
  group: operations/restart
  loglevel: INFO
  name: restart-service
  nodeFilterEditable: true
  notification:
    onfailure:
      plugin:
        type: SlackNotification
        configuration:
          webhook_url: https://hooks.slack.com/services/xxx
    onsuccess:
      plugin:
        type: SlackNotification
        configuration:
          webhook_url: https://hooks.slack.com/services/xxx
  options:
    - name: service_name
      description: "Service to restart"
      required: true
      enforced: true
      values: [api-server, worker, scheduler, gateway]
    - name: environment
      description: "Target environment"
      required: true
      enforced: true
      values: [staging, production]
      value: staging
  sequence:
    commands:
      - description: "Pre-check: verify service exists"
        exec: systemctl is-enabled ${option.service_name}
      - description: "Graceful restart"
        exec: systemctl restart ${option.service_name}
      - description: "Wait for startup"
        exec: sleep 15
      - description: "Verify health"
        exec: curl -sf http://localhost:8080/health || exit 1
    keepgoing: false
    strategy: node-first
  scheduleEnabled: false
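
The `enforced: true` options above restrict input to a fixed list of values. If the same restart steps are ever run as a standalone script outside Rundeck, that protection is lost, so a comparable allow-list check is worth having. A minimal sketch, with `validate_option` as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: reject option values that are not on an allow-list.
# validate_option is a hypothetical helper, not a Rundeck feature.

validate_option() {
  local name="$1"
  local value="$2"
  shift 2
  local allowed

  # Remaining arguments are the permitted values.
  for allowed in "$@"; do
    if [ "$value" = "$allowed" ]; then
      return 0
    fi
  done

  echo "Invalid $name: '$value' (allowed: $*)" >&2
  return 1
}
```

For example, `validate_option service_name "$1" api-server worker scheduler gateway || exit 1` mirrors the first option in the job definition.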

AWS SSM Automation Document

# ssm-runbook.yaml
schemaVersion: '0.3'
description: 'Runbook: EC2 Instance Recovery'
assumeRole: '{{ AutomationAssumeRole }}'
parameters:
  InstanceId:
    type: String
    description: 'EC2 instance to recover'
  AutomationAssumeRole:
    type: String
    description: 'IAM role for automation'
mainSteps:
  - name: checkInstanceState
    action: 'aws:executeAwsApi'
    inputs:
      Service: ec2
      Api: DescribeInstanceStatus
      InstanceIds:
        - '{{ InstanceId }}'
      # By default DescribeInstanceStatus returns only running instances
      IncludeAllInstances: true
    outputs:
      - Name: InstanceState
        Selector: '$.InstanceStatuses[0].InstanceState.Name'
        Type: String

  - name: stopInstance
    action: 'aws:changeInstanceState'
    inputs:
      InstanceIds:
        - '{{ InstanceId }}'
      DesiredState: stopped

  - name: startInstance
    action: 'aws:changeInstanceState'
    inputs:
      InstanceIds:
        - '{{ InstanceId }}'
      DesiredState: running

  - name: verifyRecovery
    action: 'aws:waitForAwsResourceProperty'
    timeoutSeconds: 300
    inputs:
      Service: ec2
      Api: DescribeInstanceStatus
      InstanceIds:
        - '{{ InstanceId }}'
      PropertySelector: '$.InstanceStatuses[0].InstanceStatus.Status'
      DesiredValues:
        - ok
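
To use the document above it has to be registered with Systems Manager and then started with parameters. The sketch below only prints the two AWS CLI calls so they can be reviewed before anyone runs them. The function name `print_ssm_commands` is hypothetical, the document name and role ARN are placeholders supplied by the caller, and the instance ID check is a loose sanity test rather than full validation.

```shell
#!/usr/bin/env bash
# Sketch: print the AWS CLI calls that register and start the automation document.
# print_ssm_commands is a hypothetical helper; it does not call AWS itself.

print_ssm_commands() {
  local doc_name="$1"
  local instance_id="$2"
  local role_arn="$3"

  # Loose sanity check: "i-" followed by lowercase hex characters only.
  case "$instance_id" in
    i-|i-*[!0-9a-f]*) echo "Invalid instance ID: $instance_id" >&2; return 1 ;;
    i-*) ;;
    *) echo "Invalid instance ID: $instance_id" >&2; return 1 ;;
  esac

  echo "aws ssm create-document --name $doc_name --document-type Automation --document-format YAML --content file://ssm-runbook.yaml"
  echo "aws ssm start-automation-execution --document-name $doc_name --parameters InstanceId=$instance_id,AutomationAssumeRole=$role_arn"
}
```

Printing instead of executing keeps the human review step in place; once the output looks right, the lines can be pasted into a shell that has the appropriate credentials.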

Advanced Usage

Parameterized Runbook with Papermill

# Execute a Jupyter runbook with parameters
papermill incident-investigation.ipynb output-$(date +%s).ipynb \
  -p service_name "payment-api" \
  -p time_range "1h" \
  -p severity "P1" \
  -p incident_id "INC-2026-0518"

# Batch execute runbook across environments
for env in staging production; do
  papermill scaling-check.ipynb "output-${env}.ipynb" \
    -p environment "$env" \
    -p threshold 80
done

Decision Tree Automation

#!/bin/bash
# Automated triage runbook with decision tree
diagnose_high_latency() {
  local service="$1"
  
  echo "=== Diagnosing High Latency: $service ==="
  
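  # Check CPU: average millicores across pods (kubectl top prints values like "250m")
  CPU=$(kubectl top pods -l "app=$service" -n prod --no-headers \
    | awk '{gsub(/m/, "", $2); sum += $2} END {if (NR) printf "%d", sum / NR}')
  
  if [ "${CPU:-0}" -gt 800 ]; then
    echo "FINDING: High CPU (${CPU}m), likely compute-bound"
    echo "ACTION: Scale horizontally"
    # kubectl scale takes an absolute replica count, so read the current one and add 2
    CURRENT=$(kubectl get deployment "$service" -n prod -o jsonpath='{.spec.replicas}')
    kubectl scale deployment "$service" -n prod --replicas=$((CURRENT + 2))
    return
  fi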
  
  # Check memory: highest per-pod usage in Mi against the container limit
  # (assumes the limit is written in Mi, for example "512Mi")
  MEM_MI=$(kubectl top pods -l "app=$service" -n prod --no-headers \
    | awk '{print $3}' | sed 's/Mi//' | sort -rn | head -1)
  LIMIT=$(kubectl get deployment "$service" -n prod \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}' | sed 's/Mi//')
  
  # Skip the check when no limit is set, which also avoids dividing by zero
  if [ "${LIMIT:-0}" -gt 0 ] && [ $((${MEM_MI:-0} * 100 / LIMIT)) -gt 85 ]; then
    echo "FINDING: Memory pressure (${MEM_MI}Mi / ${LIMIT}Mi)"
    echo "ACTION: Restart pods to clear memory"
    kubectl rollout restart deployment "$service" -n prod
    return
  fi
  
  # Check downstream dependencies
  echo "FINDING: CPU/Memory normal — checking dependencies"
  echo "ACTION: Run dependency health checks"
  for dep in database cache queue; do
    echo -n "  $dep: "
    kubectl exec -n prod deploy/"$service" -- curl -sf "http://${dep}:8080/health" \
      && echo "OK" || echo "FAILING"
  done
}
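
The branching in the function above can be separated from the `kubectl` calls so the decision logic is testable without a cluster. A minimal sketch, assuming the same thresholds (800m CPU, 85% of the memory limit); the function name `triage_latency` and the action labels it prints are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: the decision tree as a pure function of already-collected metrics.
# triage_latency and its action labels are hypothetical names.

triage_latency() {
  local cpu_m="$1"      # average CPU in millicores
  local mem_mi="$2"     # highest per-pod memory usage in Mi
  local limit_mi="$3"   # container memory limit in Mi (0 if unset)

  if [ "$cpu_m" -gt 800 ]; then
    echo "scale-out"
  elif [ "$limit_mi" -gt 0 ] && [ $(( mem_mi * 100 / limit_mi )) -gt 85 ]; then
    echo "restart"
  else
    echo "check-dependencies"
  fi
}
```

The caller gathers the three numbers, passes them in, and maps the printed label to the corresponding action, which keeps the thresholds in one place and lets them be unit-tested.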

Runbook Testing Framework

#!/bin/bash
# Test runbooks in staging before production use
test_runbook() {
  local runbook="$1"
  local env="staging"
  
  echo "=== Testing Runbook: $runbook ==="
  echo "Environment: $env"
  echo "Started: $(date -u)"
  
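  # Create an isolated test namespace and remember its name
  local test_ns="runbook-test-$(date +%s)"
  kubectl create namespace "$test_ns" || true
  
  # Execute the runbook in dry-run mode. The runbook itself must honor DRY_RUN;
  # TEST_NAMESPACE is an assumed variable that the runbook would need to read.
  DRY_RUN=true ENVIRONMENT="$env" TEST_NAMESPACE="$test_ns" bash "$runbook" 2>&1 | tee "/tmp/runbook-test-$$.log"
  
  # Capture the runbook's exit code (not tee's) before running anything else
  EXIT_CODE=${PIPESTATUS[0]}
  
  # Remove the test namespace so repeated runs do not accumulate
  kubectl delete namespace "$test_ns" --wait=false || true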
  
  if [ $EXIT_CODE -eq 0 ]; then
    echo "PASS: Runbook executed successfully"
  else
    echo "FAIL: Runbook exited with code $EXIT_CODE"
  fi
  
  return $EXIT_CODE
}

# Run all runbook tests
for rb in runbooks/*.sh; do
  test_runbook "$rb"
done
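
The framework above sets `DRY_RUN=true`, but that only has an effect if each runbook checks the variable. One common way to do so is to route every state-changing command through a small wrapper. This is a sketch under that assumption; the wrapper name `run` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: a wrapper that prints commands instead of executing them in dry-run mode.
# run is a hypothetical helper; runbooks would call it for every mutating command.

run() {
  if [ "${DRY_RUN:-false}" = "true" ]; then
    # Show what would have been executed.
    echo "[dry-run] $*"
  else
    "$@"
  fi
}
```

Inside a runbook, `run kubectl rollout restart deployment "$service" -n prod` would then restart the deployment normally and only print the command when the test harness sets `DRY_RUN=true`. Read-only diagnostics can still be called directly.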

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| Runbook fails silently | Missing set -euo pipefail | Add strict mode to all bash runbooks |
| Stale runbook content | Not reviewed regularly | Schedule quarterly runbook reviews |
| Variables undefined | Environment not sourced | Source env files at runbook start |
| Timeout during execution | Network or service latency | Add timeouts to all network calls |
| Permission denied | Insufficient RBAC | Verify service account has required roles |
| Rundeck job hangs | Node unreachable | Check SSH connectivity and node filters |
| SSM document fails | IAM role missing permissions | Verify AssumeRole policy includes required actions |
| Papermill kernel dies | Memory limits exceeded | Increase container memory or optimize notebook |

# Validate runbook syntax
bash -n runbook.sh && echo "Syntax OK" || echo "Syntax errors found"

# Check for common issues
shellcheck runbook.sh

# Test with debug output
bash -x runbook.sh 2>&1 | tee debug-output.log

# Verify required tools are available
for tool in kubectl curl jq psql; do
  command -v "$tool" >/dev/null 2>&1 && echo "$tool: OK" || echo "$tool: MISSING"
done
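
For the "Add timeouts to all network calls" row in the table above, one possible approach is a retry wrapper built on `timeout`. This sketch assumes GNU coreutils `timeout` is installed (it is not present by default on macOS); the function name `retry` is hypothetical. Because `timeout` starts a new process, it works for external commands such as `curl` or `psql`, not for shell functions.

```shell
#!/usr/bin/env bash
# Sketch: retry an external command, bounding each attempt with timeout(1).
# retry is a hypothetical helper; it assumes GNU coreutils timeout is available.

retry() {
  local attempts="$1"
  local per_try_seconds="$2"
  shift 2
  local n=1

  while :; do
    # timeout kills the command and returns non-zero if it runs too long.
    if timeout "$per_try_seconds" "$@"; then
      return 0
    fi
    echo "attempt $n/$attempts failed: $*" >&2

    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep 1   # brief pause before the next attempt
  done
}
```

A call like `retry 3 5 curl -sf http://localhost:8080/health` would try the health endpoint up to three times with a five-second limit per attempt.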