Runbook Cheat Sheet
Overview
Runbooks are structured documents that codify operational procedures for routine tasks, incident response, and troubleshooting workflows. They transform tribal knowledge into repeatable, executable processes that any team member can follow regardless of experience level. Modern runbooks go beyond static documentation by incorporating automated steps, decision trees, and integration with monitoring and alerting systems.
Runbook automation bridges the gap between manual operations and full automation by providing semi-automated workflows where human judgment is still required for critical decisions. Tools like Rundeck, AWS Systems Manager, and custom scripts enable teams to execute runbook steps programmatically, reducing mean time to recovery (MTTR) and minimizing human error during high-stress incidents. Well-maintained runbooks are a cornerstone of SRE practices and operational excellence.
Installation
Rundeck (Self-Hosted Runbook Automation)
# Install Rundeck via Docker
docker run -d \
--name rundeck \
-p 4440:4440 \
-e RUNDECK_GRAILS_URL=http://localhost:4440 \
-v rundeck-data:/home/rundeck/server/data \
rundeck/rundeck:4.17.0
# Install via apt (Debian/Ubuntu)
curl https://raw.githubusercontent.com/rundeck/packaging/main/scripts/deb-setup.sh | sudo bash
sudo apt-get install rundeck
# Install via yum (RHEL/CentOS)
curl https://raw.githubusercontent.com/rundeck/packaging/main/scripts/rpm-setup.sh | sudo bash
sudo yum install rundeck
# Start Rundeck
sudo systemctl start rundeckd
sudo systemctl enable rundeckd
AWS Systems Manager (SSM) Runbooks
# Install AWS CLI (if not already installed)
pip install awscli
# Verify SSM access
aws ssm describe-document --name "AWS-RunShellScript"
# List available SSM automation documents
aws ssm list-documents --document-filter-list \
"key=DocumentType,value=Automation" --max-results 20
Jupyter Runbooks
# Install Jupyter for interactive runbooks
pip install jupyterlab ipywidgets papermill
# Install runbook-specific extensions
pip install jupyter-runbook nbformat
# Start Jupyter for runbook authoring
jupyter lab --port 8888
Core Concepts — Runbook Structure
Standard Runbook Template (Markdown)
# Runbook: Service Recovery — Payment API
# Owner: Platform Engineering
# Last Updated: 2026-05-18
# Severity: P1
## Metadata
- **Service**: payment-api
- **Escalation Contact**: oncall-payments@company.com
- **Dashboards**: [Grafana](https://grafana.internal/d/payments)
- **Runbook ID**: RB-PAY-001
## Symptoms
- Payment API error rate > 5%
- Latency P99 > 2000ms
- Alert: `PaymentAPIHighErrorRate`
## Diagnosis Steps
1. Check service health endpoint
2. Review recent deployments
3. Verify database connectivity
4. Check downstream dependencies
## Resolution Steps
### Step 1: Verify the Issue
### Step 2: Identify Root Cause
### Step 3: Apply Fix
### Step 4: Verify Resolution
## Rollback Procedure
## Post-Incident Actions
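A template only helps if runbooks actually follow it. As a minimal sketch, the hypothetical helper `check_runbook_sections` below checks that a Markdown runbook contains the second-level headings from the template above. The function name and the exact-line matching rule are assumptions made for illustration, not part of any existing tool.

```shell
#!/bin/bash
# Hypothetical helper: report template sections missing from a Markdown runbook.
# Section names mirror the template above; the function name is illustrative.
check_runbook_sections() {
  local file="$1"
  local missing=0
  local section
  for section in "Metadata" "Symptoms" "Diagnosis Steps" "Resolution Steps" \
                 "Rollback Procedure" "Post-Incident Actions"; do
    # -x matches the whole line, -F treats the heading as a literal string
    if ! grep -qxF "## $section" "$file"; then
      echo "MISSING: ## $section"
      missing=$((missing + 1))
    fi
  done
  # Exit status is 0 only when every section was found
  [ "$missing" -eq 0 ]
}
```

A check like `check_runbook_sections runbooks/payment-api.md || exit 1` (the path is a placeholder) could run in CI so incomplete runbooks are caught at review time.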
Executable Runbook (Bash)
#!/bin/bash
# Runbook: Database Connection Pool Recovery
# RB-DB-001 | Owner: SRE Team
set -euo pipefail
LOG_DIR="/var/log/runbooks"
mkdir -p "$LOG_DIR"  # tee cannot write the log if the directory does not exist
LOG_FILE="$LOG_DIR/db-pool-recovery-$(date +%Y%m%d-%H%M%S).log"
exec > >(tee -a "$LOG_FILE") 2>&1
echo "=== Database Connection Pool Recovery Runbook ==="
echo "Started: $(date -u)"
echo "Operator: $(whoami)"
# Step 1: Diagnose
echo -e "\n--- Step 1: Check connection pool status ---"
# -A (unaligned) with -t (tuples only) returns a bare number without padding
ACTIVE=$(psql -h prod-db.internal -U monitor -At -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';")
IDLE=$(psql -h prod-db.internal -U monitor -At -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'idle';")
TOTAL=$(psql -h prod-db.internal -U monitor -At -c \
  "SELECT count(*) FROM pg_stat_activity;")
echo "Active: $ACTIVE | Idle: $IDLE | Total: $TOTAL"
# Step 2: Decision point
if [ "$TOTAL" -gt 90 ]; then
  echo "WARNING: Connection pool near capacity (${TOTAL}/100)"
  echo "--- Step 2: Terminate idle connections ---"
  read -r -p "Terminate connections idle for more than 5 minutes? [y/N]: " confirm
  if [ "$confirm" = "y" ]; then
    # state_change records when the session became idle; query_start only
    # records when its last query began
    psql -h prod-db.internal -U admin -c \
      "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
       WHERE state = 'idle' AND state_change < now() - interval '5 minutes';"
    echo "Idle connections terminated."
  fi
fi
# Step 3: Verify
echo -e "\n--- Step 3: Verify recovery ---"
sleep 10
NEW_TOTAL=$(psql -h prod-db.internal -U monitor -t -c \
"SELECT count(*) FROM pg_stat_activity;")
echo "Connections after recovery: $NEW_TOTAL"
if [ "$NEW_TOTAL" -lt 80 ]; then
echo "SUCCESS: Connection pool recovered"
else
echo "ESCALATE: Pool still near capacity — escalate to DBA team"
fi
echo -e "\nCompleted: $(date -u)"
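The runbook above hard-codes a limit of 100 connections and compares raw counts against 90 and 80. As a hedged sketch, the hypothetical helper `classify_pool` expresses the same thresholds as percentages of a configurable maximum. The function name and the three status labels are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: classify pool usage as a percentage of the maximum.
# The 80/90 thresholds mirror the runbook above; labels are illustrative.
classify_pool() {
  local total="$1" max="$2"
  if [ "$max" -le 0 ]; then
    echo "invalid"
    return 1
  fi
  # Integer percentage, rounded down
  local pct=$(( total * 100 / max ))
  if [ "$pct" -gt 90 ]; then
    echo "critical"
  elif [ "$pct" -ge 80 ]; then
    echo "warning"
  else
    echo "ok"
  fi
}
```

The maximum could be read from the server with `psql -At -c "SHOW max_connections;"` and passed as the second argument, so the runbook keeps working if the limit changes.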
Core Patterns — Incident Response Runbooks
Health Check Pattern
#!/bin/bash
# Generic service health check runbook
SERVICE_NAME="${1:?Usage: $0 <service-name>}"
NAMESPACE="${2:-production}"
echo "=== Health Check: $SERVICE_NAME ==="
# Check Kubernetes pods
echo "--- Pod Status ---"
kubectl get pods -n "$NAMESPACE" -l "app=$SERVICE_NAME" \
-o custom-columns="NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount,AGE:.metadata.creationTimestamp"
# Check recent events
echo -e "\n--- Recent Events ---"
kubectl get events -n "$NAMESPACE" --field-selector "involvedObject.name=$SERVICE_NAME" \
--sort-by='.lastTimestamp' | tail -10
# Check resource usage
echo -e "\n--- Resource Usage ---"
kubectl top pods -n "$NAMESPACE" -l "app=$SERVICE_NAME"
# Check endpoints
echo -e "\n--- Endpoint Health ---"
CLUSTER_IP=$(kubectl get svc "$SERVICE_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.clusterIP}')
PORT=$(kubectl get svc "$SERVICE_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.ports[0].port}')
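# Run the probe pod in the target namespace and cap the request at 10 seconds
kubectl run health-check -n "$NAMESPACE" --rm -i --restart=Never --image=curlimages/curl -- \
  curl -sf --max-time 10 "http://${CLUSTER_IP}:${PORT}/health" || echo "Health check FAILED"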
# Check recent logs for errors
echo -e "\n--- Recent Errors ---"
kubectl logs -n "$NAMESPACE" -l "app=$SERVICE_NAME" --tail=50 --since=15m \
| grep -iE "error|exception|fatal|panic" | tail -10
Rollback Pattern
#!/bin/bash
# Deployment rollback runbook
SERVICE_NAME="${1:?Usage: $0 <service-name> [revision]}"
REVISION="${2:-}"
NAMESPACE="production"
echo "=== Rollback Runbook: $SERVICE_NAME ==="
# Step 1: Document current state
echo "--- Current Deployment ---"
kubectl get deployment "$SERVICE_NAME" -n "$NAMESPACE" \
-o jsonpath='{.spec.template.spec.containers[0].image}'
echo ""
# Step 2: Check rollout history
echo -e "\n--- Rollout History ---"
kubectl rollout history deployment "$SERVICE_NAME" -n "$NAMESPACE"
# Step 3: Perform rollback
if [ -n "$REVISION" ]; then
  echo -e "\nRolling back to revision $REVISION..."
  kubectl rollout undo deployment "$SERVICE_NAME" -n "$NAMESPACE" \
    --to-revision="$REVISION"
else
  echo -e "\nRolling back to previous revision..."
  kubectl rollout undo deployment "$SERVICE_NAME" -n "$NAMESPACE"
fi
# Step 4: Monitor rollback
echo -e "\n--- Monitoring Rollback ---"
kubectl rollout status deployment "$SERVICE_NAME" -n "$NAMESPACE" --timeout=300s
# Step 5: Verify
echo -e "\n--- Post-Rollback Verification ---"
NEW_IMAGE=$(kubectl get deployment "$SERVICE_NAME" -n "$NAMESPACE" \
-o jsonpath='{.spec.template.spec.containers[0].image}')
echo "Current image: $NEW_IMAGE"
kubectl get pods -n "$NAMESPACE" -l "app=$SERVICE_NAME" \
-o custom-columns="NAME:.metadata.name,STATUS:.status.phase,READY:.status.conditions[?(@.type=='Ready')].status"
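The rollback script above acts on production without asking, whereas the database runbook earlier prompts before its destructive step. A minimal sketch of a reusable confirmation gate follows. The name `confirm_step` and the accepted answers are assumptions, not an established convention.

```shell
#!/bin/bash
# Hypothetical helper: ask for explicit confirmation before a destructive step.
# Returns 0 only for "y" or "yes" (any case); anything else, including EOF, declines.
confirm_step() {
  local prompt="$1" answer=""
  # Prompt goes to stderr so captured stdout stays clean
  printf '%s [y/N]: ' "$prompt" >&2
  read -r answer || true
  case "$answer" in
    [yY]|[yY][eE][sS]) return 0 ;;
    *) return 1 ;;
  esac
}
```

Placed before Step 3, a line such as `confirm_step "Roll back $SERVICE_NAME in $NAMESPACE?" || exit 1` would stop the script unless the operator agrees. Defaulting to "no" on empty input or EOF also makes unattended runs fail safe.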
Configuration
Rundeck Job Definition
# rundeck-job.yaml
- defaultTab: nodes
  description: "Restart application service with health verification"
  executionEnabled: true
  group: operations/restart
  loglevel: INFO
  name: restart-service
  nodeFilterEditable: true
  notification:
    onfailure:
      plugin:
        type: SlackNotification
        configuration:
          webhook_url: https://hooks.slack.com/services/xxx
    onsuccess:
      plugin:
        type: SlackNotification
        configuration:
          webhook_url: https://hooks.slack.com/services/xxx
  options:
  - name: service_name
    description: "Service to restart"
    required: true
    enforced: true
    values: [api-server, worker, scheduler, gateway]
  - name: environment
    description: "Target environment"
    required: true
    enforced: true
    values: [staging, production]
    value: staging
  sequence:
    commands:
    - description: "Pre-check: verify service exists"
      exec: systemctl is-enabled ${option.service_name}
    - description: "Graceful restart"
      exec: systemctl restart ${option.service_name}
    - description: "Wait for startup"
      exec: sleep 15
    - description: "Verify health"
      exec: curl -sf http://localhost:8080/health || exit 1
    keepgoing: false
    strategy: node-first
  scheduleEnabled: false
AWS SSM Automation Document
# ssm-runbook.yaml
schemaVersion: '0.3'
description: 'Runbook: EC2 Instance Recovery'
assumeRole: '{{ AutomationAssumeRole }}'
parameters:
  InstanceId:
    type: String
    description: 'EC2 instance to recover'
  AutomationAssumeRole:
    type: String
    description: 'IAM role for automation'
mainSteps:
  - name: checkInstanceState
    action: 'aws:executeAwsApi'
    inputs:
      Service: ec2
      Api: DescribeInstanceStatus
      InstanceIds:
        - '{{ InstanceId }}'
    outputs:
      - Name: InstanceState
        Selector: '$.InstanceStatuses[0].InstanceState.Name'
        Type: String
  - name: stopInstance
    action: 'aws:changeInstanceState'
    inputs:
      InstanceIds:
        - '{{ InstanceId }}'
      DesiredState: stopped
  - name: startInstance
    action: 'aws:changeInstanceState'
    inputs:
      InstanceIds:
        - '{{ InstanceId }}'
      DesiredState: running
  - name: verifyRecovery
    action: 'aws:waitForAwsResourceProperty'
    timeoutSeconds: 300
    inputs:
      Service: ec2
      Api: DescribeInstanceStatus
      InstanceIds:
        - '{{ InstanceId }}'
      PropertySelector: '$.InstanceStatuses[0].InstanceStatus.Status'
      DesiredValues:
        - ok
Advanced Usage
Parameterized Runbook with Papermill
# Execute a Jupyter runbook with parameters
papermill incident-investigation.ipynb output-$(date +%s).ipynb \
-p service_name "payment-api" \
-p time_range "1h" \
-p severity "P1" \
-p incident_id "INC-2026-0518"
# Batch execute runbook across environments
for env in staging production; do
papermill scaling-check.ipynb "output-${env}.ipynb" \
-p environment "$env" \
-p threshold 80
done
Decision Tree Automation
#!/bin/bash
# Automated triage runbook with decision tree
diagnose_high_latency() {
  local service="$1"
  echo "=== Diagnosing High Latency: $service ==="
  # Check CPU: average millicores per pod, truncated to an integer
  # because [ -gt ] cannot compare fractional values
  CPU=$(kubectl top pods -l "app=$service" -n prod --no-headers \
    | awk '{sum+=$2} END {if (NR) printf "%d", sum/NR; else print 0}')
  if [ "$CPU" -gt 800 ]; then
    echo "FINDING: High CPU (${CPU}m), likely compute-bound"
    echo "ACTION: Scale horizontally"
    # kubectl scale needs an absolute count, so add 2 to the current replicas
    REPLICAS=$(kubectl get deployment "$service" -n prod -o jsonpath='{.spec.replicas}')
    kubectl scale deployment "$service" -n prod --replicas=$((REPLICAS + 2))
    return
  fi
  # Check memory: highest per-pod usage against the container limit
  # (assumes both values are reported in Mi)
  MEM_MI=$(kubectl top pods -l "app=$service" -n prod --no-headers \
    | awk '{print $3}' | sed 's/Mi//' | sort -rn | head -1)
  LIMIT=$(kubectl get deployment "$service" -n prod \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}' | sed 's/Mi//')
  if [ $((MEM_MI * 100 / LIMIT)) -gt 85 ]; then
    echo "FINDING: Memory pressure (${MEM_MI}Mi / ${LIMIT}Mi)"
    echo "ACTION: Restart pods to clear memory"
    kubectl rollout restart deployment "$service" -n prod
    return
  fi
  # Check downstream dependencies
  echo "FINDING: CPU/Memory normal, checking dependencies"
  echo "ACTION: Run dependency health checks"
  for dep in database cache queue; do
    echo -n "  $dep: "
    kubectl exec -n prod deploy/"$service" -- curl -sf "http://${dep}:8080/health" \
      && echo "OK" || echo "FAILING"
  done
}
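The memory check above strips a `Mi` suffix with `sed`, which breaks as soon as a limit is written as `2Gi`. A hedged sketch of a unit-aware conversion follows. The helper name `to_mib` is hypothetical, and it deliberately handles only whole-number values with the binary suffixes `Ki`, `Mi`, and `Gi` (or plain bytes), rejecting anything else rather than guessing.

```shell
#!/bin/bash
# Hypothetical helper: convert a Kubernetes memory quantity to whole MiB.
# Supports integer values with Ki, Mi, Gi, or no suffix (bytes).
to_mib() {
  local qty="$1" num unit
  num="${qty%%[!0-9]*}"   # leading digits
  unit="${qty#"$num"}"    # whatever follows the digits
  if [ -z "$num" ]; then
    echo "unsupported quantity: $qty" >&2
    return 1
  fi
  # 10# forces base 10 so a leading zero is not read as octal
  case "$unit" in
    Ki) echo $(( 10#$num / 1024 )) ;;
    Mi) echo $(( 10#$num )) ;;
    Gi) echo $(( 10#$num * 1024 )) ;;
    "") echo $(( 10#$num / 1024 / 1024 )) ;;
    *)  echo "unsupported quantity: $qty" >&2; return 1 ;;
  esac
}
```

With this in place, the limit could be normalized as `LIMIT=$(to_mib "$raw_limit")`, where `raw_limit` is the unmodified jsonpath output, before the percentage is computed.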
Runbook Testing Framework
#!/bin/bash
# Test runbooks in staging before production use
test_runbook() {
  local runbook="$1"
  local env="staging"
  local test_ns="runbook-test-$(date +%s)"
  local log_file="/tmp/runbook-test-$(basename "$runbook" .sh)-$$.log"
  echo "=== Testing Runbook: $runbook ==="
  echo "Environment: $env"
  echo "Started: $(date -u)"
  # Create isolated test namespace
  kubectl create namespace "$test_ns" || true
  # Execute runbook in dry-run mode (only effective if the runbook checks DRY_RUN)
  DRY_RUN=true ENVIRONMENT="$env" bash "$runbook" 2>&1 | tee "$log_file"
  EXIT_CODE=${PIPESTATUS[0]}
  # Remove the test namespace so repeated runs do not accumulate namespaces
  kubectl delete namespace "$test_ns" --wait=false || true
  if [ "$EXIT_CODE" -eq 0 ]; then
    echo "PASS: Runbook executed successfully"
  else
    echo "FAIL: Runbook exited with code $EXIT_CODE"
  fi
  return "$EXIT_CODE"
}
# Run all runbook tests
for rb in runbooks/*.sh; do
test_runbook "$rb"
done
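The test harness above sets `DRY_RUN=true`, but that variable has no effect unless each runbook checks it. One possible approach, sketched here with a hypothetical wrapper named `run_step`, is to route every state-changing command through a function that prints the command instead of executing it when dry-run mode is on.

```shell
#!/bin/bash
# Hypothetical wrapper: print state-changing commands instead of running them
# when DRY_RUN=true. Read-only commands can still be called directly.
run_step() {
  if [ "${DRY_RUN:-false}" = "true" ]; then
    # %q quotes each argument so the printed line can be pasted into a shell
    printf 'DRY-RUN:'
    printf ' %q' "$@"
    printf '\n'
  else
    "$@"
  fi
}
```

Inside a runbook, `kubectl rollout undo deployment "$SERVICE_NAME" -n "$NAMESPACE"` would become `run_step kubectl rollout undo deployment "$SERVICE_NAME" -n "$NAMESPACE"`, while read-only calls such as `kubectl get` stay unwrapped so the dry run still exercises the diagnostic steps.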
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Runbook fails silently | Missing `set -euo pipefail` | Add strict mode to all bash runbooks |
| Stale runbook content | Not reviewed regularly | Schedule quarterly runbook reviews |
| Variables undefined | Environment not sourced | Source env files at runbook start |
| Timeout during execution | Network or service latency | Add timeouts to all network calls |
| Permission denied | Insufficient RBAC | Verify service account has required roles |
| Rundeck job hangs | Node unreachable | Check SSH connectivity and node filters |
| SSM document fails | IAM role missing permissions | Verify AssumeRole policy includes required actions |
| Papermill kernel dies | Memory limits exceeded | Increase container memory or optimize notebook |
# Validate runbook syntax
bash -n runbook.sh && echo "Syntax OK" || echo "Syntax errors found"
# Check for common issues
shellcheck runbook.sh
# Test with debug output
bash -x runbook.sh 2>&1 | tee debug-output.log
# Verify required tools are available
for tool in kubectl curl jq psql; do
command -v "$tool" >/dev/null 2>&1 && echo "$tool: OK" || echo "$tool: MISSING"
done
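The troubleshooting table recommends adding timeouts to all network calls. A minimal sketch of one way to do that is a retry helper built on `timeout` from GNU coreutils. The name `retry_with_timeout` and its argument order are assumptions, and because `timeout` launches an external process, the wrapped command must be a program rather than a shell function.

```shell
#!/bin/bash
# Hypothetical helper: retry a command with a per-attempt timeout.
# Usage: retry_with_timeout <attempts> <timeout-seconds> <delay-seconds> <command...>
retry_with_timeout() {
  local attempts="$1" limit="$2" delay="$3"
  shift 3
  local i
  for (( i = 1; i <= attempts; i++ )); do
    # timeout(1) stops the command if it runs longer than $limit seconds
    if timeout "$limit" "$@"; then
      return 0
    fi
    echo "Attempt $i/$attempts failed" >&2
    # Pause between attempts, but not after the last one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `retry_with_timeout 3 10 5 curl -sf http://localhost:8080/health` would try the health endpoint up to three times, allowing 10 seconds per attempt with a 5-second pause in between.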