PagerDuty Cheatsheet

Installation

CLI Tools Installation

__TABLE_103_

PagerDuty Agent Installation

__TABLE_104_

Basic Commands

Authentication & Setup

Command Description
INLINE_CODE_19 Authenticate and configure API token interactively
INLINE_CODE_20 Set API token via environment variable
INLINE_CODE_21 Test authentication and get current user info
INLINE_CODE_22 Set default user for operations

Incident Management

Command Description
INLINE_CODE_23 List all incidents
INLINE_CODE_24 List only triggered (active) incidents
INLINE_CODE_25 List acknowledged incidents
INLINE_CODE_26 Get detailed information about specific incident
INLINE_CODE_27 Acknowledge an incident
INLINE_CODE_28 Resolve an incident
INLINE_CODE_29 Add note to incident
INLINE_CODE_30 Reassign incident to different user
INLINE_CODE_31 Set incident priority (P1-P5)
INLINE_CODE_32 Snooze incident for specified seconds

Service Management

Command Description
INLINE_CODE_33 List all services
INLINE_CODE_34 Get service details
INLINE_CODE_35 Disable a service
INLINE_CODE_36 Enable a service
INLINE_CODE_37 List integrations for a service

User & On-Call Management

Command Description
INLINE_CODE_38 List all users in account
INLINE_CODE_39 Get user details
INLINE_CODE_40 List current on-call users
INLINE_CODE_41 List user's contact methods
INLINE_CODE_42 List user's notification rules

PagerDuty Agent Commands

Command Description
INLINE_CODE_43 Trigger new incident via agent
INLINE_CODE_44 Acknowledge incident via agent
INLINE_CODE_45 Resolve incident via agent
INLINE_CODE_46 Check agent service status
INLINE_CODE_47 View agent logs in real-time

Advanced Usage

Advanced Incident Operations

Command Description
INLINE_CODE_48 Create incident with full details
INLINE_CODE_49 Merge multiple incidents into one
INLINE_CODE_50 List incidents within date range
INLINE_CODE_51 Filter incidents by service and urgency
INLINE_CODE_52 Extract incident IDs using jq
INLINE_CODE_53 Bulk acknowledge all triggered incidents

REST API Operations (curl)

Command Description
INLINE_CODE_54 List incidents via REST API
INLINE_CODE_55 Create incident via REST API
INLINE_CODE_56 Get on-call schedule via API
INLINE_CODE_57 Update incident status via API

Advanced Agent Operations

Command Description
INLINE_CODE_58 Send alert with severity and incident key
INLINE_CODE_59 Send alert with custom fields
INLINE_CODE_60 Send Events API v2 alert

Schedule Management

__TABLE_113_

Escalation Policy Management

__TABLE_114_

Analytics & Reporting

Command Description
INLINE_CODE_66 Get incident analytics for date range
INLINE_CODE_67 Extract incident data for custom reporting

Configuration

Environment Variables

# Set API token
export PDTOKEN="your_api_token_here"

# Set default region (for EU accounts)
export PD_API_BASE="https://api.eu.pagerduty.com"

# Set default user email
export PD_USER_EMAIL="user@example.com"
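
A script that depends on these variables can check them up front instead of failing halfway through an API call. The following is a minimal sketch; `require_env` is a hypothetical helper name, not something shipped with the PagerDuty tooling, and it assumes the variable names are supplied by the script author (they are passed to `eval`).

```shell
# Sketch only: require_env is a hypothetical helper, not part of any PagerDuty tool.
# It returns non-zero if any named variable is unset or empty.
# The function body runs in a subshell, so its working variables do not leak.
require_env() (
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      missing=1
    fi
  done
  exit "$missing"
)
```

A script could then start with `require_env PDTOKEN PD_USER_EMAIL || exit 1`.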

API Token Generation

  1. Log in to the PagerDuty web interface
  2. Navigate to Configuration → API Access
  3. Click Create New API Key
  4. Choose User Token or Account Token
  5. Copy the token and store it securely

Integration Keys

# Integration keys are service-specific
# Find them at: Service → Integrations → Integration Key

# Use in agent:
pd-send -k "your_integration_key" -t trigger -d "Alert message"

# Use in Events API v2:
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "your_integration_key",
    "event_action": "trigger",
    "payload": {
      "summary": "Server down",
      "severity": "critical",
      "source": "prod-server-01"
    }
  }'

PagerDuty Agent Configuration

# Agent config location: /etc/pdagent.conf

# View current configuration
cat /etc/pdagent.conf

# Common settings:
# - pid_file: /var/run/pdagent/pdagent.pid
# - log_dir: /var/log/pdagent
# - outqueue_dir: /var/lib/pdagent/outqueue

Service Configuration Example

{
  "service": {
    "name": "Production API",
    "description": "Main production API service",
    "escalation_policy": {
      "id": "ESCALATION_POLICY_ID",
      "type": "escalation_policy_reference"
    },
    "alert_creation": "create_alerts_and_incidents",
    "incident_urgency_rule": {
      "type": "constant",
      "urgency": "high"
    },
    "auto_resolve_timeout": 14400,
    "acknowledgement_timeout": 1800
  }
}

Common Use Cases

Use Case 1: Trigger and Resolve Incident from Monitoring

# Trigger incident when issue detected
pd-send -k R0123456789ABCDEF0123456789ABCDEF \
  -t trigger \
  -d "Database connection pool exhausted" \
  -s critical \
  -i db_pool_incident_001

# Add context as incident develops
pd-send -k R0123456789ABCDEF0123456789ABCDEF \
  -t trigger \
  -d "Connection count: 500/500" \
  -i db_pool_incident_001

# Resolve when fixed
pd-send -k R0123456789ABCDEF0123456789ABCDEF \
  -t resolve \
  -i db_pool_incident_001
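
The trigger, update, and resolve calls above only line up because they share the same `-i` value. One way to keep that value stable is to derive it from the host and the check name instead of typing it by hand. This is a sketch under that assumption; `make_dedup_key` is a hypothetical helper name.

```shell
# Sketch only: make_dedup_key is a hypothetical helper.
# It builds a stable incident key from a host name and a check name so that
# repeated alerts for the same problem reuse one key.
make_dedup_key() {
  # Lowercase the input, then replace every run of characters
  # outside [a-z0-9] with a single underscore.
  printf '%s_%s' "$1" "$2" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '_'
}
```

For example, `pd-send -k "$KEY" -t trigger -d "Pool exhausted" -i "$(make_dedup_key "$(hostname)" db_pool)"` would pass the same key on every run from the same host, where `KEY` stands for the integration key.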

Use Case 2: Check Who's On-Call before Deployment

# Get current on-call engineers
pd oncall:list --json | jq -r '.oncalls[] | "\(.escalation_policy.summary): \(.user.summary)"'

# Get on-call for specific escalation policy
pd oncall:list --escalation-policy-ids EP123456 --json | jq -r '.oncalls[].user.summary'

# Check schedule for next 7 days
pd schedule:show --id SCHEDULE_ID --since $(date -u +%Y-%m-%dT%H:%M:%SZ) --until $(date -u -d '+7 days' +%Y-%m-%dT%H:%M:%SZ)
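
To turn the on-call lookup into a deployment gate, the list of names can be piped into a small check that fails when nobody is on call. This is a sketch; `require_oncall` is a hypothetical helper that only inspects the text it receives on stdin.

```shell
# Sketch only: require_oncall is a hypothetical helper.
# It reads responder names from stdin (one per line), prints the non-blank
# ones, and exits non-zero when none are left.
require_oncall() (
  # grep exits non-zero when every line is blank; that case is handled below
  names=$(grep -v '^[[:space:]]*$' || true)
  if [ -z "$names" ]; then
    echo "no on-call responder found; aborting" >&2
    exit 1
  fi
  printf '%s\n' "$names"
)
```

A deploy script could run the `pd oncall:list ... | jq -r '.oncalls[].user.summary'` pipeline shown above, append `| require_oncall || exit 1`, and stop before deploying if the list is empty.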

Use Case 3: Bulk Incident Management During Outage

# Get all triggered incidents for a service
INCIDENTS=$(pd incident:list --service-ids SERVICE_ID --status triggered --json | jq -r '.incidents[].id')

# Acknowledge all incidents
echo "$INCIDENTS" | xargs -I {} pd incident:ack --id {}

# Add note to all incidents
echo "$INCIDENTS" | xargs -I {} pd incident:notes --id {} --note "Mass outage - investigating root cause"

# Resolve all incidents after fix
echo "$INCIDENTS" | xargs -I {} pd incident:resolve --id {}
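
During an outage it can help to preview a bulk action before running it. The sketch below wraps the `pd incident:ack --id` call used above in a loop with a dry-run switch; `ack_all` and the `DRY_RUN` variable are hypothetical names introduced for illustration.

```shell
# Sketch only: ack_all and DRY_RUN are hypothetical names.
# Reads incident IDs from stdin, one per line, and skips blank lines.
# With DRY_RUN=1 the commands are printed instead of executed.
ack_all() (
  count=0
  while IFS= read -r id; do
    [ -n "$id" ] || continue
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "pd incident:ack --id $id"
    else
      # </dev/null keeps the command from consuming the remaining IDs on stdin
      pd incident:ack --id "$id" </dev/null || echo "failed to acknowledge $id" >&2
    fi
    count=$((count + 1))
  done
  echo "processed $count incident(s)" >&2
)
```

Running `echo "$INCIDENTS" | DRY_RUN=1 ack_all` first shows what would be acknowledged; dropping `DRY_RUN=1` performs the calls.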

Use Case 4: Create Incident with Conference Bridge

# Create high-priority incident with Zoom link
curl -X POST "https://api.pagerduty.com/incidents" \
  -H "Authorization: Token token=$PDTOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/vnd.pagerduty+json;version=2" \
  -H "From: oncall@example.com" \
  -d '{
    "incident": {
      "type": "incident",
      "title": "Production database outage",
      "service": {
        "id": "SERVICE_ID",
        "type": "service_reference"
      },
      "urgency": "high",
      "priority": {
        "id": "PRIORITY_P1_ID",
        "type": "priority_reference"
      },
      "body": {
        "type": "incident_body",
        "details": "Primary database cluster unresponsive"
      },
      "conference_bridge": {
        "conference_number": "https://zoom.us/j/1234567890",
        "conference_url": "https://zoom.us/j/1234567890"
      }
    }
  }'

Use Case 5: Generate Weekly Incident Report

# Get incidents from last week
LAST_WEEK=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)

pd incident:list --since $LAST_WEEK --until $NOW --json | \
  jq -r '.incidents[] | [.created_at, .urgency, .status, .title] | @csv' > weekly_incidents.csv

# Count incidents by service
pd incident:list --since $LAST_WEEK --until $NOW --json | \
  jq -r '.incidents[] | .service.summary' | sort | uniq -c | sort -rn

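# Calculate mean time to acknowledge (result in minutes).
# Incidents without acknowledgement data are skipped, and the incident's own
# created_at is used when the trigger log entry carries no timestamp.
pd incident:list --since $LAST_WEEK --until $NOW --json | \
  jq '[.incidents[] | select(.status == "resolved") |
    select((.acknowledgements // []) | length > 0) |
    ((.first_trigger_log_entry.created_at // .created_at) as $trigger |
     .acknowledgements[0].at as $ack |
     ($ack | fromdateiso8601) - ($trigger | fromdateiso8601))] |
    if length > 0 then add / length / 60 else null end'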

Best Practices

  • Use incident keys for deduplication: Always provide consistent incident keys (-i flag) to prevent duplicate alerts for the same issue
  • Set appropriate urgencies: Use high urgency for critical production issues and low for non-urgent notifications to avoid alert fatigue
  • Configure auto-resolution: Configure services with __INLINE_CODE_71_ to close incidents automatically when monitoring shows recovery
  • Implement escalation policies: Create multi-level escalation policies to ensure incidents reach someone who can respond
  • Add context to incidents: Include relevant details in incident descriptions, notes, and custom fields to speed up resolution
  • Use schedule overrides: Plan for vacations and shift changes by creating overrides instead of modifying base schedules
  • Tag and categorize incidents: Use consistent tags for incidents to improve reporting and trend analysis
  • Test integrations regularly: Send test alerts to verify that monitoring integrations are working correctly
  • Review incident analytics: Regularly analyze MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve) metrics
  • Document runbooks: Link incidents to runbooks and documentation to help responders quickly resolve common issues
  • Use status pages: Keep stakeholders informed by connecting incidents to status pages for transparent communication

Troubleshooting

Issue Solution
Authentication fails with "Invalid token" Verify token with INLINE_CODE_72. Generate new token at Configuration → API Access. Ensure token has correct permissions.
Agent not sending events Check agent status: INLINE_CODE_73. View logs: INLINE_CODE_74. Verify integration key is correct. Test connectivity: INLINE_CODE_75
Incidents not triggering Verify service is enabled: INLINE_CODE_76. Check integration key matches. Ensure service has valid escalation policy assigned.
No notifications received Check user contact methods: INLINE_CODE_77. Verify notification rules: INLINE_CODE_78. Test contact method in PagerDuty UI.
CLI returns "Service Unavailable" Check PagerDuty status at status.pagerduty.com. Verify API endpoint (use INLINE_CODE_79 for EU accounts). Check network connectivity and firewall rules.
Duplicate incidents created Use consistent incident keys with INLINE_CODE_80 flag. Configure alert grouping in service settings. Set appropriate deduplication time windows.
Schedule shows wrong on-call person Verify timezone settings in schedule configuration. Check for active overrides: INLINE_CODE_81. Ensure schedule layers are configured correctly.
API rate limit exceeded Implement exponential backoff in scripts. Use bulk operations where possible. Cache frequently accessed data. Check rate limit headers in API responses.
Events API v2 returns 400 error Validate JSON payload structure. Ensure INLINE_CODE_82 (not integration_key) is used. Check required fields: INLINE_CODE_83, INLINE_CODE_84, INLINE_CODE_85. Verify INLINE_CODE_86 is valid (trigger/acknowledge/resolve).
Cannot resolve incident Check if incident is already resolved. Verify user has permissions to resolve. Ensure incident ID is correct. Try via web UI to rule out API issues.
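
The rate-limit row above recommends exponential backoff in scripts. A minimal sketch of that idea follows; `retry_with_backoff` is a hypothetical helper, and it retries on any non-zero exit status instead of inspecting the rate-limit headers.

```shell
# Sketch only: retry_with_backoff is a hypothetical helper.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
# The delay doubles after every failed attempt.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while :; do
    # Run the command; stop as soon as it succeeds
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempt(s): $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

Wrapping a call such as `retry_with_backoff 5 1 curl --fail ...` would wait 1, 2, 4, and 8 seconds between the five attempts; `curl --fail` makes HTTP error responses count as failures.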

Quick Reference: Event Severity Levels

Severity Use Case
INLINE_CODE_87 Service outage, data loss, security breach
INLINE_CODE_88 Service degradation, failed jobs, errors affecting users
INLINE_CODE_89 Potential issues, threshold breaches, degraded performance
INLINE_CODE_90 Informational events, successful deployments, routine notifications
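
Monitoring checks often report a numeric exit status, not a severity name. The sketch below maps Nagios-style plugin exit codes onto the four Events API v2 severity values (`critical`, `error`, `warning`, `info`). The mapping itself and the name `severity_for_exit_code` are assumptions made for illustration; adjust them to the conventions of the monitoring tool in use.

```shell
# Sketch only: severity_for_exit_code is a hypothetical helper, and the
# mapping assumes Nagios-style exit codes (0 OK, 1 WARNING, 2 CRITICAL).
severity_for_exit_code() {
  case "$1" in
    0) echo info ;;
    1) echo warning ;;
    2) echo critical ;;
    *) echo error ;;   # unknown or unexpected status
  esac
}
```

The result can be passed to the `-s` flag shown earlier, for example `-s "$(severity_for_exit_code "$status")"`, where `status` holds the exit code of the check.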

Quick Reference: Incident Priorities

Priority Response Time Use Case
INLINE_CODE_91 Immediate Complete service outage, critical security incident
INLINE_CODE_92 < 30 minutes Major feature broken, significant performance degradation
INLINE_CODE_93 < 2 hours Minor feature issues, isolated customer impact
INLINE_CODE_94 < 8 hours Small bugs, cosmetic issues
P5 Next business day Enhancement requests, documentation updates