# Set API token
export PDTOKEN="your_api_token_here"
# Set default region (for EU accounts)
export PD_API_BASE="https://api.eu.pagerduty.com"
# Set default user email
export PD_USER_EMAIL="user@example.com"
# Integration keys are service-specific
# Find them at: Service → Integrations → Integration Key
# Use in agent:
pd-send -k "your_integration_key" -t trigger -d "Alert message"
# Use in Events API v2:
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "your_integration_key",
    "event_action": "trigger",
    "payload": {
      "summary": "Server down",
      "severity": "critical",
      "source": "prod-server-01"
    }
  }'
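Hand-written JSON breaks as soon as a summary contains a quote or backslash. The following is a minimal sketch, not an official client: `json_escape`, `pd_event_json`, and `send_event` are hypothetical helper names, and the escaping covers only backslashes and double quotes (not newlines or control characters). It also sets `dedup_key`, the Events API v2 field that lets a later `resolve` event match the original `trigger`.

```shell
#!/usr/bin/env bash
# Escape backslashes and double quotes so a value can sit inside a JSON string.
# Assumption: values contain no newlines or other control characters.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  printf '%s' "$s"
}

# Build an Events API v2 body.
# Args: ROUTING_KEY EVENT_ACTION DEDUP_KEY SUMMARY SEVERITY SOURCE
pd_event_json() {
  printf '{"routing_key":"%s","event_action":"%s","dedup_key":"%s","payload":{"summary":"%s","severity":"%s","source":"%s"}}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")" \
    "$(json_escape "$4")" "$(json_escape "$5")" "$(json_escape "$6")"
}

# Send the event; curl reads the body from stdin via "-d @-".
send_event() {
  pd_event_json "$@" | curl -sS -X POST https://events.pagerduty.com/v2/enqueue \
    -H "Content-Type: application/json" -d @-
}

# Usage (not executed here):
#   send_event "your_integration_key" trigger db_pool_001 "Server down" critical prod-server-01
#   send_event "your_integration_key" resolve db_pool_001 "Server down" critical prod-server-01
```

Where `jq` is available, building the body with `jq -n --arg ...` is the more robust choice, since it handles every character that needs escaping.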
# Agent config location: /etc/pdagent.conf
# View current configuration
cat /etc/pdagent.conf
# Common settings:
# - pid_file: /var/run/pdagent/pdagent.pid
# - log_dir: /var/log/pdagent
# - outqueue_dir: /var/lib/pdagent/outqueue
Service configuration example
{
  "service": {
    "name": "Production API",
    "description": "Main production API service",
    "escalation_policy": {
      "id": "ESCALATION_POLICY_ID",
      "type": "escalation_policy_reference"
    },
    "alert_creation": "create_alerts_and_incidents",
    "incident_urgency_rule": {
      "type": "constant",
      "urgency": "high"
    },
    "auto_resolve_timeout": 14400,
    "acknowledgement_timeout": 1800
  }
}
Use Case 1: Trigger and Resolve Incident from Monitoring
# Trigger incident when issue detected
pd-send -k R0123456789ABCDEF0123456789ABCDEF \
  -t trigger \
  -d "Database connection pool exhausted" \
  -s critical \
  -i db_pool_incident_001
# Add context as incident develops
pd-send -k R0123456789ABCDEF0123456789ABCDEF \
  -t trigger \
  -d "Connection count: 500/500" \
  -i db_pool_incident_001
# Resolve when fixed
pd-send -k R0123456789ABCDEF0123456789ABCDEF \
  -t resolve \
  -i db_pool_incident_001
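The trigger, update, and resolve calls above only line up because all three pass the same `-i` key. In a monitoring script it can help to derive that key from the host and check name instead of typing it by hand. This is a sketch under assumptions: `incident_key`, `trigger_incident`, `resolve_incident`, and the `PD_INTEGRATION_KEY` variable are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Derive a stable incident key from a host and a check name:
# lowercase everything, then replace any character outside [a-z0-9_] with "_".
incident_key() {
  printf '%s_%s' "$1" "$2" \
    | LC_ALL=C tr '[:upper:]' '[:lower:]' \
    | LC_ALL=C tr -c 'a-z0-9_' '_'
}

# Args: HOST CHECK DESCRIPTION
# PD_INTEGRATION_KEY is an assumed environment variable holding the service key.
trigger_incident() {
  pd-send -k "$PD_INTEGRATION_KEY" -t trigger -d "$3" -i "$(incident_key "$1" "$2")"
}

# Args: HOST CHECK
resolve_incident() {
  pd-send -k "$PD_INTEGRATION_KEY" -t resolve -i "$(incident_key "$1" "$2")"
}

# Usage (not executed here):
#   trigger_incident db-01 "Pool Exhausted" "Database connection pool exhausted"
#   resolve_incident db-01 "Pool Exhausted"
```

Because the key depends only on the host and check name, repeated alerts for the same problem map to one incident, and the resolve call needs no stored state.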
Use Case 2: Check Who's On-Call before Deployment
# Get current on-call engineers
pd oncall:list --json | jq -r '.oncalls[] | "\(.escalation_policy.summary): \(.user.summary)"'
# Get on-call for specific escalation policy
pd oncall:list --escalation-policy-ids EP123456 --json | jq -r '.oncalls[].user.summary'
# Check schedule for next 7 days
pd schedule:show --id SCHEDULE_ID --since $(date -u +%Y-%m-%dT%H:%M:%SZ) --until $(date -u -d '+7 days' +%Y-%m-%dT%H:%M:%SZ)
Use Case 3: Bulk Incident Management During Outage
# Get all triggered incidents for a service
INCIDENTS=$(pd incident:list --service-ids SERVICE_ID --status triggered --json | jq -r '.incidents[].id')
# Acknowledge all incidents
echo "$INCIDENTS" | xargs -I {} pd incident:ack --id {}
# Add note to all incidents
echo "$INCIDENTS" | xargs -I {} pd incident:notes --id {} --note "Mass outage - investigating root cause"
# Resolve all incidents after fix
echo "$INCIDENTS" | xargs -I {} pd incident:resolve --id {}
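During an outage it is useful to know which incidents a bulk command failed on, which the `xargs` pipelines above do not report. The loop below is one possible alternative, not part of the `pd` CLI: `for_each_id` is a hypothetical helper that runs a command once per ID read from stdin, skips blank lines, and reports each failure on stderr.

```shell
#!/usr/bin/env bash
# Run "$@ ID" for every non-empty line on stdin.
# Prints "failed: ID" to stderr for each failing call and
# returns non-zero if any call failed.
for_each_id() {
  local id failed=0
  while read -r id || [ -n "$id" ]; do
    [ -n "$id" ] || continue
    # </dev/null keeps the command from consuming the remaining IDs on stdin.
    "$@" "$id" </dev/null || {
      echo "failed: $id" >&2
      failed=$(( failed + 1 ))
    }
  done
  [ "$failed" -eq 0 ]
}

# Usage (not executed here):
#   echo "$INCIDENTS" | for_each_id pd incident:ack --id
#   echo "$INCIDENTS" | for_each_id pd incident:resolve --id
```

The ID is appended as the last argument, so this form only suits commands where the ID can come last.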
Use Case 4: Create Incident with Conference Bridge
# Get incidents from last week
LAST_WEEK=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
pd incident:list --since $LAST_WEEK --until $NOW --json | \
  jq -r '.incidents[] | [.created_at, .urgency, .status, .title] | @csv' > weekly_incidents.csv
# Count incidents by service
pd incident:list --since $LAST_WEEK --until $NOW --json | \
  jq -r '.incidents[] | .service.summary' | sort | uniq -c | sort -rn
# Calculate mean time to acknowledge
pd incident:list --since $LAST_WEEK --until $NOW --json | \
  jq '[.incidents[] | select(.status == "resolved") | (.first_trigger_log_entry.created_at as $trigger | .acknowledgements[0].at as $ack | ($ack | fromdateiso8601) - ($trigger | fromdateiso8601))] | add / length / 60'
# Result in minutes
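The mean-time-to-acknowledge filter above fails if an incident was resolved without ever being acknowledged, or if no incidents match. A more defensive variant can do the arithmetic in the shell. This is a sketch assuming GNU `date` (already required by the `date -d` calls above); `mean_tta_minutes` is a hypothetical helper, and the field names in the usage comment are taken from the filter above, not verified against the API.

```shell
#!/usr/bin/env bash
# Read "TRIGGER_ISO ACK_ISO" pairs (whitespace- or tab-separated) from stdin
# and print the mean time to acknowledge in whole minutes.
# Lines with a missing acknowledgement timestamp are skipped.
# Prints "n/a" when no usable pair was found.
mean_tta_minutes() {
  local total=0 count=0 trigger ack
  while read -r trigger ack; do
    [ -n "$trigger" ] && [ -n "$ack" ] || continue
    total=$(( total + $(date -u -d "$ack" +%s) - $(date -u -d "$trigger" +%s) ))
    count=$(( count + 1 ))
  done
  if [ "$count" -eq 0 ]; then
    echo "n/a"
  else
    echo $(( total / count / 60 ))
  fi
}

# Usage (not executed here):
#   pd incident:list --since $LAST_WEEK --until $NOW --json | \
#     jq -r '.incidents[] | [.first_trigger_log_entry.created_at, (.acknowledgements[0].at // "")] | @tsv' | \
#     mean_tta_minutes
```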
Use incident keys for deduplication: Always provide consistent incident keys (-i flag) to prevent duplicate alerts for the same problem
Set appropriate urgencies: Use high urgency for critical production issues and low for non-urgent notifications to avoid alert fatigue
Configure auto-resolution: Configure services with INLINE_CODE_71 to close incidents automatically when monitoring shows recovery
Implement escalation policies: Create multi-level escalation policies to ensure incidents reach someone who can respond
Add context to incidents: Include relevant details in incident descriptions, notes, and custom fields to speed up resolution
Use schedule overrides: Plan for vacations and shift changes by creating overrides instead of modifying base schedules
Tag and categorize incidents: Use consistent tags on incidents to improve reporting and trend analysis
Test integrations regularly: Send test alerts to verify that monitoring integrations are working correctly
Review incident analytics: Regularly analyze MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve) metrics
Document runbooks: Link incidents to runbooks and documentation to help responders quickly resolve common issues
Use status pages: Keep stakeholders informed by connecting incidents to status pages for transparent communication
Verify token with INLINE_CODE_72. Generate new token at Configuration → API Access. Ensure token has correct permissions.
Agent not sending events
Check agent status: INLINE_CODE_73. View logs: INLINE_CODE_74. Verify integration key is correct. Test connectivity: INLINE_CODE_75
Incidents not triggering
Verify service is enabled: INLINE_CODE_76. Check integration key matches. Ensure service has valid escalation policy assigned.
No notifications received
Check user contact methods: INLINE_CODE_77. Verify notification rules: INLINE_CODE_78. Test contact method in PagerDuty UI.
CLI returns "Service Unavailable"
Check PagerDuty status at status.pagerduty.com. Verify API endpoint (use INLINE_CODE_79 for EU accounts). Check network connectivity and firewall rules.
Duplicate incidents created
Use consistent incident keys with INLINE_CODE_80 flag. Configure alert grouping in service settings. Set appropriate deduplication time windows.
Schedule shows wrong on-call person
Verify timezone settings in schedule configuration. Check for active overrides: INLINE_CODE_81. Ensure schedule layers are configured correctly.
API rate limit exceeded
Implement exponential backoff in scripts. Use bulk operations where possible. Cache frequently accessed data. Check rate limit headers in API responses.
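The exponential backoff suggested above can be sketched as a small wrapper. `retry_with_backoff` is a hypothetical helper, not part of the `pd` CLI. It retries on any non-zero exit status, so it assumes the wrapped command exits non-zero when it is rate limited, and it does not distinguish rate limiting from other failures.

```shell
#!/usr/bin/env bash
# Retry a command, doubling the wait between attempts.
# Args: MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the command succeeds, 1 once attempts are exhausted.
retry_with_backoff() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Usage (not executed here): up to 5 attempts, waiting 1s, 2s, 4s, 8s between them.
#   retry_with_backoff 5 1 pd incident:list --json
```

Adding a small random jitter to each delay is a common refinement when many scripts retry at once.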