# PagerDuty Cheatsheet

## Installation

### CLI Tools Installation

| Platform | Method | Command |
|---|---|---|
| Ubuntu/Debian | Python CLI | INLINE_CODE_9 |
| Ubuntu/Debian | Node.js CLI | INLINE_CODE_10 |
| macOS | Homebrew (Python) | INLINE_CODE_11 |
| macOS | Homebrew (Node) | INLINE_CODE_12 |
| Windows | Python | INLINE_CODE_13 |
| Windows | Chocolatey | INLINE_CODE_14 |
| Any Platform | Docker | INLINE_CODE_15 |

### PagerDuty Agent Installation

| Platform | Command |
|---|---|
| Ubuntu/Debian | INLINE_CODE_16 |
| RHEL/CentOS | INLINE_CODE_17 |
| Start Agent | INLINE_CODE_18 |

## Basic Commands

### Authentication & Setup

| Command | Description |
|---|---|
| INLINE_CODE_19 | Authenticate and configure API token interactively |
| INLINE_CODE_20 | Set API token via environment variable |
| INLINE_CODE_21 | Test authentication and get current user info |
| INLINE_CODE_22 | Set default user for operations |

### Incident Management

| Command | Description |
|---|---|
| INLINE_CODE_23 | List all incidents |
| INLINE_CODE_24 | List only triggered (active) incidents |
| INLINE_CODE_25 | List acknowledged incidents |
| INLINE_CODE_26 | Get detailed information about a specific incident |
| INLINE_CODE_27 | Acknowledge an incident |
| INLINE_CODE_28 | Resolve an incident |
| INLINE_CODE_29 | Add a note to an incident |
| INLINE_CODE_30 | Reassign an incident to a different user |
| INLINE_CODE_31 | Set incident priority (P1-P5) |
| INLINE_CODE_32 | Snooze an incident for the specified number of seconds |

### Service Management

| Command | Description |
|---|---|
| INLINE_CODE_33 | List all services |
| INLINE_CODE_34 | Get service details |
| INLINE_CODE_35 | Disable a service |
| INLINE_CODE_36 | Enable a service |
| INLINE_CODE_37 | List integrations for a service |

### User & On-Call Management

| Command | Description |
|---|---|
| INLINE_CODE_38 | List all users in the account |
| INLINE_CODE_39 | Get user details |
| INLINE_CODE_40 | List current on-call users |
| INLINE_CODE_41 | List a user's contact methods |
| INLINE_CODE_42 | List a user's notification rules |

### PagerDuty Agent Commands

| Command | Description |
|---|---|
| INLINE_CODE_43 | Trigger a new incident via the agent |
| INLINE_CODE_44 | Acknowledge an incident via the agent |
| INLINE_CODE_45 | Resolve an incident via the agent |
| INLINE_CODE_46 | Check agent service status |
| INLINE_CODE_47 | View agent logs in real time |

## Advanced Usage

### Advanced Incident Operations

| Command | Description |
|---|---|
| INLINE_CODE_48 | Create an incident with full details |
| INLINE_CODE_49 | Merge multiple incidents into one |
| INLINE_CODE_50 | List incidents within a date range |
| INLINE_CODE_51 | Filter incidents by service and urgency |
| INLINE_CODE_52 | Extract incident IDs using jq |
| INLINE_CODE_53 | Bulk acknowledge all triggered incidents |

### REST API Operations (curl)

| Command | Description |
|---|---|
| INLINE_CODE_54 | List incidents via REST API |
| INLINE_CODE_55 | Create an incident via REST API |
| INLINE_CODE_56 | Get the on-call schedule via API |
| INLINE_CODE_57 | Update incident status via API |

### Advanced Agent Operations

| Command | Description |
|---|---|
| INLINE_CODE_58 | Send an alert with severity and incident key |
| INLINE_CODE_59 | Send an alert with custom fields |
| INLINE_CODE_60 | Send an Events API v2 alert |

### Schedule Management

| Command | Description |
|---|---|
| INLINE_CODE_61 | List all schedules |
| INLINE_CODE_62 | Show schedule details with on-call users |
| INLINE_CODE_63 | Create a schedule override |

### Escalation Policy Management

| Command | Description |
|---|---|
| INLINE_CODE_64 | List all escalation policies |
| INLINE_CODE_65 | Get escalation policy details |

### Analytics & Reporting

| Command | Description |
|---|---|
| INLINE_CODE_66 | Get incident analytics for a date range |
| INLINE_CODE_67 | Extract incident data for custom reporting |

## Configuration

### Environment Variables

# Set API token
export PDTOKEN="your_api_token_here"

# Set default region (for EU accounts)
export PD_API_BASE="https://api.eu.pagerduty.com"

# Set default user email
export PD_USER_EMAIL="user@example.com"
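
Scripts that call the REST API with curl can read these variables once, up front. The following is a minimal sketch, not part of any PagerDuty tool: the helper names `require_pd_token` and `pd_api_url` are hypothetical, and the fallback to `https://api.pagerduty.com` when `PD_API_BASE` is unset is an assumption based on the URLs used elsewhere in this cheatsheet.

```shell
# Hypothetical helper: fail early when the API token variable is missing.
require_pd_token() {
  if [ -z "${PDTOKEN:-}" ]; then
    echo "PDTOKEN is not set; export it before calling the API" >&2
    return 1
  fi
}

# Hypothetical helper: build a REST URL from PD_API_BASE (assumed default: US endpoint).
pd_api_url() {
  local base="${PD_API_BASE:-https://api.pagerduty.com}"
  # Drop one trailing slash from the base and one leading slash from the path.
  printf '%s/%s\n' "${base%/}" "${1#/}"
}
```

A script could then run `require_pd_token || exit 1` and pass `"$(pd_api_url incidents)"` to curl, so the same script works for US and EU accounts.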

### API Token Generation

1. Log in to the PagerDuty web interface
2. Navigate to **Configuration → API Access**
3. Click **Create New API Key**
4. Choose **User Token** or **Account Token**
5. Copy the token and store it securely

### Integration Keys

# Integration keys are service-specific
# Find them at: Service → Integrations → Integration Key

# Use in agent:
pd-send -k "your_integration_key" -t trigger -d "Alert message"

# Use in Events API v2:
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "your_integration_key",
    "event_action": "trigger",
    "payload": {
      "summary": "Server down",
      "severity": "critical",
      "source": "prod-server-01"
    }
  }'
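
A malformed payload is rejected by the Events API, so a script may want to validate its inputs before sending. The sketch below assumes the accepted `event_action` values are trigger, acknowledge, and resolve (as listed in the troubleshooting table) and that `severity` is one of critical, error, warning, or info. The function names are hypothetical.

```shell
# Hypothetical pre-flight checks for an Events API v2 payload.

# Succeeds only for the three event actions named in this cheatsheet.
valid_event_action() {
  case "$1" in
    trigger|acknowledge|resolve) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds only for the assumed set of severity values.
valid_severity() {
  case "$1" in
    critical|error|warning|info) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `valid_severity "$SEVERITY" || { echo "bad severity" >&2; exit 1; }` stops a monitoring hook before it makes a request that would fail.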

### PagerDuty Agent Configuration

# Agent config location: /etc/pdagent.conf

# View current configuration
cat /etc/pdagent.conf

# Common settings:
# - pid_file: /var/run/pdagent/pdagent.pid
# - log_dir: /var/log/pdagent
# - outqueue_dir: /var/lib/pdagent/outqueue

### Service Configuration Example

{
  "service": {
    "name": "Production API",
    "description": "Main production API service",
    "escalation_policy": {
      "id": "ESCALATION_POLICY_ID",
      "type": "escalation_policy_reference"
    },
    "alert_creation": "create_alerts_and_incidents",
    "incident_urgency_rule": {
      "type": "constant",
      "urgency": "high"
    },
    "auto_resolve_timeout": 14400,
    "acknowledgement_timeout": 1800
  }
}

## Common Use Cases

### Use Case 1: Triggering and Resolving an Incident from Monitoring

# Trigger incident when issue detected
pd-send -k R0123456789ABCDEF0123456789ABCDEF \
  -t trigger \
  -d "Database connection pool exhausted" \
  -s critical \
  -i db_pool_incident_001

# Add context as incident develops
pd-send -k R0123456789ABCDEF0123456789ABCDEF \
  -t trigger \
  -d "Connection count: 500/500" \
  -i db_pool_incident_001

# Resolve when fixed
pd-send -k R0123456789ABCDEF0123456789ABCDEF \
  -t resolve \
  -i db_pool_incident_001

### Use Case 2: Check Who Is On Call Before a Deployment

# Get current on-call engineers
pd oncall:list --json | jq -r '.oncalls[] | "\(.escalation_policy.summary): \(.user.summary)"'

# Get on-call for specific escalation policy
pd oncall:list --escalation-policy-ids EP123456 --json | jq -r '.oncalls[].user.summary'

# Check schedule for next 7 days (date -d is GNU date syntax)
pd schedule:show --id SCHEDULE_ID --since "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --until "$(date -u -d '+7 days' +%Y-%m-%dT%H:%M:%SZ)"

### Use Case 3: Bulk Incident Management During an Outage

# Get all triggered incidents for a service
INCIDENTS=$(pd incident:list --service-ids SERVICE_ID --status triggered --json | jq -r '.incidents[].id')

# Acknowledge all incidents
echo "$INCIDENTS" | xargs -I {} pd incident:ack --id {}

# Add note to all incidents
echo "$INCIDENTS" | xargs -I {} pd incident:notes --id {} --note "Mass outage - investigating root cause"

# Resolve all incidents after fix
echo "$INCIDENTS" | xargs -I {} pd incident:resolve --id {}
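
During a large outage it helps to know which individual calls failed. As an alternative to xargs, the loop below is a minimal sketch that applies a command to every ID, skips blank lines, keeps going after an error, and reports failures. The function name `for_each_incident` is hypothetical; the `pd` flags in the usage note are the ones used above.

```shell
# Apply a command to each incident ID read from stdin (one ID per line).
# Returns 1 if any invocation failed, 0 otherwise.
for_each_incident() {
  local id failed=0
  while IFS= read -r id; do
    [ -n "$id" ] || continue    # skip blank lines, e.g. an empty result set
    # Detach stdin so the command cannot consume the remaining IDs.
    "$@" "$id" </dev/null || { echo "failed: $id" >&2; failed=$((failed + 1)); }
  done
  return "$((failed > 0))"
}
```

Usage mirrors the xargs form: `echo "$INCIDENTS" | for_each_incident pd incident:ack --id`.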

### Use Case 4: Create an Incident with a Conference Bridge

# Create high-priority incident with Zoom link
curl -X POST "https://api.pagerduty.com/incidents" \
  -H "Authorization: Token token=$PDTOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/vnd.pagerduty+json;version=2" \
  -H "From: oncall@example.com" \
  -d '{
    "incident": {
      "type": "incident",
      "title": "Production database outage",
      "service": {
        "id": "SERVICE_ID",
        "type": "service_reference"
      },
      "urgency": "high",
      "priority": {
        "id": "PRIORITY_P1_ID",
        "type": "priority_reference"
      },
      "body": {
        "type": "incident_body",
        "details": "Primary database cluster unresponsive"
      },
      "conference_bridge": {
        "conference_number": "+1-415-555-1212,,,,1234#",
        "conference_url": "https://zoom.us/j/1234567890"
      }
    }
  }'

### Use Case 5: Generate a Weekly Incident Report

# Get incidents from last week (date -d is GNU date syntax)
LAST_WEEK=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)

pd incident:list --since $LAST_WEEK --until $NOW --json | \
  jq -r '.incidents[] | [.created_at, .urgency, .status, .title] | @csv' > weekly_incidents.csv

# Count incidents by service
pd incident:list --since $LAST_WEEK --until $NOW --json | \
  jq -r '.incidents[] | .service.summary' | sort | uniq -c | sort -rn

# Calculate mean time to acknowledge, in minutes, measured from created_at
# (only incidents that still carry an acknowledgement timestamp are counted)
pd incident:list --since "$LAST_WEEK" --until "$NOW" --json | \
  jq '[.incidents[]
       | select(.acknowledgements[0].at != null)
       | (.acknowledgements[0].at | fromdateiso8601) - (.created_at | fromdateiso8601)]
      | if length > 0 then add / length / 60 else null end'
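
The averaging step is plain arithmetic: sum the per-incident durations in seconds, divide by the count, then divide by 60. The sketch below does this with awk for reports built without jq. The function name `mean_minutes` is hypothetical, and the input format (one duration in seconds per line) is an assumption.

```shell
# Read durations in seconds (one per line) and print their mean in minutes.
# Prints "n/a" when there is no input, instead of dividing by zero.
mean_minutes() {
  awk 'NF { sum += $1; n++ }
       END { if (n) printf "%.1f\n", sum / n / 60; else print "n/a" }'
}
```

For two incidents acknowledged after 120 and 240 seconds, `printf '120\n240\n' | mean_minutes` prints `3.0`.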

## Best Practices

  • **Use incident keys for deduplication**: Always provide consistent incident keys (`-i` flag) to prevent duplicate alerts for the same issue
  • **Set appropriate urgency levels**: Use `high` urgency for critical production issues and `low` for non-critical alerts to avoid alert fatigue
  • **Leverage auto-resolution**: Configure services with `auto_resolve_timeout` so that incidents close automatically once they have stayed open for the configured number of seconds
  • **Implement escalation policies**: Create multi-tier escalation policies to make sure incidents reach someone who can respond
  • **Add context to incidents**: Include relevant details in incident descriptions, notes, and custom fields to speed up resolution
  • **Use schedule overrides**: Plan for vacations and schedule changes by creating overrides instead of modifying base schedules
  • **Tag and categorize incidents**: Use consistent tagging for incidents to enable better reporting and trend analysis
  • **Test integrations regularly**: Send test alerts to verify that monitoring integrations work correctly
  • **Review incident analytics**: Regularly analyze MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve) metrics
  • **Document runbooks**: Link incidents to runbooks and documentation to help responders resolve common issues
  • **Use status pages**: Keep stakeholders informed by linking incidents to status pages for transparent communication
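
Deduplication works only if every alert for the same problem carries the same key. One possible approach, sketched below, is to derive the key from stable inputs such as host and check name. The function name `make_incident_key` and the naming scheme are hypothetical.

```shell
# Build a stable incident key from a host name and a check name.
# The result is lowercased, and every character outside a-z, 0-9 and "_"
# is replaced with "-", so the same inputs always yield the same key.
make_incident_key() {
  printf '%s_%s\n' "$1" "$2" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9_\n' '-'
}
```

The result can be passed to the `-i` flag, for example `-i "$(make_incident_key "$HOST" "$CHECK")"`.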

## Troubleshooting

| Issue | Solution |
|---|---|
| Authentication fails with "Invalid token" | Verify token with INLINE_CODE_72. Generate new token at Configuration → API Access. Ensure token has correct permissions. |
| Agent not sending events | Check agent status: INLINE_CODE_73. View logs: INLINE_CODE_74. Verify integration key is correct. Test connectivity: INLINE_CODE_75 |
| Incidents not triggering | Verify service is enabled: INLINE_CODE_76. Check integration key matches. Ensure service has valid escalation policy assigned. |
| No notifications received | Check user contact methods: INLINE_CODE_77. Verify notification rules: INLINE_CODE_78. Test contact method in PagerDuty UI. |
| CLI returns "Service Unavailable" | Check PagerDuty status at status.pagerduty.com. Verify API endpoint (use INLINE_CODE_79 for EU accounts). Check network connectivity and firewall rules. |
| Duplicate incidents created | Use consistent incident keys with INLINE_CODE_80 flag. Configure alert grouping in service settings. Set appropriate deduplication time windows. |
| Schedule shows wrong on-call person | Verify timezone settings in schedule configuration. Check for active overrides: INLINE_CODE_81. Ensure schedule layers are configured correctly. |
| API rate limit exceeded | Implement exponential backoff in scripts. Use bulk operations where possible. Cache frequently accessed data. Check rate limit headers in API responses. |
| Events API v2 returns 400 error | Validate JSON payload structure. Ensure INLINE_CODE_82 (not integration_key) is used. Check required fields: INLINE_CODE_83, INLINE_CODE_84, INLINE_CODE_85. Verify INLINE_CODE_86 is valid (trigger/acknowledge/resolve). |
| Cannot resolve incident | Check if incident is already resolved. Verify user has permissions to resolve. Ensure incident ID is correct. Try via web UI to rule out API issues. |
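
The exponential backoff suggested for rate limiting can be sketched as a small wrapper that retries a command and doubles the wait after each failure. This is one possible shape, not a PagerDuty-provided utility. The function name `retry_with_backoff` is hypothetical, and it retries on any non-zero exit status, not only on rate-limit responses.

```shell
# Retry a command with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
# Waits INITIAL_DELAY, then 2x, 4x, ... between attempts.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while :; do
    "$@" && return 0                                  # success: stop retrying
    [ "$attempt" -ge "$max_attempts" ] && return 1    # out of attempts
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example (not executed here): curl --fail turns HTTP errors into a non-zero exit.
# retry_with_backoff 5 1 curl --fail -sS \
#   -H "Authorization: Token token=$PDTOKEN" "https://api.pagerduty.com/incidents"
```

With 5 attempts and an initial delay of 1 second, the waits are 1, 2, 4, and 8 seconds, so a persistent failure gives up after about 15 seconds of sleeping.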

---

## Quick Reference: Event Severity Levels

| Severity | Use Case |
|---|---|
| INLINE_CODE_87 | Service outage, data loss, security breach |
| INLINE_CODE_88 | Service degradation, failed jobs, errors affecting users |
| INLINE_CODE_89 | Potential issues, threshold breaches, degraded performance |
| INLINE_CODE_90 | Informational events, successful deployments, routine notifications |

## Quick Reference: Incident Priorities

| Priority | Response Time | Use Case |
|---|---|---|
| INLINE_CODE_91 | Immediate | Complete service outage, critical security incident |
| INLINE_CODE_92 | < 30 minutes | Major feature broken, significant performance degradation |
| INLINE_CODE_93 | < 2 hours | Minor feature issues, isolated customer impact |
| INLINE_CODE_94 | < 8 hours | Small bugs, cosmetic issues |
| P5 | Next business day | |