
Robusta Cheat Sheet

Overview

Robusta is an open-source Kubernetes troubleshooting and automation platform that enriches Prometheus alerts with diagnostic data, automates investigation workflows, and can perform remediation actions. When an alert fires, Robusta gathers the relevant context (pod logs, resource usage, event history, and related metrics) and sends an enriched notification to Slack, Teams, or other channels, so responders start triage with the diagnostic data already attached.

Beyond alert enrichment, Robusta provides a proactive monitoring layer that detects common Kubernetes issues like OOMKills, CrashLoopBackOff pods, high resource usage, and failed deployments without requiring custom alerting rules. It supports custom playbooks written in Python that can implement any investigation or remediation logic, from scaling deployments to restarting pods to creating Jira tickets. The Robusta SaaS UI provides a centralized dashboard for multi-cluster observability.

Installation

Helm Installation

# Generate Robusta configuration
pip install robusta-cli
robusta gen-config

# This creates generated_values.yaml with:
# - Slack/Teams webhook configuration
# - Cluster name and account details
# - Default playbook configuration

# Install Robusta with Helm
helm repo add robusta https://robusta-charts.storage.googleapis.com
helm repo update

helm install robusta robusta/robusta \
  --namespace robusta \
  --create-namespace \
  --values generated_values.yaml \
  --set clusterName="production-us-east"

# Install with bundled Prometheus stack
helm install robusta robusta/robusta \
  --namespace robusta \
  --create-namespace \
  --values generated_values.yaml \
  --set enablePrometheusStack=true

# Verify installation
kubectl get pods -n robusta

CLI Installation

# Install Robusta CLI
pip install robusta-cli

# Or via pipx
pipx install robusta-cli

# Verify
robusta version

# Generate initial configuration
robusta gen-config

# View runner logs (also confirms the CLI can reach the cluster)
robusta logs -n robusta

Core Commands — Alert Management

Enriched Alert Workflow

# generated_values.yaml — Alert enrichment configuration
globalConfig:
  cluster_name: production-us-east
  signing_key: "your-signing-key"
  account_id: "your-account-id"

sinksConfig:
  - slack_sink:
      name: main-slack
      slack_channel: "#k8s-alerts"
      api_key: "xoxb-slack-bot-token"

  - ms_teams_sink:
      name: teams-alerts
      webhook_url: "https://outlook.office.com/webhook/..."

  - robusta_sink:
      name: robusta-ui
      token: "your-robusta-token"

# Built-in playbooks for common alerts
customPlaybooks:
  # Enrich OOMKill alerts with memory graphs
  - triggers:
      - on_pod_oom_killed: {}
    actions:
      - pod_oom_killer_enricher: {}
      - logs_enricher:
          container_tail_lines: 100

  # Enrich CrashLoopBackOff
  - triggers:
      - on_pod_crash_loop:
          restart_reason: CrashLoopBackOff
    actions:
      - crash_loop_reporter: {}
      - logs_enricher:
          container_tail_lines: 200

  # Prometheus alert enrichment
  - triggers:
      - on_prometheus_alert:
          alert_name: HighCPUUsage
    actions:
      - cpu_graph_enricher:
          duration_minutes: 30
      - pod_enricher: {}
      - logs_enricher: {}

Deployment Tracking

# Track all deployment changes
customPlaybooks:
  - triggers:
      - on_deployment_update:
          name_prefix: ""
          namespace_prefix: ""
    actions:
      - deployment_status_enricher: {}
    sinks:
      - main-slack

  # Alert on failed deployments
  - triggers:
      - on_deployment_update:
          status: ["Failed"]
    actions:
      - deployment_status_enricher: {}
      - logs_enricher: {}
      - event_enricher: {}
    sinks:
      - main-slack
      - robusta-ui
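The empty name_prefix and namespace_prefix values in the first playbook above act as wildcards. A minimal Python sketch of that behavior, assuming plain string-prefix matching; the helper trigger_matches is hypothetical and is not part of Robusta:

```python
def trigger_matches(name: str, namespace: str,
                    name_prefix: str = "", namespace_prefix: str = "") -> bool:
    """Hypothetical helper: True when both prefixes match.

    An empty prefix matches every string, because every string starts with "".
    """
    return name.startswith(name_prefix) and namespace.startswith(namespace_prefix)


# Empty prefixes: every deployment in every namespace matches.
print(trigger_matches("checkout-api", "production"))
# A namespace prefix also matches longer namespaces that begin with it.
print(trigger_matches("checkout-api", "production-eu", namespace_prefix="production"))
# A namespace that does not start with the prefix is filtered out.
print(trigger_matches("checkout-api", "staging", namespace_prefix="production"))
```

Under this assumption, a prefix such as "production" also selects "production-eu", so prefixes that must not overlap need to be chosen with care.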

Core Commands — Built-in Actions

Investigation Actions

# Pod investigation playbook
customPlaybooks:
  # Full pod investigation on any alert
  - triggers:
      - on_prometheus_alert:
          alert_name: KubePodNotReady
    actions:
      - pod_enricher: {}
      - logs_enricher:
          container_tail_lines: 200
          previous_container: true
      - event_enricher: {}
      - pod_graph_enricher:
          resource_type: Memory
          duration_minutes: 60
      - pod_graph_enricher:
          resource_type: CPU
          duration_minutes: 60
      - node_enricher: {}

  # Node investigation
  - triggers:
      - on_prometheus_alert:
          alert_name: KubeNodeNotReady
    actions:
      - node_enricher: {}
      - node_graph_enricher:
          resource_type: CPU
          duration_minutes: 120
      - node_graph_enricher:
          resource_type: Memory
          duration_minutes: 120
      - event_enricher: {}

Remediation Actions

# Automated remediation playbooks
customPlaybooks:
  # Auto-restart pods stuck in CrashLoopBackOff
  - triggers:
      - on_pod_crash_loop:
          restart_reason: CrashLoopBackOff
          restart_count: 10
    actions:
      - logs_enricher: {}
      - delete_pod: {}

  # Auto-scale deployment on high CPU
  - triggers:
      - on_prometheus_alert:
          alert_name: HighCPUUsage
    actions:
      - cpu_graph_enricher: {}
      - horizontal_pod_autoscaler:
          max_replicas: 10
          increase_pct: 50

  # Restart deployment on memory leak
  - triggers:
      - on_prometheus_alert:
          alert_name: MemoryLeakDetected
          namespace: production
    actions:
      - pod_graph_enricher:
          resource_type: Memory
      - rollout_restart: {}

  # Cordon node on disk pressure
  - triggers:
      - on_prometheus_alert:
          alert_name: NodeDiskPressure
    actions:
      - node_enricher: {}
      - cordon_node: {}

Configuration

Sink Configuration

# Multiple notification sinks
sinksConfig:
  # Slack with routing
  - slack_sink:
      name: critical-alerts
      slack_channel: "#critical-alerts"
      api_key: "xoxb-token"
      match:
        severity: [HIGH, CRITICAL]

  - slack_sink:
      name: warning-alerts
      slack_channel: "#k8s-warnings"
      api_key: "xoxb-token"
      match:
        severity: [LOW, MEDIUM]

  # PagerDuty integration
  - pagerduty_sink:
      name: pagerduty
      api_key: "pagerduty-integration-key"
      match:
        severity: [CRITICAL]

  # Jira ticket creation
  - jira_sink:
      name: jira-tickets
      url: "https://company.atlassian.net"
      username: "service-account@company.com"
      api_key: "jira-api-token"
      project_name: "OPS"
      issue_type: "Bug"
      match:
        severity: [HIGH, CRITICAL]

  # Webhook (generic)
  - webhook_sink:
      name: custom-webhook
      url: "https://api.internal/k8s-events"
      headers:
        Authorization: "Bearer token"

  # Robusta SaaS UI
  - robusta_sink:
      name: robusta-ui
      token: "robusta-ui-token"

Playbook Filters

# Fine-grained playbook targeting
customPlaybooks:
  # Only for specific namespaces
  - triggers:
      - on_pod_oom_killed:
          namespace_prefix: "production"
    actions:
      - pod_oom_killer_enricher: {}
    sinks:
      - critical-alerts

  # Exclude system namespaces
  - triggers:
      - on_pod_crash_loop:
          exclude_namespace: "kube-system"
    actions:
      - crash_loop_reporter: {}

  # Label-based targeting
  - triggers:
      - on_prometheus_alert:
          alert_name: HighLatency
          pod_label_selector: "tier=frontend"
    actions:
      - logs_enricher: {}
      - pod_graph_enricher: {}

  # Time-based (silence during maintenance)
  - triggers:
      - on_prometheus_alert:
          alert_name: ".*"
    actions:
      - silence_alert:
          duration: 4h
    when:
      - schedule:
          start: "02:00"
          end: "06:00"
          timezone: "UTC"
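The maintenance window in the last playbook is defined by a start time, an end time, and a timezone. A minimal sketch of the window check, using only the Python standard library; the function in_window is hypothetical and assumes a half-open interval (start included, end excluded) evaluated in UTC:

```python
from datetime import datetime, time, timezone


def in_window(now: datetime, start: str, end: str) -> bool:
    """Return True if `now` falls inside [start, end) in UTC.

    `start` and `end` are "HH:MM" strings. Windows that cross midnight
    (for example 22:00 to 02:00) are handled by the second branch.
    """
    current = now.astimezone(timezone.utc).time()
    begin = time.fromisoformat(start)
    finish = time.fromisoformat(end)
    if begin <= finish:
        return begin <= current < finish
    return current >= begin or current < finish


inside = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
outside = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
print(in_window(inside, "02:00", "06:00"))   # 03:00 is inside the window
print(in_window(outside, "02:00", "06:00"))  # 06:00 is the excluded end
```

Running a check like this against planned maintenance times is a quick way to confirm the window covers what you expect before silencing every alert with the ".*" pattern.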

Advanced Usage

Custom Python Playbooks

# custom_playbooks/investigate_service.py
from robusta.api import *


@action
def investigate_service(event: PrometheusKubernetesAlert):
    """Custom investigation playbook for service alerts."""
    pod = event.get_pod()
    if not pod:
        return  # the alert is not tied to a pod, so there is nothing to inspect

    # aggregation_key groups related findings; the value here is an example
    finding = Finding(
        title=f"Service Investigation: {pod.metadata.name}",
        aggregation_key="investigate_service",
        source=FindingSource.PROMETHEUS,
        severity=FindingSeverity.HIGH,
    )

    # Basic pod context (Kubernetes model fields are camelCase, e.g. nodeName)
    finding.add_enrichment([
        MarkdownBlock(f"**Pod:** {pod.metadata.name}"),
        MarkdownBlock(f"**Namespace:** {pod.metadata.namespace}"),
        MarkdownBlock(f"**Node:** {pod.spec.nodeName}"),
    ])

    # Check resource limits for every container
    for container in pod.spec.containers:
        if container.resources and container.resources.limits:
            cpu_limit = container.resources.limits.get("cpu", "not set")
            mem_limit = container.resources.limits.get("memory", "not set")
            finding.add_enrichment([
                MarkdownBlock(f"**{container.name}** - CPU: {cpu_limit}, Memory: {mem_limit}"),
            ])

    # Attach the last 50 log lines of the first container
    logs = pod.get_logs(container=pod.spec.containers[0].name, tail_lines=50)
    if logs:
        finding.add_enrichment([
            FileBlock("pod-logs.txt", logs.encode()),
        ])

    event.add_finding(finding)

# Register custom playbook (generated_values.yaml)
customPlaybooks:
  - triggers:
      - on_prometheus_alert:
          alert_name: ServiceDegraded
    actions:
      - investigate_service: {}

Robusta CLI Operations

# Trigger a playbook manually
robusta playbooks trigger pod_enricher name=my-pod namespace=production

# List running playbooks
robusta playbooks list

# View Robusta runner logs
robusta logs

# Test Slack connectivity
robusta demo

# Generate test alert to verify pipeline
robusta demo --alert-name TestAlert

# Upgrade Robusta
helm upgrade robusta robusta/robusta \
  --namespace robusta \
  --values generated_values.yaml

Multi-Cluster Setup

# Each cluster gets its own Robusta installation
# All report to the same Robusta SaaS UI

# Cluster 1 values (its own values file)
globalConfig:
  cluster_name: "us-east-production"
  account_id: "shared-account-id"

# Cluster 2 values (a separate values file, same account_id)
globalConfig:
  cluster_name: "eu-west-production"
  account_id: "shared-account-id"
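When the same values file is reused across clusters, only the cluster name and the kubectl context change per install. The sketch below builds one helm command per cluster from flags that appear elsewhere in this cheat sheet plus Helm's --kube-context option. The context names and the helper helm_command are hypothetical, and the script only prints the commands rather than running them:

```python
import shlex

# Hypothetical mapping of kubectl context name -> Robusta cluster name.
CLUSTERS = {
    "ctx-us-east": "us-east-production",
    "ctx-eu-west": "eu-west-production",
}


def helm_command(context: str, cluster_name: str,
                 values_file: str = "generated_values.yaml") -> str:
    """Build (but do not execute) the install/upgrade command for one cluster."""
    args = [
        "helm", "upgrade", "--install", "robusta", "robusta/robusta",
        "--kube-context", context,
        "--namespace", "robusta",
        "--create-namespace",
        "--values", values_file,
        "--set", f"clusterName={cluster_name}",
    ]
    return shlex.join(args)  # quotes arguments safely for copy/paste


for context, cluster_name in CLUSTERS.items():
    print(helm_command(context, cluster_name))
```

Printing the commands first makes it easy to review them before running anything against production clusters.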

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| No alerts in Slack | Slack token or channel wrong | Verify the bot token has the chat:write scope and that the channel exists |
| Playbook not triggering | Trigger conditions don't match | Check alert name regex and namespace filters |
| OOM enrichment missing graphs | Prometheus not accessible | Verify Robusta can reach the Prometheus URL |
| Custom playbook error | Python syntax or import issue | Check runner logs: kubectl logs -n robusta -l app=robusta-runner |
| High memory usage | Too many playbooks or large log collection | Reduce container_tail_lines and limit active playbooks |
| Duplicate notifications | Multiple sinks matching same severity | Use match filters to route alerts to specific sinks |
| Deployment tracking missed | Webhook not catching all events | Ensure Robusta has watch permissions on deployments |
| Jira tickets not created | API credentials incorrect | Verify Jira URL, username, and API token |

# Debug: check Robusta runner logs
kubectl logs -n robusta -l app=robusta-runner --tail=100

# Debug: check Robusta forwarder
kubectl logs -n robusta -l app=robusta-forwarder --tail=50

# Verify Prometheus connectivity from Robusta
kubectl exec -n robusta deploy/robusta-runner -- \
  curl -s "http://prometheus-server.monitoring:9090/api/v1/query?query=up"

# Test playbook execution
robusta playbooks trigger event_enricher

# Check playbook configuration
kubectl get configmap -n robusta robusta-playbooks -o yaml

# Restart Robusta components
kubectl rollout restart deployment -n robusta robusta-runner
kubectl rollout restart deployment -n robusta robusta-forwarder
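The curl check above returns JSON in the Prometheus HTTP API format for instant queries. A small sketch for reading that output, assuming the standard response shape (status, data.result, and each sample's metric and value fields); the function summarize_up and the sample payload are illustrative only:

```python
import json


def summarize_up(body: str) -> dict:
    """Split the targets in an `up` query response into up and down lists."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    summary = {"up": [], "down": []}
    for sample in payload["data"]["result"]:
        target = sample["metric"].get("instance", "<unknown>")
        _timestamp, value = sample["value"]  # value arrives as a string
        summary["up" if value == "1" else "down"].append(target)
    return summary


# Illustrative payload; real output comes from the curl command above.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "10.0.0.1:9100"}, "value": [1700000000, "1"]},
            {"metric": {"__name__": "up", "instance": "10.0.0.2:9100"}, "value": [1700000000, "0"]},
        ],
    },
})

print(summarize_up(SAMPLE))
```

An empty result list with a "success" status means Prometheus is reachable but has no scrape targets, which points to a different problem than a connection error.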