Robusta Cheat Sheet
Overview
Robusta is an open-source Kubernetes troubleshooting and automation platform that enriches Prometheus alerts with diagnostic data, automates investigation workflows, and can perform remediation actions. When an alert fires, Robusta gathers the relevant context (pod logs, resource usage, event history, and related metrics) and sends an enriched notification to Slack, Teams, or other channels, so responders start triage with the evidence already attached instead of collecting it by hand.
Beyond alert enrichment, Robusta provides a proactive monitoring layer that detects common Kubernetes issues like OOMKills, CrashLoopBackOff pods, high resource usage, and failed deployments without requiring custom alerting rules. It supports custom playbooks written in Python that can implement any investigation or remediation logic, from scaling deployments to restarting pods to creating Jira tickets. The Robusta SaaS UI provides a centralized dashboard for multi-cluster observability.
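As a rough mental model of the flow described above, the toy Python sketch below wires a trigger to a list of enrichment actions and fans the result out to sinks. Every name in it (`Alert`, `Playbook`, `run_playbooks`, `add_logs`, `add_events`) is hypothetical; it only illustrates the trigger, actions, sinks shape and is not Robusta's implementation or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    """Hypothetical stand-in for a fired Prometheus alert."""
    name: str
    pod: str
    enrichments: List[str] = field(default_factory=list)


# An "action" here is just a function that appends context to the alert.
Action = Callable[[Alert], None]


def add_logs(alert: Alert) -> None:
    alert.enrichments.append(f"logs for {alert.pod}")


def add_events(alert: Alert) -> None:
    alert.enrichments.append(f"events for {alert.pod}")


@dataclass
class Playbook:
    alert_name: str        # trigger: which alert this playbook reacts to
    actions: List[Action]  # enrichment steps, run in order
    sinks: List[str]       # names of destinations to notify


def run_playbooks(alert: Alert, playbooks: List[Playbook]) -> Dict[str, List[str]]:
    """Run every playbook whose trigger matches; return sink name -> enrichments."""
    delivered: Dict[str, List[str]] = {}
    for playbook in playbooks:
        if playbook.alert_name != alert.name:
            continue
        for action in playbook.actions:
            action(alert)
        for sink in playbook.sinks:
            delivered[sink] = list(alert.enrichments)
    return delivered
```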
Installation
Helm Installation
# Generate Robusta configuration
pip install robusta-cli
robusta gen-config
# This creates generated_values.yaml with:
# - Slack/Teams webhook configuration
# - Cluster name and account details
# - Default playbook configuration
# Install Robusta with Helm
helm repo add robusta https://robusta-charts.storage.googleapis.com
helm repo update
helm install robusta robusta/robusta \
--namespace robusta \
--create-namespace \
--values generated_values.yaml \
--set clusterName="production-us-east"
# Alternative: install with the bundled Prometheus stack (run this instead of the command above, not in addition to it)
helm install robusta robusta/robusta \
--namespace robusta \
--create-namespace \
--values generated_values.yaml \
--set enablePrometheusStack=true
# Verify installation
kubectl get pods -n robusta
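To script the verification step, one option is to parse the JSON form of the same command (`kubectl get pods -n robusta -o json`). The helper below is a minimal sketch: `pods_not_ready` is a hypothetical name, and it only inspects the standard `status.phase` and `status.containerStatuses[].ready` fields of each Pod object.

```python
import json
from typing import List


def pods_not_ready(pod_list_json: str) -> List[str]:
    """Return names of pods that are not Running with all containers ready.

    Expects the output of: kubectl get pods -n robusta -o json
    """
    not_ready = []
    for pod in json.loads(pod_list_json).get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if status.get("phase") != "Running" or not all_ready:
            not_ready.append(pod["metadata"]["name"])
    return not_ready
```

An empty list suggests the runner and forwarder pods came up cleanly; any returned name is a pod worth inspecting with `kubectl describe`.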
CLI Installation
# Install Robusta CLI
pip install robusta-cli
# Or via pipx
pipx install robusta-cli
# Verify
robusta version
# Generate initial configuration
robusta gen-config
# Fetch runner logs (also confirms the CLI can reach the cluster)
robusta logs -n robusta
Core Commands — Alert Management
Enriched Alert Workflow
# generated_values.yaml: alert enrichment configuration
clusterName: production-us-east
globalConfig:
  signing_key: "your-signing-key"
  account_id: "your-account-id"
sinksConfig:
  - slack_sink:
      name: main-slack
      slack_channel: "#k8s-alerts"
      api_key: "xoxb-slack-bot-token"
  - ms_teams_sink:
      name: teams-alerts
      webhook_url: "https://outlook.office.com/webhook/..."
  - robusta_sink:
      name: robusta-ui
      token: "your-robusta-token"
# Built-in playbooks for common alerts
customPlaybooks:
  # Enrich OOMKill alerts with memory graphs
  - triggers:
      - on_pod_oom_killed: {}
    actions:
      - pod_oom_killer_enricher: {}
      - logs_enricher:
          container_tail_lines: 100
  # Enrich CrashLoopBackOff
  - triggers:
      - on_pod_crash_loop:
          restart_reason: CrashLoopBackOff
    actions:
      - report_crash_loop: {}
      - logs_enricher:
          container_tail_lines: 200
  # Prometheus alert enrichment
  - triggers:
      - on_prometheus_alert:
          alert_name: HighCPUUsage
    actions:
      - cpu_graph_enricher:
          duration_minutes: 30
      - pod_enricher: {}
      - logs_enricher: {}
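A quick structural check before `helm upgrade` can catch empty playbooks and sink-name typos. The sketch below assumes the values file has already been parsed into a plain dict (for example with PyYAML's `yaml.safe_load`); `validate_values` is a hypothetical helper, and it checks only shape, not whether trigger or action names exist in Robusta.

```python
from typing import Any, Dict, List


def validate_values(values: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in a parsed values file."""
    problems: List[str] = []
    # Each sinksConfig entry is a single-key mapping: {sink_type: {name: ...}}.
    sink_names = {
        settings.get("name")
        for sink in values.get("sinksConfig", [])
        for settings in sink.values()
    }
    for index, playbook in enumerate(values.get("customPlaybooks", [])):
        if not playbook.get("triggers"):
            problems.append(f"playbook {index}: no triggers")
        if not playbook.get("actions"):
            problems.append(f"playbook {index}: no actions")
        for sink in playbook.get("sinks", []):
            if sink not in sink_names:
                problems.append(f"playbook {index}: unknown sink '{sink}'")
    return problems
```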
Deployment Tracking
# Track all deployment changes
customPlaybooks:
  - triggers:
      - on_deployment_update:
          name_prefix: ""
          namespace_prefix: ""
    actions:
      - deployment_status_enricher: {}
    sinks:
      - main-slack
  # Alert on failed deployments
  - triggers:
      - on_deployment_update:
          status: ["Failed"]
    actions:
      - deployment_status_enricher: {}
      - logs_enricher: {}
      - event_enricher: {}
    sinks:
      - main-slack
      - robusta-ui
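The empty `name_prefix` and `namespace_prefix` values above act as "match everything" filters. The sketch below shows why, assuming the filters use plain starts-with semantics; `matches_prefix_filters` is a hypothetical function, not part of Robusta.

```python
def matches_prefix_filters(
    name: str,
    namespace: str,
    name_prefix: str = "",
    namespace_prefix: str = "",
) -> bool:
    """True when the resource name and namespace start with the given prefixes.

    An empty prefix matches everything, because every string starts with "".
    """
    return name.startswith(name_prefix) and namespace.startswith(namespace_prefix)
```

Note that under this model a prefix of `production` would also match a namespace such as `production-eu`.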
Core Commands — Built-in Actions
Investigation Actions
# Pod investigation playbook
customPlaybooks:
  # Full pod investigation on any alert
  - triggers:
      - on_prometheus_alert:
          alert_name: KubePodNotReady
    actions:
      - pod_enricher: {}
      - logs_enricher:
          container_tail_lines: 200
          previous_container: true
      - event_enricher: {}
      - pod_graph_enricher:
          resource_type: Memory
          duration_minutes: 60
      - pod_graph_enricher:
          resource_type: CPU
          duration_minutes: 60
      - node_enricher: {}
  # Node investigation
  - triggers:
      - on_prometheus_alert:
          alert_name: KubeNodeNotReady
    actions:
      - node_enricher: {}
      - node_graph_enricher:
          resource_type: CPU
          duration_minutes: 120
      - node_graph_enricher:
          resource_type: Memory
          duration_minutes: 120
      - event_enricher: {}
Remediation Actions
# Automated remediation playbooks
customPlaybooks:
  # Auto-restart pods stuck in CrashLoopBackOff
  - triggers:
      - on_pod_crash_loop:
          restart_reason: CrashLoopBackOff
          restart_count: 10
    actions:
      - logs_enricher: {}
      - delete_pod: {}
  # Auto-scale deployment on high CPU
  - triggers:
      - on_prometheus_alert:
          alert_name: HighCPUUsage
    actions:
      - cpu_graph_enricher: {}
      - horizontal_pod_autoscaler:
          max_replicas: 10
          increase_pct: 50
  # Restart deployment on memory leak
  - triggers:
      - on_prometheus_alert:
          alert_name: MemoryLeakDetected
          namespace: production
    actions:
      - pod_graph_enricher:
          resource_type: Memory
      - rollout_restart: {}
  # Cordon node on disk pressure
  - triggers:
      - on_prometheus_alert:
          alert_name: NodeDiskPressure
    actions:
      - node_enricher: {}
      - cordon_node: {}
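To reason about what `increase_pct: 50` with `max_replicas: 10` would do to a workload, the arithmetic can be sketched as below. The round-up and capping rules are assumptions made for illustration, not documented Robusta behavior, and `next_replica_count` is a hypothetical name.

```python
def next_replica_count(current: int, increase_pct: int, max_replicas: int) -> int:
    """Grow `current` by `increase_pct` percent, rounding up, capped at `max_replicas`."""
    # Integer ceiling division avoids floating-point rounding surprises.
    grown = (current * (100 + increase_pct) + 99) // 100
    return min(grown, max_replicas)
```

Under these assumptions, 4 replicas grow to 6, and 7 replicas hit the cap at 10 rather than growing to 11.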
Configuration
Sink Configuration
# Multiple notification sinks
sinksConfig:
  # Slack with routing
  - slack_sink:
      name: critical-alerts
      slack_channel: "#critical-alerts"
      api_key: "xoxb-token"
      match:
        severity: [HIGH, CRITICAL]
  - slack_sink:
      name: warning-alerts
      slack_channel: "#k8s-warnings"
      api_key: "xoxb-token"
      match:
        severity: [LOW, MEDIUM]
  # PagerDuty integration
  - pagerduty_sink:
      name: pagerduty
      api_key: "pagerduty-integration-key"
      match:
        severity: [CRITICAL]
  # Jira ticket creation
  - jira_sink:
      name: jira-tickets
      url: "https://company.atlassian.net"
      username: "service-account@company.com"
      api_key: "jira-api-token"
      project_name: "OPS"
      issue_type: "Bug"
      match:
        severity: [HIGH, CRITICAL]
  # Webhook (generic)
  - webhook_sink:
      name: custom-webhook
      url: "https://api.internal/k8s-events"
      headers:
        Authorization: "Bearer token"
  # Robusta SaaS UI
  - robusta_sink:
      name: robusta-ui
      token: "robusta-ui-token"
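With several sinks and overlapping `match` filters, it helps to work out which sinks a given severity would reach. The sketch below models that over a parsed `sinksConfig` list, assuming a sink without a `match.severity` list receives everything; `sinks_for_severity` is a hypothetical helper and the matching rule is a simplification.

```python
from typing import Any, Dict, List


def sinks_for_severity(sinks_config: List[Dict[str, Any]], severity: str) -> List[str]:
    """Return names of sinks that would receive a finding of this severity.

    A sink without a `match.severity` list is treated as matching everything.
    """
    selected = []
    for sink in sinks_config:
        for settings in sink.values():  # single key: the sink type
            allowed = settings.get("match", {}).get("severity")
            if allowed is None or severity in allowed:
                selected.append(settings["name"])
    return selected
```

Under this model, a CRITICAL finding in the configuration above lands in every sink except `warning-alerts`, which is one way duplicate notifications arise.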
Playbook Filters
# Fine-grained playbook targeting
customPlaybooks:
  # Only for specific namespaces
  - triggers:
      - on_pod_oom_killed:
          namespace_prefix: "production"
    actions:
      - pod_oom_killer_enricher: {}
    sinks:
      - critical-alerts
  # Exclude system namespaces
  - triggers:
      - on_pod_crash_loop:
          exclude_namespace: "kube-system"
    actions:
      - report_crash_loop: {}
  # Label-based targeting
  - triggers:
      - on_prometheus_alert:
          alert_name: HighLatency
          pod_label_selector: "tier=frontend"
    actions:
      - logs_enricher: {}
      - pod_graph_enricher: {}
  # Time-based (silence during maintenance)
  - triggers:
      - on_prometheus_alert:
          alert_name: ".*"
    actions:
      - silence_alert:
          duration: 4h
    when:
      - schedule:
          start: "02:00"
          end: "06:00"
          timezone: "UTC"
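The maintenance window above (02:00 to 06:00 UTC) can be sanity-checked with a small helper. This is a sketch of one reasonable interpretation (start inclusive, end exclusive, windows may cross midnight); `in_window` is a hypothetical function and says nothing about how Robusta evaluates schedules.

```python
from datetime import datetime, time, timezone


def in_window(now: datetime, start: str, end: str) -> bool:
    """True if timezone-aware `now` falls in [start, end), given as UTC "HH:MM".

    Windows that cross midnight (for example 22:00 to 02:00) are supported.
    """
    current = now.astimezone(timezone.utc).time()
    begin = time.fromisoformat(start)
    finish = time.fromisoformat(end)
    if begin <= finish:
        return begin <= current < finish
    # Window wraps past midnight: match late evening or early morning.
    return current >= begin or current < finish
```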
Advanced Usage
Custom Python Playbooks
# custom_playbooks/investigate_service.py
from robusta.api import *


@action
def investigate_service(event: PrometheusKubernetesAlert):
    """Custom investigation playbook for service alerts."""
    pod = event.get_pod()
    if not pod:
        return
    # Create a finding to hold the enrichments
    finding = Finding(
        title=f"Service Investigation: {pod.metadata.name}",
        aggregation_key="ServiceInvestigation",
        source=FindingSource.PROMETHEUS,
        severity=FindingSeverity.HIGH,
    )
    # Add basic pod identity (Kubernetes model fields are camelCase)
    finding.add_enrichment([
        MarkdownBlock(f"**Pod:** {pod.metadata.name}"),
        MarkdownBlock(f"**Namespace:** {pod.metadata.namespace}"),
        MarkdownBlock(f"**Node:** {pod.spec.nodeName}"),
    ])
    # Report resource limits for each container that declares them
    for container in pod.spec.containers:
        if container.resources and container.resources.limits:
            cpu_limit = container.resources.limits.get("cpu", "not set")
            mem_limit = container.resources.limits.get("memory", "not set")
            finding.add_enrichment([
                MarkdownBlock(f"**{container.name}** CPU: {cpu_limit}, Memory: {mem_limit}"),
            ])
    # Attach the last 50 log lines of the first container
    logs = pod.get_logs(container=pod.spec.containers[0].name, tail_lines=50)
    if logs:
        finding.add_enrichment([
            FileBlock("pod-logs.txt", logs.encode()),
        ])
    event.add_finding(finding)
# Register custom playbook
customPlaybooks:
  - triggers:
      - on_prometheus_alert:
          alert_name: ServiceDegraded
    actions:
      - investigate_service: {}
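Custom actions are easier to test when their formatting logic lives in plain functions that need neither a cluster nor the Robusta runtime. The sketch below mirrors the resource-limit loop from the playbook using plain dicts; `format_limits` and its input shape are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, List


def format_limits(containers: List[Dict[str, Any]]) -> List[str]:
    """Build one Markdown line per container that declares resource limits."""
    lines = []
    for container in containers:
        limits = (container.get("resources") or {}).get("limits")
        if not limits:
            continue  # mirror the playbook: skip containers without limits
        cpu = limits.get("cpu", "not set")
        memory = limits.get("memory", "not set")
        lines.append(f"**{container['name']}** CPU: {cpu}, Memory: {memory}")
    return lines
```

The action itself can then stay a thin wrapper that converts pod data into this shape and wraps each returned line in a `MarkdownBlock`.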
Robusta CLI Operations
# Trigger a playbook action manually (parameters are passed as key=value pairs)
robusta playbooks trigger pod_enricher name=my-pod namespace=production
# List running playbooks
robusta playbooks list
# View Robusta runner logs
robusta logs
# Deploy a crashing demo pod to exercise the alert pipeline end to end (a notification should reach the configured sinks)
robusta demo
# Generate test alert to verify pipeline
robusta demo --alert-name TestAlert
# Upgrade Robusta
helm upgrade robusta robusta/robusta \
--namespace robusta \
--values generated_values.yaml
Multi-Cluster Setup
# Each cluster gets its own Robusta installation
# All report to the same Robusta SaaS UI
# Cluster 1 values
clusterName: "us-east-production"
globalConfig:
  account_id: "shared-account-id"
# Cluster 2 values
clusterName: "eu-west-production"
globalConfig:
  account_id: "shared-account-id"
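When many clusters share one account, the per-cluster values can be generated from a common base instead of being edited by hand. The sketch below assumes only the cluster name differs between clusters and sets it through the `clusterName` Helm value (the same key passed with `--set clusterName=...` during installation); `per_cluster_values` is a hypothetical helper.

```python
import copy
from typing import Any, Dict, List


def per_cluster_values(
    base: Dict[str, Any], cluster_names: List[str]
) -> Dict[str, Dict[str, Any]]:
    """Return one values mapping per cluster, differing only in the cluster name."""
    rendered = {}
    for name in cluster_names:
        values = copy.deepcopy(base)  # avoid sharing nested dicts between clusters
        values["clusterName"] = name
        rendered[name] = values
    return rendered
```

Each resulting mapping can be written to its own values file and passed to `helm install` or `helm upgrade` with `--values`.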
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| No alerts in Slack | Slack token or channel wrong | Verify bot token has chat:write and channel exists |
| Playbook not triggering | Trigger conditions don’t match | Check alert name regex and namespace filters |
| OOM enrichment missing graphs | Prometheus not accessible | Verify Robusta can reach Prometheus URL |
| Custom playbook error | Python syntax or import issue | Check runner logs: kubectl logs -n robusta -l app=robusta-runner |
| High memory usage | Too many playbooks or large log collection | Reduce container_tail_lines and limit active playbooks |
| Duplicate notifications | Multiple sinks matching same severity | Use match filters to route alerts to specific sinks |
| Deployment tracking missed | Webhook not catching all events | Ensure Robusta has watch permissions on deployments |
| Jira tickets not created | API credentials incorrect | Verify Jira URL, username, and API token |
# Debug: check Robusta runner logs
kubectl logs -n robusta -l app=robusta-runner --tail=100
# Debug: check Robusta forwarder
kubectl logs -n robusta -l app=robusta-forwarder --tail=50
# Verify Prometheus connectivity from Robusta
kubectl exec -n robusta deploy/robusta-runner -- \
curl -s "http://prometheus-server.monitoring:9090/api/v1/query?query=up"
# Test playbook execution
robusta playbooks trigger event_enricher
# Check playbook configuration
kubectl get configmap -n robusta robusta-playbooks -o yaml
# Restart Robusta components
kubectl rollout restart deployment -n robusta robusta-runner
kubectl rollout restart deployment -n robusta robusta-forwarder
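The Prometheus connectivity check above returns a JSON body that is tedious to read by eye. The sketch below parses the standard instant-query response for `up` and lists targets that are down; `down_targets` is a hypothetical helper, and it assumes each series carries an `instance` label.

```python
import json
from typing import List


def down_targets(response_body: str) -> List[str]:
    """List the `instance` label of every series whose `up` value is not 1.

    Expects the JSON body returned by /api/v1/query?query=up.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # instant vector sample: [ts, "value"]
        if float(value) != 1.0:
            down.append(series["metric"].get("instance", "<no instance label>"))
    return down
```

A successful parse already shows that the runner can reach Prometheus; a non-empty result points at scrape targets to investigate next.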