Sloth Cheat Sheet
Overview
Sloth is an SLO (Service Level Objective) framework that generates Prometheus recording rules and alerts from simple, declarative SLO specifications. Instead of hand-writing multi-window, multi-burn-rate alerting rules, you give Sloth a high-level SLO definition (service, SLI queries, target percentage, and alerting configuration) and it outputs the complete set of Prometheus rules for SLO monitoring, following the multi-window, multi-burn-rate approach described in Google’s SRE Workbook.
Sloth supports several SLI types (events, raw, and plugin) and generates recording rules for 5m, 30m, 1h, 2h, 6h, 1d, 3d, and 30d windows, along with alerting rules at different burn rates. The generated rules feed Grafana dashboards that show error budgets, burn rates, and SLO compliance over time. By treating SLOs as code, teams can version-control their reliability targets, review changes in pull requests, and keep them consistent across services.
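To make the burn-rate idea concrete, the arithmetic behind a 99.9% objective over a 30-day period can be sketched in shell. This is a hand-rolled illustration, not part of Sloth: the function names are hypothetical, and the factors 14.4 and 6 are assumed here as the page-level burn rates from the SRE Workbook's multi-window example.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not Sloth commands).

# Minutes of full outage allowed by an objective (percent) over a period (days).
error_budget_minutes() {
  awk -v obj="$1" -v days="$2" 'BEGIN { printf "%.1f\n", (100 - obj) / 100 * days * 24 * 60 }'
}

# Error ratio at which an alert with the given burn-rate factor fires.
burn_rate_threshold() {
  awk -v obj="$1" -v factor="$2" 'BEGIN { printf "%.4f\n", (100 - obj) / 100 * factor }'
}

# Hours until the whole budget is gone if the burn rate stays constant.
hours_to_exhaustion() {
  awk -v days="$1" -v factor="$2" 'BEGIN { printf "%.0f\n", days * 24 / factor }'
}

error_budget_minutes 99.9 30    # 43.2
burn_rate_threshold 99.9 14.4   # 0.0144
burn_rate_threshold 99.9 6      # 0.0060
hours_to_exhaustion 30 14.4     # 50
```

A burn rate of 1 spends the budget exactly over the full period; a burn rate of 14.4 spends it in about 50 hours, which is why it is treated as page-worthy.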
Installation
# macOS
brew install slok/sloth/sloth
# Linux binary (writing to /usr/local/bin usually requires root)
sudo curl -L "https://github.com/slok/sloth/releases/latest/download/sloth-linux-amd64" -o /usr/local/bin/sloth
sudo chmod +x /usr/local/bin/sloth
# Go install
go install github.com/slok/sloth/cmd/sloth@latest
# Docker
docker pull ghcr.io/slok/sloth:latest
# Kubernetes controller (watches SLO CRDs)
kubectl apply -f https://raw.githubusercontent.com/slok/sloth/main/deploy/kubernetes/raw/sloth.yaml
# Verify installation
sloth version
Core Commands
Generating Prometheus Rules
# Generate rules from SLO spec
sloth generate -i slo.yaml -o rules.yaml
# Generate from multiple files
sloth generate -i slos/ -o prometheus-rules/
# Generate and output to stdout
sloth generate -i slo.yaml
# Validate SLO spec without generating
sloth validate -i slo.yaml
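# Generate with a non-default SLO period (default 30d). 28d and 30d windows ship with Sloth;
# other periods need custom window definitions via --slo-period-windows-path
sloth generate -i slo.yaml -o rules.yaml --default-slo-period 28d
# Generate with extra labels (repeat the flag once per key=value pair)
sloth generate -i slo.yaml -o rules.yaml \
  --extra-labels "env=production" \
  --extra-labels "cluster=us-east"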
# Docker usage
docker run --rm -v "$(pwd)":/data ghcr.io/slok/sloth:latest \
  generate -i /data/slo.yaml -o /data/rules.yaml
Core Concepts — SLO Definitions
Basic SLO Specification
# slo.yaml: Basic availability SLO
version: "prometheus/v1"
service: "payment-api"
labels:
  team: payments
  tier: "0"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "99.9% of payment API requests succeed"
    sli:
      events:
        error_query: sum(rate(http_requests_total{service="payment-api",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="payment-api"}[{{.window}}]))
    alerting:
      name: PaymentAPIHighErrorRate
      labels:
        severity: critical
        team: payments
      annotations:
        summary: "Payment API error rate burning SLO budget"
        runbook: "https://runbooks.internal/payment-api-errors"
      page_alert:
        labels:
          severity: critical
          channel: pagerduty
      ticket_alert:
        labels:
          severity: warning
          channel: slack
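The `{{.window}}` placeholder in the queries above is filled in once per recording window (5m, 30m, 1h, and so on). The sketch below imitates that substitution with `sed` purely as an illustration; `expand_window` is a hypothetical helper, and Sloth's own templating is not implemented this way.

```shell
#!/bin/sh
# Illustration only: show how one templated query becomes one query per window.
# expand_window is a hypothetical helper, not a Sloth command.
expand_window() {
  # $1 = query containing {{.window}}, $2 = window such as 5m or 1h
  printf '%s\n' "$1" | sed "s/{{\.window}}/$2/g"
}

query='sum(rate(http_requests_total{service="payment-api",code=~"5.."}[{{.window}}]))'
for w in 5m 30m 1h; do
  expand_window "$query" "$w"
done
```

Pasting one of the expanded queries into the Prometheus expression browser is a quick way to check that the SLI returns data before generating rules.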
Latency SLO
# latency-slo.yaml
version: "prometheus/v1"
service: "checkout-api"
labels:
  team: commerce
slos:
  - name: "requests-latency"
    objective: 99.0
    description: "99% of checkout requests complete within 500ms"
    sli:
      events:
        # Error events = all requests minus those that finished within 500ms
        error_query: >
          sum(rate(http_request_duration_seconds_count{
            service="checkout-api"
          }[{{.window}}]))
          -
          sum(rate(http_request_duration_seconds_bucket{
            service="checkout-api",
            le="0.5"
          }[{{.window}}]))
        total_query: >
          sum(rate(http_request_duration_seconds_count{
            service="checkout-api"
          }[{{.window}}]))
    alerting:
      name: CheckoutLatencyBudgetBurn
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
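For a histogram-based latency SLI, the error events are the requests that did not land in the `le` bucket, that is, total minus good. A small worked example with made-up request rates shows the resulting error ratio; `latency_error_ratio` is a hypothetical helper used only for the arithmetic.

```shell
#!/bin/sh
# latency_error_ratio is a hypothetical helper for illustration.
# $1 = total request rate, $2 = rate of requests in the le="0.5" bucket (good events)
latency_error_ratio() {
  awk -v total="$1" -v good="$2" 'BEGIN {
    if (total == 0) { print "NaN"; exit }   # no traffic: ratio is undefined
    printf "%.4f\n", (total - good) / total
  }'
}

latency_error_ratio 1000 990   # 0.0100, exactly at a 99% objective
latency_error_ratio 1000 950   # 0.0500, burning budget 5x faster than allowed
```

Subtracting in the other order (good minus total) yields a negative ratio, which is a common reason a latency SLO never alerts.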
Multiple SLOs Per Service
# multi-slo.yaml
version: "prometheus/v1"
service: "api-gateway"
labels:
  team: platform
  environment: production
slos:
  - name: "availability"
    objective: 99.95
    description: "API gateway availability"
    sli:
      events:
        error_query: sum(rate(envoy_cluster_upstream_rq{response_code_class!="2xx",cluster="api-gateway"}[{{.window}}]))
        total_query: sum(rate(envoy_cluster_upstream_rq{cluster="api-gateway"}[{{.window}}]))
    alerting:
      name: APIGatewayAvailability
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
  - name: "latency-p99"
    objective: 99.0
    description: "99% of requests under 200ms"
    sli:
      events:
        # Error events = all requests minus those that finished within 200ms
        error_query: >
          sum(rate(envoy_cluster_upstream_rq_time_count{cluster="api-gateway"}[{{.window}}]))
          -
          sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster="api-gateway",le="200"}[{{.window}}]))
        total_query: sum(rate(envoy_cluster_upstream_rq_time_count{cluster="api-gateway"}[{{.window}}]))
    alerting:
      name: APIGatewayLatency
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
  - name: "throughput"
    objective: 99.5
    description: "API gateway handles expected throughput"
    sli:
      raw:
        error_ratio_query: >
          1 - clamp_max(
            sum(rate(envoy_cluster_upstream_rq{cluster="api-gateway"}[{{.window}}])) /
            1000,
            1
          )
    alerting:
      name: APIGatewayThroughputDrop
      ticket_alert:
        labels:
          severity: warning
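The raw throughput SLI above divides the observed request rate by 1000 (read here as the expected requests per second, which is an interpretation of the query) and caps the result at 1. The same arithmetic can be tried with sample numbers; `throughput_error_ratio` is a hypothetical helper.

```shell
#!/bin/sh
# throughput_error_ratio is a hypothetical helper mirroring
# 1 - clamp_max(rate / expected, 1) from the raw SLI.
throughput_error_ratio() {
  # $1 = observed requests per second, $2 = expected requests per second
  awk -v rps="$1" -v expected="$2" 'BEGIN {
    ratio = rps / expected
    if (ratio > 1) ratio = 1        # clamp_max(..., 1)
    printf "%.2f\n", 1 - ratio
  }'
}

throughput_error_ratio 800 1000    # 0.20
throughput_error_ratio 1200 1000   # 0.00
```

The clamp keeps the error ratio from going negative when traffic exceeds the expected level.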
Configuration
Kubernetes CRD Mode
# Deploy Sloth as a Kubernetes controller
# It watches PrometheusServiceLevel CRDs and generates PrometheusRule CRs
# slo-crd.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payment-api-slos
  namespace: monitoring
spec:
  service: "payment-api"
  labels:
    team: payments
  slos:
    - name: "availability"
      objective: 99.9
      description: "Payment API availability"
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{service="payment-api",code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total{service="payment-api"}[{{.window}}]))
      alerting:
        name: PaymentAPIAvailability
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning
# Apply SLO CRD
kubectl apply -f slo-crd.yaml
# Check generated PrometheusRule
kubectl get prometheusrules -n monitoring | grep payment-api
# View generated rules
kubectl get prometheusrule payment-api-slos -n monitoring -o yaml
Integration with Prometheus
# Load generated rules into Prometheus
# prometheus.yml
rule_files:
- "/etc/prometheus/rules/slo-rules/*.yaml"
# Or if using Prometheus Operator, the PrometheusRule CRs
# are automatically picked up
Grafana Dashboard
# Sloth generates recording rules with standardized metric names:
# slo:sli_error:ratio_rate5m
# slo:sli_error:ratio_rate30m
# slo:sli_error:ratio_rate1h
# slo:sli_error:ratio_rate2h
# slo:sli_error:ratio_rate6h
# slo:sli_error:ratio_rate1d
# slo:sli_error:ratio_rate3d
# slo:sli_error:ratio_rate30d
# slo:objective:ratio
# slo:error_budget:ratio
# slo:time_period:days
# slo:current_burn_rate:ratio
# slo:period_burn_rate:ratio
# slo:period_error_budget_remaining:ratio
# Import the Sloth Grafana dashboard
# Dashboard ID: 14348 (Sloth SLO Dashboard)
# Or download from:
# https://grafana.com/grafana/dashboards/14348
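The derived series in the list above can be read as simple ratios of the base ones. The sketch below works through one example with made-up numbers; the helper names are hypothetical, and the formulas are an assumption about how the generated rules relate to each other, so compare them with your own rules.yaml.

```shell
#!/bin/sh
# Hypothetical helpers; the formulas are an assumption about how the derived
# series relate to the base ones, so check them against your generated rules.

# burn rate = observed error ratio / error budget, where budget = 1 - objective
burn_rate() {
  # $1 = error ratio (e.g. the value of slo:sli_error:ratio_rate30d), $2 = objective percent
  awk -v err="$1" -v obj="$2" 'BEGIN { printf "%.2f\n", err / ((100 - obj) / 100) }'
}

# remaining budget = 1 - burn rate measured over the whole period
budget_remaining() {
  awk -v br="$1" 'BEGIN { printf "%.2f\n", 1 - br }'
}

burn_rate 0.0004 99.9   # 0.40
budget_remaining 0.40   # 0.60
```

In this example a 30-day error ratio of 0.04% against a 99.9% objective has used 40% of the budget, leaving 60%.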
Advanced Usage
SLI Plugins
# Using shared SLI plugins for common patterns
# (plugins must be available in a directory passed via --sli-plugins-path)
# HTTP availability plugin
version: "prometheus/v1"
service: "my-api"
slos:
  - name: "availability"
    objective: 99.9
    sli:
      plugin:
        id: "sloth-common/http/availability"
        options:
          filter: 'job="my-api"'
          code_regex: "5.."
    alerting:
      name: MyAPIAvailability
      page_alert: {}
      ticket_alert: {}
---
# gRPC availability plugin
version: "prometheus/v1"
service: "my-grpc-service"
slos:
  - name: "availability"
    objective: 99.9
    sli:
      plugin:
        id: "sloth-common/grpc/availability"
        options:
          filter: 'job="my-grpc-service"'
          grpc_status_regex: "(Unknown|Internal|Unavailable)"
    alerting:
      name: GRPCServiceAvailability
      page_alert: {}
      ticket_alert: {}
Custom SLI Plugins
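# Create a custom SLI plugin. Sloth SLI plugins (prometheus/v1) are Go source files
# named plugin.go that return a raw error-ratio query; only the Go standard library is available.
mkdir -p ~/.sloth/plugins/custom-kafka-lag
cat > ~/.sloth/plugins/custom-kafka-lag/plugin.go << 'EOF'
package kafkalag

import (
    "context"
    "fmt"
)

const (
    SLIPluginVersion = "prometheus/v1"
    SLIPluginID      = "custom-kafka-lag"
)

// SLIPlugin returns the fraction of consumer lag series whose average lag
// over the window exceeds max_lag (default 1000).
func SLIPlugin(ctx context.Context, meta, labels, options map[string]string) (string, error) {
    filter := options["filter"]
    maxLag := options["max_lag"]
    if maxLag == "" {
        maxLag = "1000"
    }
    return fmt.Sprintf(`(
  count(avg_over_time(kafka_consumer_lag{%[1]s}[{{.window}}]) > %[2]s)
  or vector(0)
)
/
count(avg_over_time(kafka_consumer_lag{%[1]s}[{{.window}}]))`, filter, maxLag), nil
}
EOF
# Use custom plugin (pass --sli-plugins-path ~/.sloth/plugins to sloth generate)
# sli:
#   plugin:
#     id: "custom-kafka-lag"
#     options:
#       filter: 'consumer_group="my-consumer"'
#       max_lag: "5000"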
CI/CD Validation
#!/bin/bash
# validate-slos.sh: CI pipeline step to validate SLO changes
set -euo pipefail
echo "Validating SLO specifications..."
for file in slos/*.yaml; do
  echo "Checking: $file"
  # Test the command directly: under `set -e` a separate `$?` check would never run
  if ! sloth validate -i "$file"; then
    echo "FAILED: $file has validation errors"
    exit 1
  fi
done
echo "Generating rules to verify output..."
sloth generate -i slos/ -o /tmp/generated-rules/
echo "All SLO specs are valid."
# Optionally check for objective changes
git diff --name-only HEAD~1 -- 'slos/*.yaml' | while IFS= read -r changed; do
  # Skip deleted files; newly added files have no previous version, so tolerate a failed lookup
  [ -f "$changed" ] || continue
  OLD_OBJ=$(git show HEAD~1:"$changed" 2>/dev/null | grep "objective:" | awk '{print $2}' || true)
  NEW_OBJ=$(grep "objective:" "$changed" | awk '{print $2}' || true)
  if [ "$OLD_OBJ" != "$NEW_OBJ" ]; then
    echo "WARNING: SLO objective changed in $changed: $OLD_OBJ -> $NEW_OBJ"
  fi
done
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Generated rules not loading | Prometheus rule_files path wrong | Verify path in prometheus.yml matches output location |
| Alert never firing | Burn rate windows too conservative | Check multi-burn-rate thresholds; verify SLI queries return data |
| Error budget always 100% | SLI error_query returns 0 | Verify error_query matches actual error metrics |
| Recording rules showing NaN | No matching time series | Ensure service metrics are being scraped |
| {{.window}} not expanding | Wrong template syntax | Use {{.window}} not {{ .window }} in queries |
| CRD not generating PrometheusRule | Sloth controller not running | Check sloth controller pods: kubectl get pods -n monitoring |
| Dashboard showing no data | Recording rules not yet evaluated | Wait for Prometheus evaluation cycle (default 1m) |
| Plugin not found | Plugin not in search path | Set --sli-plugins-path or use ~/.sloth/plugins/ |
# Debug: verify generated rules are syntactically valid
promtool check rules rules.yaml
# Debug: verify SLI queries return data
# Test error query (-G with --data-urlencode avoids curl globbing on {} and [] and encodes the query)
curl -sG "http://prometheus:9090/api/v1/query" --data-urlencode 'query=sum(rate(http_requests_total{code=~"5.."}[5m]))'
# Test total query
curl -sG "http://prometheus:9090/api/v1/query" --data-urlencode 'query=sum(rate(http_requests_total[5m]))'
# Check recording rules are being evaluated
curl -s "http://prometheus:9090/api/v1/rules" | jq '.data.groups[] | select(.name | contains("sloth"))'
# Validate all SLO files
find slos/ -name "*.yaml" -exec sloth validate -i {} \;