
Sloth Cheat Sheet

Overview

Sloth is an SLO (Service Level Objective) framework that generates Prometheus recording and alerting rules from simple, declarative SLO specifications. Instead of hand-crafting multi-window, multi-burn-rate alerting rules, you write a high-level SLO definition (service, SLI queries, target percentage, and alerting configuration) and Sloth outputs the complete set of Prometheus rules, following the multi-window, multi-burn-rate approach described in Google's Site Reliability Workbook.

Sloth supports three SLI types (events, raw, and plugin). For the default 30-day period it generates recording rules for 5m, 30m, 1h, 2h, 6h, 1d, 3d, and 30d windows, along with page and ticket alerting rules at different burn rates. The generated rules power Grafana dashboards that show error budgets, burn rates, and SLO compliance over time. By treating SLOs as code, teams can version-control their reliability targets, review changes in pull requests, and maintain consistency across services.
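To make the burn-rate idea concrete, here is a minimal Python sketch of how the alert thresholds can be derived. The window pairs and budget fractions are the ones published in the Site Reliability Workbook for a 30-day period; that Sloth's defaults match them exactly is an assumption, and the names `BurnWindow`, `WINDOWS`, and `should_fire` are illustrative, not part of Sloth.

```python
from dataclasses import dataclass

PERIOD_HOURS = 30 * 24  # 30-day SLO period


@dataclass(frozen=True)
class BurnWindow:
    severity: str
    long_window_hours: float
    short_window: str
    budget_fraction: float  # share of the period's budget consumed within the long window

    @property
    def burn_rate(self) -> float:
        # Burn rate 1 means the budget lasts exactly one full period.
        return self.budget_fraction * PERIOD_HOURS / self.long_window_hours


# Assumed values, taken from the Site Reliability Workbook table.
WINDOWS = [
    BurnWindow("page", 1, "5m", 0.02),
    BurnWindow("page", 6, "30m", 0.05),
    BurnWindow("ticket", 24, "2h", 0.10),
    BurnWindow("ticket", 72, "6h", 0.10),
]


def should_fire(long_error_ratio: float, short_error_ratio: float,
                burn_rate: float, objective_percent: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    budget = 1 - objective_percent / 100
    threshold = burn_rate * budget
    return long_error_ratio > threshold and short_error_ratio > threshold


if __name__ == "__main__":
    for w in WINDOWS:
        print(w.severity, f"{w.long_window_hours}h/{w.short_window}", round(w.burn_rate, 2))
```

Requiring the short window as well as the long one is what lets an alert reset quickly once the error rate drops back.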

Installation

# macOS
brew install slok/sloth/sloth

# Linux binary
curl -L "https://github.com/slok/sloth/releases/latest/download/sloth-linux-amd64" -o /usr/local/bin/sloth
chmod +x /usr/local/bin/sloth

# Go install
go install github.com/slok/sloth/cmd/sloth@latest

# Docker
docker pull ghcr.io/slok/sloth:latest

# Kubernetes controller (watches SLO CRDs)
kubectl apply -f https://raw.githubusercontent.com/slok/sloth/main/deploy/kubernetes/raw/sloth.yaml

# Verify installation
sloth version

Core Commands

Generating Prometheus Rules

# Generate rules from SLO spec
sloth generate -i slo.yaml -o rules.yaml

# Generate from multiple files
sloth generate -i slos/ -o prometheus-rules/

# Generate and output to stdout
sloth generate -i slo.yaml

# Validate SLO spec without generating
sloth validate -i slo.yaml

# Generate with a different default SLO period (default 30d; 28d is also built in)
sloth generate -i slo.yaml -o rules.yaml --default-slo-period 28d

# Generate with extra labels (repeat the flag for each key=value pair)
sloth generate -i slo.yaml -o rules.yaml \
  --extra-labels env=production \
  --extra-labels cluster=us-east

# Docker usage
docker run --rm -v "$(pwd)":/data ghcr.io/slok/sloth:latest \
  generate -i /data/slo.yaml -o /data/rules.yaml

Core Concepts — SLO Definitions

Basic SLO Specification

# slo.yaml — Basic availability SLO
version: "prometheus/v1"
service: "payment-api"
labels:
  team: payments
  tier: "0"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "99.9% of payment API requests succeed"
    sli:
      events:
        error_query: sum(rate(http_requests_total{service="payment-api",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="payment-api"}[{{.window}}]))
    alerting:
      name: PaymentAPIHighErrorRate
      labels:
        severity: critical
        team: payments
      annotations:
        summary: "Payment API error rate burning SLO budget"
        runbook: "https://runbooks.internal/payment-api-errors"
      page_alert:
        labels:
          severity: critical
          channel: pagerduty
      ticket_alert:
        labels:
          severity: warning
          channel: slack
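
As a worked example of what an `objective` value implies, the following Python sketch converts an objective into an error budget and into minutes of full outage per period, and computes the events-style SLI as errors over total. The function names are illustrative and not part of Sloth.

```python
def error_budget_ratio(objective_percent: float) -> float:
    """Fraction of events allowed to fail, e.g. 99.9 -> 0.001."""
    return 1 - objective_percent / 100


def allowed_bad_minutes(objective_percent: float, period_days: int = 30) -> float:
    """Minutes of total outage that would consume the whole budget."""
    return error_budget_ratio(objective_percent) * period_days * 24 * 60


def sli_error_ratio(error_rate: float, total_rate: float) -> float:
    """Events SLI: error_query result divided by total_query result."""
    return error_rate / total_rate if total_rate else 0.0


if __name__ == "__main__":
    print(round(allowed_bad_minutes(99.9), 1))         # minutes per 30 days
    print(sli_error_ratio(0.5, 1000.0) / error_budget_ratio(99.9))  # burn rate
```

A 99.9% objective over 30 days leaves roughly 43 minutes of budget, which is why a `tier: "0"` service usually pairs it with paging alerts.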

Latency SLO

# latency-slo.yaml
version: "prometheus/v1"
service: "checkout-api"
labels:
  team: commerce
slos:
  - name: "requests-latency"
    objective: 99.0
    description: "99% of checkout requests complete within 500ms"
    sli:
      events:
        # Slow requests = all requests minus those that finished within 500ms
        error_query: >
          sum(rate(http_request_duration_seconds_count{
            service="checkout-api"
          }[{{.window}}]))
          -
          sum(rate(http_request_duration_seconds_bucket{
            service="checkout-api",
            le="0.5"
          }[{{.window}}]))
        total_query: >
          sum(rate(http_request_duration_seconds_count{
            service="checkout-api"
          }[{{.window}}]))
    alerting:
      name: CheckoutLatencyBudgetBurn
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
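
For a latency SLO built on a Prometheus histogram, the "bad" events are the requests that did not land in the `le` bucket for the threshold. A minimal Python sketch of that arithmetic, using hypothetical counts rather than real metric data:

```python
def slow_request_ratio(bucket_le_count: float, total_count: float) -> float:
    """Ratio of requests slower than the bucket boundary.

    bucket_le_count: cumulative count of requests at or under the threshold
    total_count:     count of all requests
    """
    if total_count == 0:
        return 0.0
    slow = total_count - bucket_le_count  # never negative for a cumulative histogram
    return slow / total_count


if __name__ == "__main__":
    # 990 of 1000 requests finished within the threshold.
    print(slow_request_ratio(990, 1000))
```

Because histogram buckets are cumulative, the total always has to be the minuend; subtracting in the other order yields a negative error ratio.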

Multiple SLOs Per Service

# multi-slo.yaml
version: "prometheus/v1"
service: "api-gateway"
labels:
  team: platform
  environment: production
slos:
  - name: "availability"
    objective: 99.95
    description: "API gateway availability"
    sli:
      events:
        error_query: sum(rate(envoy_cluster_upstream_rq{response_code_class!="2xx",cluster="api-gateway"}[{{.window}}]))
        total_query: sum(rate(envoy_cluster_upstream_rq{cluster="api-gateway"}[{{.window}}]))
    alerting:
      name: APIGatewayAvailability
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

  - name: "latency-p99"
    objective: 99.0
    description: "99% of requests under 200ms"
    sli:
      events:
        error_query: >
          sum(rate(envoy_cluster_upstream_rq_time_count{cluster="api-gateway"}[{{.window}}]))
          -
          sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster="api-gateway",le="200"}[{{.window}}]))
        total_query: sum(rate(envoy_cluster_upstream_rq_time_count{cluster="api-gateway"}[{{.window}}]))
    alerting:
      name: APIGatewayLatency
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

  - name: "throughput"
    objective: 99.5
    description: "API gateway handles expected throughput"
    sli:
      raw:
        error_ratio_query: >
          1 - clamp_max(
            sum(rate(envoy_cluster_upstream_rq{cluster="api-gateway"}[{{.window}}])) /
            1000,
            1
          )
    alerting:
      name: APIGatewayThroughputDrop
      ticket_alert:
        labels:
          severity: warning
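
The raw `error_ratio_query` in the throughput SLO can be read as "the missing share of an expected 1000 requests per second". A small Python sketch of the same formula, where `expected_rps` mirrors the hard-coded 1000 and is otherwise an illustrative parameter:

```python
def throughput_error_ratio(requests_per_second: float, expected_rps: float = 1000.0) -> float:
    """Mirror of: 1 - clamp_max(rate / expected, 1)."""
    achieved = min(requests_per_second / expected_rps, 1.0)  # clamp_max(..., 1)
    return 1 - achieved


if __name__ == "__main__":
    for rps in (1500, 1000, 750, 0):
        print(rps, throughput_error_ratio(rps))
```

Note that this treats any traffic below the expected rate as budget burn, so quiet periods such as nights or weekends count against the SLO.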

Configuration

Kubernetes CRD Mode

# Deploy Sloth as a Kubernetes controller
# It watches PrometheusServiceLevel CRDs and generates PrometheusRule CRs

# slo-crd.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payment-api-slos
  namespace: monitoring
spec:
  service: "payment-api"
  labels:
    team: payments
  slos:
    - name: "availability"
      objective: 99.9
      description: "Payment API availability"
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{service="payment-api",code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total{service="payment-api"}[{{.window}}]))
      alerting:
        name: PaymentAPIAvailability
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning

# Apply SLO CRD
kubectl apply -f slo-crd.yaml

# Check generated PrometheusRule
kubectl get prometheusrules -n monitoring | grep payment-api

# View generated rules
kubectl get prometheusrule payment-api-slos -n monitoring -o yaml

Integration with Prometheus

# Load generated rules into Prometheus
# prometheus.yml
rule_files:
  - "/etc/prometheus/rules/slo-rules/*.yaml"

# Or if using Prometheus Operator, the PrometheusRule CRs
# are automatically picked up

Grafana Dashboard

# Sloth generates recording rules with standardized metric names:
# slo:sli_error:ratio_rate5m
# slo:sli_error:ratio_rate30m
# slo:sli_error:ratio_rate1h
# slo:sli_error:ratio_rate2h
# slo:sli_error:ratio_rate6h
# slo:sli_error:ratio_rate1d
# slo:sli_error:ratio_rate3d
# slo:sli_error:ratio_rate30d
# slo:objective:ratio
# slo:error_budget:ratio
# slo:time_period:days
# slo:current_burn_rate:ratio
# slo:period_burn_rate:ratio
# slo:period_error_budget_remaining:ratio

# Import the Sloth Grafana dashboard
# Dashboard ID: 14348 (Sloth SLO Dashboard)
# Or download from:
# https://grafana.com/grafana/dashboards/14348
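
The derived metrics in the list above relate to each other in a simple way. The Python sketch below states those relationships as I understand them from the generated rules (burn rate is error ratio divided by budget, remaining budget is one minus the period burn rate); treat the exact formulas as an assumption to verify against your own generated output.

```python
def current_burn_rate(sli_error_ratio_5m: float, error_budget_ratio: float) -> float:
    """Assumed: slo:current_burn_rate:ratio = 5m error ratio / error budget."""
    return sli_error_ratio_5m / error_budget_ratio


def period_burn_rate(sli_error_ratio_30d: float, error_budget_ratio: float) -> float:
    """Assumed: slo:period_burn_rate:ratio = 30d error ratio / error budget."""
    return sli_error_ratio_30d / error_budget_ratio


def period_error_budget_remaining(period_burn: float) -> float:
    """Assumed: slo:period_error_budget_remaining:ratio = 1 - period burn rate."""
    return 1 - period_burn


if __name__ == "__main__":
    budget = 0.001  # objective 99.9
    burn = period_burn_rate(0.0004, budget)
    print(round(burn, 3), round(period_error_budget_remaining(burn), 3))
```

A remaining value below zero means the budget for the period is already overspent.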

Advanced Usage

SLI Plugins

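# Using SLI plugins for common patterns
# Plugins are loaded from the directories passed with --sli-plugins-path;
# each id below must match a plugin found there.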

# HTTP availability plugin
version: "prometheus/v1"
service: "my-api"
slos:
  - name: "availability"
    objective: 99.9
    sli:
      plugin:
        id: "sloth-common/http/availability"
        options:
          filter: 'job="my-api"'
          code_regex: "5.."
    alerting:
      name: MyAPIAvailability
      page_alert: {}
      ticket_alert: {}

---
# gRPC availability plugin
version: "prometheus/v1"
service: "my-grpc-service"
slos:
  - name: "availability"
    objective: 99.9
    sli:
      plugin:
        id: "sloth-common/grpc/availability"
        options:
          filter: 'job="my-grpc-service"'
          grpc_status_regex: "(Unknown|Internal|Unavailable)"
    alerting:
      name: GRPCServiceAvailability
      page_alert: {}
      ticket_alert: {}

Custom SLI Plugins

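# Create a custom SLI plugin.
# A Sloth SLI plugin is a Go file named plugin.go (standard library only)
# whose SLIPlugin function returns a raw error-ratio query.
mkdir -p ~/.sloth/plugins/custom-kafka-lag

cat > ~/.sloth/plugins/custom-kafka-lag/plugin.go << 'EOF'
// Custom plugin for a Kafka consumer lag SLI.
package kafkalag

import (
    "context"
    "fmt"
)

const (
    SLIPluginVersion = "prometheus/v1"
    SLIPluginID      = "custom-kafka-lag"
)

// SLIPlugin returns the fraction of matching series whose average lag
// over the window exceeds max_lag (default 1000).
func SLIPlugin(ctx context.Context, meta, labels, options map[string]string) (string, error) {
    filter := options["filter"]
    if filter == "" {
        return "", fmt.Errorf("filter option is required")
    }
    maxLag := options["max_lag"]
    if maxLag == "" {
        maxLag = "1000"
    }
    return fmt.Sprintf(`
(count(avg_over_time(kafka_consumer_lag{%[1]s}[{{.window}}]) > %[2]s) or vector(0))
/
count(kafka_consumer_lag{%[1]s})
`, filter, maxLag), nil
}
EOF

# Use custom plugin
# sli:
#   plugin:
#     id: "custom-kafka-lag"
#     options:
#       filter: 'consumer_group="my-consumer"'
#       max_lag: "5000"

# Point Sloth at the plugin directory when generating
sloth generate -i slo.yaml -o rules.yaml --sli-plugins-path ~/.sloth/plugins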

CI/CD Validation

#!/bin/bash
# validate-slos.sh — CI pipeline step to validate SLO changes
set -euo pipefail

echo "Validating SLO specifications..."
for file in slos/*.yaml; do
  echo "Checking: $file"
  # Test the command directly: with `set -e`, a separate `$?` check would never run
  if ! sloth validate -i "$file"; then
    echo "FAILED: $file has validation errors"
    exit 1
  fi
done

echo "Generating rules to verify output..."
sloth generate -i slos/ -o /tmp/generated-rules/
echo "All SLO specs are valid."

# Optionally check for objective changes
git diff --name-only HEAD~1 -- 'slos/*.yaml' | while read -r changed; do
  # Skip deleted files; new files have no previous version, so tolerate a failed lookup
  [ -f "$changed" ] || continue
  OLD_OBJ=$(git show HEAD~1:"$changed" 2>/dev/null | grep "objective:" | awk '{print $2}' || true)
  NEW_OBJ=$(grep "objective:" "$changed" | awk '{print $2}' || true)
  if [ "$OLD_OBJ" != "$NEW_OBJ" ]; then
    echo "WARNING: SLO objective changed in $changed: $OLD_OBJ -> $NEW_OBJ"
  fi
done
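
The grep-based objective check compares all `objective:` lines of a file as one string, so it cannot say which SLO changed. A hedged alternative is to key objectives by SLO name. The Python sketch below uses only the standard library and simple regular expressions, so it assumes the flat `- name:` / `objective:` layout shown in this cheat sheet and is not a general YAML parser; the function names are illustrative.

```python
import re

# Matches list items such as:   - name: "availability"
_NAME = re.compile(r'^\s*-\s*name:\s*["\']?([^"\'\s]+)["\']?\s*$')
# Matches lines such as:        objective: 99.9
_OBJECTIVE = re.compile(r'^\s*objective:\s*([0-9.]+)\s*$')


def objectives_by_slo(spec_text: str) -> dict[str, float]:
    """Map each SLO name to its objective."""
    result: dict[str, float] = {}
    current = None
    for line in spec_text.splitlines():
        if m := _NAME.match(line):
            current = m.group(1)
        elif (m := _OBJECTIVE.match(line)) and current is not None:
            result[current] = float(m.group(1))
    return result


def changed_objectives(old_text: str, new_text: str) -> dict[str, tuple]:
    """Return {slo_name: (old, new)} for added, removed, or changed objectives."""
    old, new = objectives_by_slo(old_text), objectives_by_slo(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }
```

In a pipeline, the two inputs would be the output of `git show HEAD~1:<file>` and the current file contents.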

Troubleshooting

Issue | Cause | Solution
Generated rules not loading | Prometheus rule_files path wrong | Verify that the path in prometheus.yml matches the output location
Alert never firing | Burn rate windows too conservative | Check the multi-burn-rate thresholds; verify that the SLI queries return data
Error budget always 100% | SLI error_query returns 0 | Verify that error_query matches the actual error metrics
Recording rules showing NaN | No matching time series | Ensure the service metrics are being scraped
{{.window}} not expanding | Wrong template syntax | Use the Go template form {{.window}} with the leading dot; spacing inside the braces does not matter
CRD not generating PrometheusRule | Sloth controller not running | Check the Sloth controller pods: kubectl get pods -n monitoring
Dashboard showing no data | Recording rules not yet evaluated | Wait for the Prometheus evaluation cycle (evaluation_interval, default 1m)
Plugin not found | Plugin not in search path | Pass the plugin directory with --sli-plugins-path, e.g. ~/.sloth/plugins/

# Debug: verify generated rules are syntactically valid
promtool check rules rules.yaml

# Debug: verify SLI queries return data
# Test error query (-G with --data-urlencode encodes the PromQL and avoids curl globbing on { } [ ])
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{code=~"5.."}[5m]))'

# Test total query
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total[5m]))'

# Check recording rules are being evaluated
curl -s "http://prometheus:9090/api/v1/rules" | jq '.data.groups[] | select(.name | contains("sloth"))'

# Validate all SLO files
find slos/ -name "*.yaml" -exec sloth validate -i {} \;
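
PromQL contains characters such as `{`, `[`, and `"` that must be URL-encoded when a query is sent to the Prometheus HTTP API. When scripting these checks outside the shell, a minimal Python sketch for building a safe query URL could look like this; `prometheus_query_url` is an illustrative helper, and the host name is the same placeholder used in the commands above.

```python
from urllib.parse import urlencode


def prometheus_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL with the PromQL expression percent-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    query = 'sum(rate(http_requests_total{code=~"5.."}[5m]))'
    print(prometheus_query_url("http://prometheus:9090", query))
```

The resulting URL can be passed to any HTTP client without further quoting.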