Apache Druid Cheat Sheet

Overview

Apache Druid is a high-performance, real-time analytics database designed for workflows where fast queries and ingest are important. It combines ideas from time-series databases, data warehouses, and search systems to deliver sub-second OLAP queries at scale. Druid stores data in a custom columnar format with automatic indexing, bitmap indexes, and dictionary encoding, enabling it to scan billions of rows in milliseconds.

Druid excels at powering user-facing analytics applications, real-time dashboards, and operational monitoring where query latency matters. It supports both streaming ingestion from Kafka/Kinesis and batch ingestion from HDFS/S3. The architecture separates query processing, data management, and ingestion into independent services that can be scaled independently. Druid integrates with BI tools via its SQL interface and supports native JSON queries for advanced analytics.

Installation

Docker Quickstart

# Clone Druid repository for docker-compose
git clone https://github.com/apache/druid.git
cd druid
git checkout 29.0.1

# Start with Docker Compose (single-server micro-quickstart)
docker compose -f distribution/docker/docker-compose.yml up -d

# Access Druid Console at http://localhost:8888

Standalone Installation

# Download Druid
wget https://dlcdn.apache.org/druid/29.0.1/apache-druid-29.0.1-bin.tar.gz
tar -xzf apache-druid-29.0.1-bin.tar.gz
cd apache-druid-29.0.1

# Start single-server (micro-quickstart)
./bin/start-druid

# Or start with specific configuration
./bin/start-druid -c conf/druid/single-server/medium

Kubernetes via Helm

helm repo add druid https://druid-io.github.io/druid-helm/
helm install druid druid/druid \
    --namespace druid \
    --create-namespace \
    --set image.tag=29.0.1

SQL Queries

-- Basic aggregation query
SELECT
    DATE_TRUNC('hour', __time) AS hour,
    country,
    COUNT(*) AS event_count,
    SUM(revenue) AS total_revenue,
    AVG(latency_ms) AS avg_latency
FROM events
WHERE __time >= TIMESTAMP '2024-06-01'
  AND __time < TIMESTAMP '2024-07-01'
GROUP BY 1, 2
ORDER BY total_revenue DESC
LIMIT 100;

-- Approximate count distinct
SELECT
    page,
    APPROX_COUNT_DISTINCT(user_id) AS unique_visitors,
    COUNT(*) AS page_views
FROM web_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY page
HAVING COUNT(*) > 100
ORDER BY unique_visitors DESC;

-- Time-series with granularity
SELECT
    TIME_FLOOR(__time, 'PT5M') AS five_min_bucket,
    COUNT(*) AS requests,
    APPROX_QUANTILE_DS(response_time, 0.99) AS p99_response
FROM api_requests
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1
ORDER BY 1;

-- Top-N query
SELECT
    product_category,
    SUM(sales_amount) AS total_sales
FROM sales
WHERE __time >= TIMESTAMP '2024-01-01'
GROUP BY product_category
ORDER BY total_sales DESC
LIMIT 10;

Ingestion

Kafka Streaming Ingestion

{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": {
        "bootstrap.servers": "kafka:9092"
      },
      "topic": "events",
      "inputFormat": {
        "type": "json"
      },
      "useEarliestOffset": true
    },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsPerSegment": 5000000,
      "maxRowsInMemory": 100000,
      "intermediatePersistPeriod": "PT10M"
    },
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": {
        "column": "event_time",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
          "user_id",
          "event_type",
          "country",
          {"name": "device", "type": "string"},
          {"name": "is_mobile", "type": "long"}
        ]
      },
      "granularitySpec": {
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      },
      "metricsSpec": [
        {"type": "count", "name": "count"},
        {"type": "longSum", "name": "total_events", "fieldName": "event_count"},
        {"type": "doubleSum", "name": "revenue", "fieldName": "amount"},
        {"type": "HLLSketchBuild", "name": "unique_users", "fieldName": "user_id"}
      ]
    }
  }
}

Batch Ingestion (S3)

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "uris": ["s3://bucket/data/2024/06/"],
        "properties": {
          "accessKeyId": {"type": "environment", "variable": "AWS_ACCESS_KEY_ID"},
          "secretAccessKey": {"type": "environment", "variable": "AWS_SECRET_ACCESS_KEY"}
        }
      },
      "inputFormat": {
        "type": "parquet"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4,
      "partitionsSpec": {
        "type": "hashed",
        "numShards": 8,
        "partitionDimensions": ["country"]
      }
    },
    "dataSchema": {
      "dataSource": "historical_events",
      "timestampSpec": {"column": "timestamp", "format": "iso"},
      "dimensionsSpec": {"dimensions": ["user_id", "event_type", "country"]},
      "granularitySpec": {"segmentGranularity": "DAY", "queryGranularity": "HOUR"}
    }
  }
}

API Commands

Endpoint	Description
`POST /druid/v2/sql`	Execute SQL query
`POST /druid/indexer/v1/supervisor`	Submit streaming ingestion supervisor
`GET /druid/indexer/v1/supervisor`	List all supervisors
`POST /druid/indexer/v1/supervisor/{id}/suspend`	Suspend a supervisor
`POST /druid/indexer/v1/supervisor/{id}/resume`	Resume a supervisor
`POST /druid/indexer/v1/supervisor/{id}/terminate`	Terminate a supervisor
`POST /druid/indexer/v1/task`	Submit a batch ingestion task
`GET /druid/indexer/v1/task/{id}/status`	Get task status
`GET /druid/coordinator/v1/datasources`	List all datasources
`DELETE /druid/coordinator/v1/datasources/{ds}`	Disable a datasource
`POST /druid/coordinator/v1/datasources/{ds}`	Enable a datasource
`GET /status`	Get server status and version

# Execute SQL query via API
curl -X POST http://localhost:8888/druid/v2/sql \
    -H "Content-Type: application/json" \
    -d '{"query": "SELECT COUNT(*) FROM events WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '\''1'\'' HOUR"}'

# Submit ingestion spec
curl -X POST http://localhost:8888/druid/indexer/v1/supervisor \
    -H "Content-Type: application/json" \
    -d @kafka-ingestion.json

# Check supervisor status
curl http://localhost:8888/druid/indexer/v1/supervisor/events/status | jq

Configuration

Common Runtime Properties

# common.runtime.properties
druid.extensions.loadList=["druid-kafka-indexing-service","druid-s3-extensions","druid-hdfs-storage","druid-parquet-extensions"]

# Deep storage (S3)
druid.storage.type=s3
druid.storage.bucket=druid-segments
druid.storage.baseKey=druid/segments
druid.s3.accessKey=...
druid.s3.secretKey=...

# Metadata storage (PostgreSQL)
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://postgres:5432/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=druid

# ZooKeeper
druid.zk.service.host=zk1:2181,zk2:2181,zk3:2181

Advanced Usage

Rollup and Pre-Aggregation

{
  "granularitySpec": {
    "segmentGranularity": "HOUR",
    "queryGranularity": "MINUTE",
    "rollup": true
  },
  "metricsSpec": [
    {"type": "count", "name": "count"},
    {"type": "longSum", "name": "total_bytes", "fieldName": "bytes"},
    {"type": "doubleMin", "name": "min_latency", "fieldName": "latency"},
    {"type": "doubleMax", "name": "max_latency", "fieldName": "latency"},
    {"type": "thetaSketch", "name": "unique_users", "fieldName": "user_id", "size": 16384}
  ]
}

Compaction

{
  "type": "compact",
  "dataSource": "events",
  "interval": "2024-06-01/2024-07-01",
  "tuningConfig": {
    "type": "index_parallel",
    "maxRowsPerSegment": 5000000,
    "maxNumConcurrentSubTasks": 4,
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": ["country"],
      "targetRowsPerSegment": 5000000
    }
  }
}

Multi-Stage Query Engine (MSQ)

-- Insert using MSQ (for batch ETL within Druid)
INSERT INTO daily_summary
SELECT
    TIME_FLOOR(__time, 'P1D') AS __time,
    country,
    COUNT(*) AS event_count,
    SUM(revenue) AS total_revenue
FROM events
WHERE __time >= TIMESTAMP '2024-06-01'
  AND __time < TIMESTAMP '2024-07-01'
GROUP BY 1, 2
PARTITIONED BY DAY
CLUSTERED BY country;

-- Replace (overwrite) segments
REPLACE INTO daily_summary OVERWRITE ALL
SELECT ... FROM events
PARTITIONED BY DAY;

Troubleshooting

Issue	Solution
Slow queries	Check segment sizes (target ~5M rows). Verify partition pruning on `__time`
Ingestion lag	Increase `maxRowsInMemory` and task count. Check Kafka consumer lag
Segments not loading	Check coordinator logs. Verify deep storage connectivity and Historical capacity
OOM on Historical nodes	Increase `-Xmx` in jvm.config. Check segment cache size configuration
Query timeout	Increase `druid.server.http.defaultQueryTimeout`. Optimize query to use rollup
Supervisor in UNHEALTHY state	Check Kafka connectivity. Restart supervisor via API
Compaction not running	Verify auto-compaction is configured. Check coordinator duty cycle
High query latency variance	Check for segment skew. Use range partitioning for better distribution
Metadata DB connection issues	Verify PostgreSQL/MySQL connectivity. Check connection pool settings
ZooKeeper session expired	Check ZK cluster health. Increase `druid.zk.service.sessionTimeoutMs`