Apache Pinot Cheat Sheet

Overview

Apache Pinot is a real-time distributed OLAP datastore designed to deliver ultra-low-latency analytics at high throughput. Originally built at LinkedIn to power analytics dashboards that serve millions of users, Pinot is optimized for serving analytical queries that need sub-second response times even at massive scale. It combines the best of columnar databases and search engines with custom indexing strategies.

Pinot supports both real-time ingestion from streaming platforms (Kafka, Kinesis) and batch ingestion from data lakes (HDFS, S3). Its architecture separates concerns into controllers (cluster management), brokers (query routing), servers (data storage and query execution), and minions (background tasks). Pinot provides a SQL interface, automatic segment management, multi-value columns, text search, geospatial queries, and pluggable index types including inverted, star-tree, range, and bloom filter indexes.

Installation

Docker Quickstart

# Quick start with Docker
docker run -d \
    --name pinot-quickstart \
    -p 9000:9000 \
    -p 8000:8000 \
    apachepinot/pinot:1.1.0 QuickStart \
    -type batch

# Access Pinot Query Console at http://localhost:9000

Docker Compose (Full Cluster)

version: '3.8'
services:
  zookeeper:
    image: zookeeper:3.9
    ports: ["2181:2181"]

  pinot-controller:
    image: apachepinot/pinot:1.1.0
    command: StartController -zkAddress zookeeper:2181
    ports: ["9000:9000"]
    depends_on: [zookeeper]

  pinot-broker:
    image: apachepinot/pinot:1.1.0
    command: StartBroker -zkAddress zookeeper:2181
    ports: ["8099:8099"]
    depends_on: [pinot-controller]

  pinot-server:
    image: apachepinot/pinot:1.1.0
    command: StartServer -zkAddress zookeeper:2181
    ports: ["8098:8098"]
    depends_on: [pinot-controller]

  pinot-minion:
    image: apachepinot/pinot:1.1.0
    command: StartMinion -zkAddress zookeeper:2181
    depends_on: [pinot-controller]

Standalone Installation

# Download Pinot
wget https://downloads.apache.org/pinot/apache-pinot-1.1.0/apache-pinot-1.1.0-bin.tar.gz
tar -xzf apache-pinot-1.1.0-bin.tar.gz
cd apache-pinot-1.1.0-bin

# Start all components
bin/pinot-admin.sh QuickStart -type batch

Schema and Table Configuration

Schema Definition

{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    {"name": "event_id", "dataType": "STRING"},
    {"name": "user_id", "dataType": "STRING"},
    {"name": "event_type", "dataType": "STRING"},
    {"name": "country", "dataType": "STRING"},
    {"name": "device", "dataType": "STRING"},
    {"name": "tags", "dataType": "STRING", "singleValueField": false}
  ],
  "metricFieldSpecs": [
    {"name": "revenue", "dataType": "DOUBLE"},
    {"name": "duration_ms", "dataType": "LONG"},
    {"name": "item_count", "dataType": "INT"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "event_time",
      "dataType": "TIMESTAMP",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ],
  "primaryKeyColumns": ["event_id"]
}

Realtime Table Config (Kafka)

{
  "tableName": "events_REALTIME",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "event_time",
    "schemaName": "events",
    "replication": "2",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "30",
    "segmentPushType": "APPEND"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": ["event_type", "country"],
    "rangeIndexColumns": ["revenue", "duration_ms"],
    "bloomFilterColumns": ["user_id"],
    "noDictionaryColumns": ["event_id"],
    "sortedColumn": ["event_time"],
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "events",
      "stream.kafka.broker.list": "kafka:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "realtime.segment.flush.threshold.rows": "100000",
      "realtime.segment.flush.threshold.time": "1h"
    }
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "metadata": {}
}

Offline Table Config

{
  "tableName": "events_OFFLINE",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "event_time",
    "schemaName": "events",
    "replication": "2",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "365"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": ["event_type", "country"],
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["country", "event_type"],
        "functionColumnPairs": ["COUNT__*", "SUM__revenue"],
        "maxLeafRecords": 10000
      }
    ]
  }
}

SQL Queries

-- Basic aggregation
SELECT
    country,
    COUNT(*) AS event_count,
    SUM(revenue) AS total_revenue,
    AVG(duration_ms) AS avg_duration
FROM events
WHERE event_time > ago('P7D')
GROUP BY country
ORDER BY total_revenue DESC
LIMIT 20;

-- Time-series query
SELECT
    DATETIMECONVERT(event_time, '1:MILLISECONDS:EPOCH', '1:MINUTES:EPOCH', '5:MINUTES') AS time_bucket,
    COUNT(*) AS requests,
    PERCENTILEEST(duration_ms, 99) AS p99_latency
FROM events
WHERE event_time > ago('PT1H')
GROUP BY time_bucket
ORDER BY time_bucket;

-- Multi-value column query
SELECT
    tags,
    COUNT(*) AS count
FROM events
WHERE tags IN ('premium', 'returning')
  AND event_time > ago('P1D')
GROUP BY tags
ORDER BY count DESC;

-- Distinct count with HyperLogLog
SELECT
    country,
    DISTINCTCOUNTHLL(user_id) AS unique_users
FROM events
WHERE event_time > ago('P30D')
GROUP BY country;

-- Text search
SELECT event_id, description
FROM support_tickets
WHERE TEXT_MATCH(description, 'payment AND (failed OR error)')
LIMIT 50;

CLI and API Commands

Command	Description
`bin/pinot-admin.sh AddTable -tableConfigFile config.json -schemaFile schema.json`	Create table with schema
`bin/pinot-admin.sh AddSchema -schemaFile schema.json`	Add schema
`bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile job.yaml`	Run batch ingestion
`bin/pinot-admin.sh PostQuery -brokerPort 8099 -query "SELECT ..."`	Run query from CLI
`bin/pinot-admin.sh ChangeTableState -tableName t -state enable`	Enable/disable table
`bin/pinot-admin.sh DeleteTable -tableName t`	Delete a table
`bin/pinot-admin.sh RebalanceTable -tableName t`	Rebalance segments

# REST API examples
# Create schema
curl -X POST http://localhost:9000/schemas -H "Content-Type: application/json" -d @schema.json

# Create table
curl -X POST http://localhost:9000/tables -H "Content-Type: application/json" -d @table-config.json

# Query via broker
curl -X POST http://localhost:8099/query/sql -H "Content-Type: application/json" \
    -d '{"sql": "SELECT COUNT(*) FROM events"}'

# Get table size
curl http://localhost:9000/tables/events/size

# List segments
curl http://localhost:9000/segments/events

Advanced Usage

Star-Tree Index

{
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["country", "event_type", "device"],
      "skipStarNodeCreationForDimensions": ["device"],
      "functionColumnPairs": [
        "COUNT__*",
        "SUM__revenue",
        "AVG__duration_ms",
        "DISTINCTCOUNTHLL__user_id"
      ],
      "maxLeafRecords": 10000
    }
  ]
}

Upsert Tables

{
  "tableName": "user_profiles_REALTIME",
  "tableType": "REALTIME",
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumns": ["updated_at"],
    "deleteRecordColumn": "is_deleted",
    "hashFunction": "MURMUR3",
    "enableSnapshot": true
  }
}

Batch Ingestion Job Spec

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://bucket/data/events/'
outputDirURI: 's3://bucket/pinot/segments/'
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: us-east-1
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'events'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'

Troubleshooting

Issue	Solution
High query latency	Add inverted/range indexes on filtered columns. Use star-tree for common aggregations
Kafka ingestion lag	Increase `realtime.segment.flush.threshold.rows`. Add more server instances
Segments not loading	Check controller logs. Verify deep storage access and server disk space
OOM on servers	Increase heap or switch from HEAP to MMAP `loadMode`. Reduce segment size
Schema evolution issues	Add new columns to schema first, then update table config. Backfill with reload
Upsert memory pressure	Enable snapshot for upsert tables. Monitor primary key cardinality
Query routing errors	Check broker health. Verify routing table in ZooKeeper
Segment download failures	Check deep storage (S3/HDFS) connectivity and permissions
Rebalance stuck	Check controller logs. Manually trigger segment reload if needed
Multi-value column performance	Limit multi-value cardinality. Avoid GROUP BY on multi-value columns