Apache Pinot Cheat Sheet
Overview
Apache Pinot is a real-time distributed OLAP datastore designed to deliver ultra-low-latency analytics at high throughput. Originally built at LinkedIn to power analytics dashboards that serve millions of users, Pinot is optimized for serving analytical queries that need sub-second response times even at massive scale. It combines the best of columnar databases and search engines with custom indexing strategies.
Pinot supports both real-time ingestion from streaming platforms (Kafka, Kinesis) and batch ingestion from data lakes (HDFS, S3). Its architecture separates concerns into controllers (cluster management), brokers (query routing), servers (data storage and query execution), and minions (background tasks). Pinot provides a SQL interface, automatic segment management, multi-value columns, text search, geospatial queries, and pluggable index types including inverted, star-tree, range, and bloom filter indexes.
Installation
Docker Quickstart
# Quick start with Docker
docker run -d \
--name pinot-quickstart \
-p 9000:9000 \
-p 8000:8000 \
apachepinot/pinot:1.1.0 QuickStart \
-type batch
# Access Pinot Query Console at http://localhost:9000
Docker Compose (Full Cluster)
version: '3.8'
services:
zookeeper:
image: zookeeper:3.9
ports: ["2181:2181"]
pinot-controller:
image: apachepinot/pinot:1.1.0
command: StartController -zkAddress zookeeper:2181
ports: ["9000:9000"]
depends_on: [zookeeper]
pinot-broker:
image: apachepinot/pinot:1.1.0
command: StartBroker -zkAddress zookeeper:2181
ports: ["8099:8099"]
depends_on: [pinot-controller]
pinot-server:
image: apachepinot/pinot:1.1.0
command: StartServer -zkAddress zookeeper:2181
ports: ["8098:8098"]
depends_on: [pinot-controller]
pinot-minion:
image: apachepinot/pinot:1.1.0
command: StartMinion -zkAddress zookeeper:2181
depends_on: [pinot-controller]
Standalone Installation
# Download Pinot
wget https://downloads.apache.org/pinot/apache-pinot-1.1.0/apache-pinot-1.1.0-bin.tar.gz
tar -xzf apache-pinot-1.1.0-bin.tar.gz
cd apache-pinot-1.1.0-bin
# Start all components
bin/pinot-admin.sh QuickStart -type batch
Schema and Table Configuration
Schema Definition
{
"schemaName": "events",
"dimensionFieldSpecs": [
{"name": "event_id", "dataType": "STRING"},
{"name": "user_id", "dataType": "STRING"},
{"name": "event_type", "dataType": "STRING"},
{"name": "country", "dataType": "STRING"},
{"name": "device", "dataType": "STRING"},
{"name": "tags", "dataType": "STRING", "singleValueField": false}
],
"metricFieldSpecs": [
{"name": "revenue", "dataType": "DOUBLE"},
{"name": "duration_ms", "dataType": "LONG"},
{"name": "item_count", "dataType": "INT"}
],
"dateTimeFieldSpecs": [
{
"name": "event_time",
"dataType": "TIMESTAMP",
"format": "1:MILLISECONDS:EPOCH",
"granularity": "1:MILLISECONDS"
}
],
"primaryKeyColumns": ["event_id"]
}
Realtime Table Config (Kafka)
{
"tableName": "events_REALTIME",
"tableType": "REALTIME",
"segmentsConfig": {
"timeColumnName": "event_time",
"schemaName": "events",
"replication": "2",
"retentionTimeUnit": "DAYS",
"retentionTimeValue": "30",
"segmentPushType": "APPEND"
},
"tableIndexConfig": {
"loadMode": "MMAP",
"invertedIndexColumns": ["event_type", "country"],
"rangeIndexColumns": ["revenue", "duration_ms"],
"bloomFilterColumns": ["user_id"],
"noDictionaryColumns": ["event_id"],
"sortedColumn": ["event_time"],
"streamConfigs": {
"streamType": "kafka",
"stream.kafka.topic.name": "events",
"stream.kafka.broker.list": "kafka:9092",
"stream.kafka.consumer.type": "lowlevel",
"stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
"stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
"realtime.segment.flush.threshold.rows": "100000",
"realtime.segment.flush.threshold.time": "1h"
}
},
"tenants": {
"broker": "DefaultTenant",
"server": "DefaultTenant"
},
"metadata": {}
}
Offline Table Config
{
"tableName": "events_OFFLINE",
"tableType": "OFFLINE",
"segmentsConfig": {
"timeColumnName": "event_time",
"schemaName": "events",
"replication": "2",
"retentionTimeUnit": "DAYS",
"retentionTimeValue": "365"
},
"tableIndexConfig": {
"loadMode": "MMAP",
"invertedIndexColumns": ["event_type", "country"],
"starTreeIndexConfigs": [
{
"dimensionsSplitOrder": ["country", "event_type"],
"functionColumnPairs": ["COUNT__*", "SUM__revenue"],
"maxLeafRecords": 10000
}
]
}
}
SQL Queries
-- Basic aggregation
SELECT
country,
COUNT(*) AS event_count,
SUM(revenue) AS total_revenue,
AVG(duration_ms) AS avg_duration
FROM events
WHERE event_time > ago('P7D')
GROUP BY country
ORDER BY total_revenue DESC
LIMIT 20;
-- Time-series query
SELECT
DATETIMECONVERT(event_time, '1:MILLISECONDS:EPOCH', '1:MINUTES:EPOCH', '5:MINUTES') AS time_bucket,
COUNT(*) AS requests,
PERCENTILEEST(duration_ms, 99) AS p99_latency
FROM events
WHERE event_time > ago('PT1H')
GROUP BY time_bucket
ORDER BY time_bucket;
-- Multi-value column query
SELECT
tags,
COUNT(*) AS count
FROM events
WHERE tags IN ('premium', 'returning')
AND event_time > ago('P1D')
GROUP BY tags
ORDER BY count DESC;
-- Distinct count with HyperLogLog
SELECT
country,
DISTINCTCOUNTHLL(user_id) AS unique_users
FROM events
WHERE event_time > ago('P30D')
GROUP BY country;
-- Text search
SELECT event_id, description
FROM support_tickets
WHERE TEXT_MATCH(description, 'payment AND (failed OR error)')
LIMIT 50;
CLI and API Commands
| Command | Description |
|---|---|
bin/pinot-admin.sh AddTable -tableConfigFile config.json -schemaFile schema.json | Create table with schema |
bin/pinot-admin.sh AddSchema -schemaFile schema.json | Add schema |
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile job.yaml | Run batch ingestion |
bin/pinot-admin.sh PostQuery -brokerPort 8099 -query "SELECT ..." | Run query from CLI |
bin/pinot-admin.sh ChangeTableState -tableName t -state enable | Enable/disable table |
bin/pinot-admin.sh DeleteTable -tableName t | Delete a table |
bin/pinot-admin.sh RebalanceTable -tableName t | Rebalance segments |
# REST API examples
# Create schema
curl -X POST http://localhost:9000/schemas -H "Content-Type: application/json" -d @schema.json
# Create table
curl -X POST http://localhost:9000/tables -H "Content-Type: application/json" -d @table-config.json
# Query via broker
curl -X POST http://localhost:8099/query/sql -H "Content-Type: application/json" \
-d '{"sql": "SELECT COUNT(*) FROM events"}'
# Get table size
curl http://localhost:9000/tables/events/size
# List segments
curl http://localhost:9000/segments/events
Advanced Usage
Star-Tree Index
{
"starTreeIndexConfigs": [
{
"dimensionsSplitOrder": ["country", "event_type", "device"],
"skipStarNodeCreationForDimensions": ["device"],
"functionColumnPairs": [
"COUNT__*",
"SUM__revenue",
"AVG__duration_ms",
"DISTINCTCOUNTHLL__user_id"
],
"maxLeafRecords": 10000
}
]
}
Upsert Tables
{
"tableName": "user_profiles_REALTIME",
"tableType": "REALTIME",
"upsertConfig": {
"mode": "FULL",
"comparisonColumns": ["updated_at"],
"deleteRecordColumn": "is_deleted",
"hashFunction": "MURMUR3",
"enableSnapshot": true
}
}
Batch Ingestion Job Spec
executionFrameworkSpec:
name: 'standalone'
segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://bucket/data/events/'
outputDirURI: 's3://bucket/pinot/segments/'
pinotFSSpecs:
- scheme: s3
className: org.apache.pinot.plugin.filesystem.S3PinotFS
configs:
region: us-east-1
recordReaderSpec:
dataFormat: 'parquet'
className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
tableName: 'events'
pinotClusterSpecs:
- controllerURI: 'http://pinot-controller:9000'
Troubleshooting
| Issue | Solution |
|---|---|
| High query latency | Add inverted/range indexes on filtered columns. Use star-tree for common aggregations |
| Kafka ingestion lag | Increase realtime.segment.flush.threshold.rows. Add more server instances |
| Segments not loading | Check controller logs. Verify deep storage access and server disk space |
| OOM on servers | Increase heap or switch from HEAP to MMAP loadMode. Reduce segment size |
| Schema evolution issues | Add new columns to schema first, then update table config. Backfill with reload |
| Upsert memory pressure | Enable snapshot for upsert tables. Monitor primary key cardinality |
| Query routing errors | Check broker health. Verify routing table in ZooKeeper |
| Segment download failures | Check deep storage (S3/HDFS) connectivity and permissions |
| Rebalance stuck | Check controller logs. Manually trigger segment reload if needed |
| Multi-value column performance | Limit multi-value cardinality. Avoid GROUP BY on multi-value columns |