تخطَّ إلى المحتوى

Apache Pinot Cheat Sheet

Overview

Apache Pinot is a real-time distributed OLAP datastore designed to deliver ultra-low-latency analytics at high throughput. Originally built at LinkedIn to power analytics dashboards that serve millions of users, Pinot is optimized for serving analytical queries that need sub-second response times even at massive scale. It combines the best of columnar databases and search engines with custom indexing strategies.

Pinot supports both real-time ingestion from streaming platforms (Kafka, Kinesis) and batch ingestion from data lakes (HDFS, S3). Its architecture separates concerns into controllers (cluster management), brokers (query routing), servers (data storage and query execution), and minions (background tasks). Pinot provides a SQL interface, automatic segment management, multi-value columns, text search, geospatial queries, and pluggable index types including inverted, star-tree, range, and bloom filter indexes.

Installation

Docker Quickstart

# Quick start with Docker
docker run -d \
    --name pinot-quickstart \
    -p 9000:9000 \
    -p 8000:8000 \
    apachepinot/pinot:1.1.0 QuickStart \
    -type batch

# Access Pinot Query Console at http://localhost:9000

Docker Compose (Full Cluster)

version: '3.8'
services:
  zookeeper:
    image: zookeeper:3.9
    ports: ["2181:2181"]

  pinot-controller:
    image: apachepinot/pinot:1.1.0
    command: StartController -zkAddress zookeeper:2181
    ports: ["9000:9000"]
    depends_on: [zookeeper]

  pinot-broker:
    image: apachepinot/pinot:1.1.0
    command: StartBroker -zkAddress zookeeper:2181
    ports: ["8099:8099"]
    depends_on: [pinot-controller]

  pinot-server:
    image: apachepinot/pinot:1.1.0
    command: StartServer -zkAddress zookeeper:2181
    ports: ["8098:8098"]
    depends_on: [pinot-controller]

  pinot-minion:
    image: apachepinot/pinot:1.1.0
    command: StartMinion -zkAddress zookeeper:2181
    depends_on: [pinot-controller]

Standalone Installation

# Download Pinot
wget https://downloads.apache.org/pinot/apache-pinot-1.1.0/apache-pinot-1.1.0-bin.tar.gz
tar -xzf apache-pinot-1.1.0-bin.tar.gz
cd apache-pinot-1.1.0-bin

# Start all components
bin/pinot-admin.sh QuickStart -type batch

Schema and Table Configuration

Schema Definition

{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    {"name": "event_id", "dataType": "STRING"},
    {"name": "user_id", "dataType": "STRING"},
    {"name": "event_type", "dataType": "STRING"},
    {"name": "country", "dataType": "STRING"},
    {"name": "device", "dataType": "STRING"},
    {"name": "tags", "dataType": "STRING", "singleValueField": false}
  ],
  "metricFieldSpecs": [
    {"name": "revenue", "dataType": "DOUBLE"},
    {"name": "duration_ms", "dataType": "LONG"},
    {"name": "item_count", "dataType": "INT"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "event_time",
      "dataType": "TIMESTAMP",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ],
  "primaryKeyColumns": ["event_id"]
}

Realtime Table Config (Kafka)

{
  "tableName": "events_REALTIME",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "event_time",
    "schemaName": "events",
    "replication": "2",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "30",
    "segmentPushType": "APPEND"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": ["event_type", "country"],
    "rangeIndexColumns": ["revenue", "duration_ms"],
    "bloomFilterColumns": ["user_id"],
    "noDictionaryColumns": ["event_id"],
    "sortedColumn": ["event_time"],
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "events",
      "stream.kafka.broker.list": "kafka:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "realtime.segment.flush.threshold.rows": "100000",
      "realtime.segment.flush.threshold.time": "1h"
    }
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "metadata": {}
}

Offline Table Config

{
  "tableName": "events_OFFLINE",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "event_time",
    "schemaName": "events",
    "replication": "2",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "365"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": ["event_type", "country"],
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["country", "event_type"],
        "functionColumnPairs": ["COUNT__*", "SUM__revenue"],
        "maxLeafRecords": 10000
      }
    ]
  }
}

SQL Queries

-- Basic aggregation
SELECT
    country,
    COUNT(*) AS event_count,
    SUM(revenue) AS total_revenue,
    AVG(duration_ms) AS avg_duration
FROM events
WHERE event_time > ago('P7D')
GROUP BY country
ORDER BY total_revenue DESC
LIMIT 20;

-- Time-series query
SELECT
    DATETIMECONVERT(event_time, '1:MILLISECONDS:EPOCH', '1:MINUTES:EPOCH', '5:MINUTES') AS time_bucket,
    COUNT(*) AS requests,
    PERCENTILEEST(duration_ms, 99) AS p99_latency
FROM events
WHERE event_time > ago('PT1H')
GROUP BY time_bucket
ORDER BY time_bucket;

-- Multi-value column query
SELECT
    tags,
    COUNT(*) AS count
FROM events
WHERE tags IN ('premium', 'returning')
  AND event_time > ago('P1D')
GROUP BY tags
ORDER BY count DESC;

-- Distinct count with HyperLogLog
SELECT
    country,
    DISTINCTCOUNTHLL(user_id) AS unique_users
FROM events
WHERE event_time > ago('P30D')
GROUP BY country;

-- Text search
SELECT event_id, description
FROM support_tickets
WHERE TEXT_MATCH(description, 'payment AND (failed OR error)')
LIMIT 50;

CLI and API Commands

CommandDescription
bin/pinot-admin.sh AddTable -tableConfigFile config.json -schemaFile schema.jsonCreate table with schema
bin/pinot-admin.sh AddSchema -schemaFile schema.jsonAdd schema
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile job.yamlRun batch ingestion
bin/pinot-admin.sh PostQuery -brokerPort 8099 -query "SELECT ..."Run query from CLI
bin/pinot-admin.sh ChangeTableState -tableName t -state enableEnable/disable table
bin/pinot-admin.sh DeleteTable -tableName tDelete a table
bin/pinot-admin.sh RebalanceTable -tableName tRebalance segments
# REST API examples
# Create schema
curl -X POST http://localhost:9000/schemas -H "Content-Type: application/json" -d @schema.json

# Create table
curl -X POST http://localhost:9000/tables -H "Content-Type: application/json" -d @table-config.json

# Query via broker
curl -X POST http://localhost:8099/query/sql -H "Content-Type: application/json" \
    -d '{"sql": "SELECT COUNT(*) FROM events"}'

# Get table size
curl http://localhost:9000/tables/events/size

# List segments
curl http://localhost:9000/segments/events

Advanced Usage

Star-Tree Index

{
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["country", "event_type", "device"],
      "skipStarNodeCreationForDimensions": ["device"],
      "functionColumnPairs": [
        "COUNT__*",
        "SUM__revenue",
        "AVG__duration_ms",
        "DISTINCTCOUNTHLL__user_id"
      ],
      "maxLeafRecords": 10000
    }
  ]
}

Upsert Tables

{
  "tableName": "user_profiles_REALTIME",
  "tableType": "REALTIME",
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumns": ["updated_at"],
    "deleteRecordColumn": "is_deleted",
    "hashFunction": "MURMUR3",
    "enableSnapshot": true
  }
}

Batch Ingestion Job Spec

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://bucket/data/events/'
outputDirURI: 's3://bucket/pinot/segments/'
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: us-east-1
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'events'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'

Troubleshooting

IssueSolution
High query latencyAdd inverted/range indexes on filtered columns. Use star-tree for common aggregations
Kafka ingestion lagIncrease realtime.segment.flush.threshold.rows. Add more server instances
Segments not loadingCheck controller logs. Verify deep storage access and server disk space
OOM on serversIncrease heap or switch from HEAP to MMAP loadMode. Reduce segment size
Schema evolution issuesAdd new columns to schema first, then update table config. Backfill with reload
Upsert memory pressureEnable snapshot for upsert tables. Monitor primary key cardinality
Query routing errorsCheck broker health. Verify routing table in ZooKeeper
Segment download failuresCheck deep storage (S3/HDFS) connectivity and permissions
Rebalance stuckCheck controller logs. Manually trigger segment reload if needed
Multi-value column performanceLimit multi-value cardinality. Avoid GROUP BY on multi-value columns