Skip to content

Trino Cheat Sheet

Overview

Trino (formerly PrestoSQL) is a distributed SQL query engine designed for fast, interactive analytic queries against data of any size. It was created at Facebook as Presto and later forked by the original creators to continue development as Trino. It can query data from multiple heterogeneous sources in a single query, including data lakes (S3/HDFS), relational databases (PostgreSQL, MySQL), NoSQL stores (Cassandra, MongoDB), and cloud data warehouses.

Trino uses a massively parallel processing (MPP) architecture with a coordinator node that parses queries and plans execution, and worker nodes that process data. It does not store data itself but instead connects to data sources through connectors. Trino excels at interactive ad-hoc analytics, data lake querying with Hive/Iceberg/Delta Lake table formats, and federated queries that join data across different systems without ETL.

Installation

Standalone Deployment

# Download Trino server
wget https://repo1.maven.org/maven2/io/trino/trino-server/440/trino-server-440.tar.gz
tar -xzf trino-server-440.tar.gz
cd trino-server-440

# Create required config directories
mkdir -p etc/catalog

# Configure node properties
cat > etc/node.properties << 'EOF'
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/trino/data
EOF

# Configure JVM
cat > etc/jvm.config << 'EOF'
-server
-Xmx16G
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true
EOF

# Start Trino
bin/launcher start

Docker Deployment

docker run -d \
    --name trino \
    -p 8080:8080 \
    -v $(pwd)/etc:/etc/trino \
    trinodb/trino:440

Trino CLI

# Download CLI
wget https://repo1.maven.org/maven2/io/trino/trino-cli/440/trino-cli-440-executable.jar
chmod +x trino-cli-440-executable.jar
mv trino-cli-440-executable.jar /usr/local/bin/trino

# Connect to Trino
trino --server http://localhost:8080 --catalog hive --schema default

Core SQL Commands

CommandDescription
SHOW CATALOGSList available catalogs
SHOW SCHEMAS FROM catalogList schemas in a catalog
SHOW TABLES FROM catalog.schemaList tables in a schema
DESCRIBE catalog.schema.tableShow table columns and types
SHOW COLUMNS FROM tableShow column details
SHOW CREATE TABLE tableShow DDL for a table
SHOW FUNCTIONSList available functions
SHOW SESSIONShow current session properties
SET SESSION property = valueSet session property
EXPLAIN queryShow query execution plan
EXPLAIN ANALYZE queryShow plan with runtime stats
USE catalog.schemaSet default catalog and schema

Query Examples

-- Cross-catalog federated query
SELECT
    o.order_id,
    c.customer_name,
    p.product_name
FROM hive.warehouse.orders o
JOIN postgresql.public.customers c ON o.customer_id = c.id
JOIN mysql.inventory.products p ON o.product_id = p.id
WHERE o.order_date >= DATE '2024-01-01';

-- Window functions
SELECT
    region,
    month,
    revenue,
    SUM(revenue) OVER (PARTITION BY region ORDER BY month) AS running_total,
    LAG(revenue) OVER (PARTITION BY region ORDER BY month) AS prev_month,
    RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rank
FROM hive.analytics.monthly_revenue;

-- Array and map operations
SELECT
    id,
    TRANSFORM(tags, t -> UPPER(t)) AS upper_tags,
    CARDINALITY(items) AS item_count,
    element_at(metadata, 'source') AS data_source
FROM hive.raw.events;

-- Approximate aggregations
SELECT
    region,
    approx_distinct(user_id) AS unique_users,
    approx_percentile(latency, 0.99) AS p99_latency
FROM hive.logs.requests
GROUP BY region;

Configuration

Coordinator Config (etc/config.properties)

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=8GB
query.max-total-memory-per-node=10GB
discovery.uri=http://coordinator:8080

Worker Config

coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=8GB
discovery.uri=http://coordinator:8080

Catalog Configurations

# etc/catalog/hive.properties (Hive/S3 data lake)
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.aws-access-key=${ENV:AWS_ACCESS_KEY_ID}
hive.s3.aws-secret-key=${ENV:AWS_SECRET_ACCESS_KEY}
hive.s3.endpoint=https://s3.amazonaws.com
hive.allow-drop-table=true
hive.non-managed-table-writes-enabled=true

# etc/catalog/iceberg.properties
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://hive-metastore:9083
iceberg.file-format=PARQUET

# etc/catalog/postgresql.properties
connector.name=postgresql
connection-url=jdbc:postgresql://host:5432/mydb
connection-user=trino_user
connection-password=trino_pass

# etc/catalog/mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://host:3306
connection-user=trino_user
connection-password=trino_pass

# etc/catalog/mongodb.properties
connector.name=mongodb
mongodb.connection-url=mongodb://host:27017/
mongodb.schema-collection=_schema

Advanced Usage

Performance Tuning

-- Session-level tuning
SET SESSION query_max_run_time = '2h';
SET SESSION join_distribution_type = 'PARTITIONED';
SET SESSION task_concurrency = 16;

-- Cost-based optimizer hints
SET SESSION join_reordering_strategy = 'AUTOMATIC';
SET SESSION enable_stats_calculator = true;

-- Analyze tables for optimizer statistics
ANALYZE hive.warehouse.orders;

-- Partition pruning (use partition columns in WHERE)
SELECT * FROM hive.warehouse.orders
WHERE order_date = DATE '2024-06-15'  -- partition column
  AND status = 'completed';

Materialized Views

-- Create materialized view (Iceberg connector)
CREATE MATERIALIZED VIEW iceberg.analytics.daily_summary AS
SELECT
    order_date,
    region,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue
FROM iceberg.warehouse.orders
GROUP BY order_date, region;

-- Refresh materialized view
REFRESH MATERIALIZED VIEW iceberg.analytics.daily_summary;

Table Functions and Dynamic Filtering

-- Query function (read raw files)
SELECT * FROM TABLE(
    hive.system.sync_partition_metadata(
        schema_name => 'warehouse',
        table_name => 'orders',
        mode => 'FULL'
    )
);

-- Prepared statements
PREPARE my_query FROM
    SELECT * FROM orders WHERE region = ? AND amount > ?;
EXECUTE my_query USING 'us-east', 100.0;

Security Configuration

# etc/config.properties (add to coordinator)
http-server.authentication.type=PASSWORD
http-server.https.enabled=true
http-server.https.port=8443
http-server.https.keystore.path=/etc/trino/keystore.jks
http-server.https.keystore.key=keystorepass

# etc/password-authenticator.properties
password-authenticator.name=ldap
ldap.url=ldaps://ldap:636
ldap.user-bind-pattern=uid=${USER},ou=people,dc=example,dc=com

Troubleshooting

IssueSolution
Query killed: exceeded memoryIncrease query.max-memory-per-node. Optimize query to reduce memory usage
Slow queries on S3Enable hive.s3.streaming and increase hive.s3.max-connections. Use Parquet format
Coordinator OOMIncrease coordinator heap. Reduce concurrent queries with query.max-queued-queries
Worker disconnectionCheck network stability. Increase discovery.uri timeout settings
Catalog connection failuresVerify JDBC URLs and credentials. Check firewall rules
Partition listing slowUse hive.metastore-cache-ttl. Switch to Iceberg for large partition counts
Join performance poorUse EXPLAIN to check join strategy. Force broadcast with session properties
Query planning timeoutSimplify query or break into CTEs. Increase query.max-planning-time
Type mismatch in federated queriesUse explicit CAST() between different catalog types
Skewed data causing stragglersUse SET SESSION scale_writers = true for write operations