Trino Cheat Sheet

Overview

Trino (formerly PrestoSQL) is a distributed SQL query engine designed for fast, interactive analytic queries against data of any size. It was created at Facebook as Presto and later forked by the original creators to continue development as Trino. It can query data from multiple heterogeneous sources in a single query, including data lakes (S3/HDFS), relational databases (PostgreSQL, MySQL), NoSQL stores (Cassandra, MongoDB), and cloud data warehouses.

Trino uses a massively parallel processing (MPP) architecture with a coordinator node that parses queries and plans execution, and worker nodes that process data. It does not store data itself but instead connects to data sources through connectors. Trino excels at interactive ad-hoc analytics, data lake querying with Hive/Iceberg/Delta Lake table formats, and federated queries that join data across different systems without ETL.

Installation

Standalone Deployment

# Download Trino server
wget https://repo1.maven.org/maven2/io/trino/trino-server/440/trino-server-440.tar.gz
tar -xzf trino-server-440.tar.gz
cd trino-server-440

# Create required config directories
mkdir -p etc/catalog

# Configure node properties
cat > etc/node.properties << 'EOF'
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/trino/data
EOF

# Configure JVM
cat > etc/jvm.config << 'EOF'
-server
-Xmx16G
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true
EOF

# Start Trino
bin/launcher start

Docker Deployment

docker run -d \
    --name trino \
    -p 8080:8080 \
    -v $(pwd)/etc:/etc/trino \
    trinodb/trino:440

Trino CLI

# Download CLI
wget https://repo1.maven.org/maven2/io/trino/trino-cli/440/trino-cli-440-executable.jar
chmod +x trino-cli-440-executable.jar
mv trino-cli-440-executable.jar /usr/local/bin/trino

# Connect to Trino
trino --server http://localhost:8080 --catalog hive --schema default

Core SQL Commands

Command	Description
`SHOW CATALOGS`	List available catalogs
`SHOW SCHEMAS FROM catalog`	List schemas in a catalog
`SHOW TABLES FROM catalog.schema`	List tables in a schema
`DESCRIBE catalog.schema.table`	Show table columns and types
`SHOW COLUMNS FROM table`	Show column details
`SHOW CREATE TABLE table`	Show DDL for a table
`SHOW FUNCTIONS`	List available functions
`SHOW SESSION`	Show current session properties
`SET SESSION property = value`	Set session property
`EXPLAIN query`	Show query execution plan
`EXPLAIN ANALYZE query`	Show plan with runtime stats
`USE catalog.schema`	Set default catalog and schema

Query Examples

-- Cross-catalog federated query
SELECT
    o.order_id,
    c.customer_name,
    p.product_name
FROM hive.warehouse.orders o
JOIN postgresql.public.customers c ON o.customer_id = c.id
JOIN mysql.inventory.products p ON o.product_id = p.id
WHERE o.order_date >= DATE '2024-01-01';

-- Window functions
SELECT
    region,
    month,
    revenue,
    SUM(revenue) OVER (PARTITION BY region ORDER BY month) AS running_total,
    LAG(revenue) OVER (PARTITION BY region ORDER BY month) AS prev_month,
    RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rank
FROM hive.analytics.monthly_revenue;

-- Array and map operations
SELECT
    id,
    TRANSFORM(tags, t -> UPPER(t)) AS upper_tags,
    CARDINALITY(items) AS item_count,
    element_at(metadata, 'source') AS data_source
FROM hive.raw.events;

-- Approximate aggregations
SELECT
    region,
    approx_distinct(user_id) AS unique_users,
    approx_percentile(latency, 0.99) AS p99_latency
FROM hive.logs.requests
GROUP BY region;

Configuration

Coordinator Config (etc/config.properties)

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=8GB
query.max-total-memory-per-node=10GB
discovery.uri=http://coordinator:8080

Worker Config

coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=8GB
discovery.uri=http://coordinator:8080

Catalog Configurations

# etc/catalog/hive.properties (Hive/S3 data lake)
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.aws-access-key=${ENV:AWS_ACCESS_KEY_ID}
hive.s3.aws-secret-key=${ENV:AWS_SECRET_ACCESS_KEY}
hive.s3.endpoint=https://s3.amazonaws.com
hive.allow-drop-table=true
hive.non-managed-table-writes-enabled=true

# etc/catalog/iceberg.properties
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://hive-metastore:9083
iceberg.file-format=PARQUET

# etc/catalog/postgresql.properties
connector.name=postgresql
connection-url=jdbc:postgresql://host:5432/mydb
connection-user=trino_user
connection-password=trino_pass

# etc/catalog/mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://host:3306
connection-user=trino_user
connection-password=trino_pass

# etc/catalog/mongodb.properties
connector.name=mongodb
mongodb.connection-url=mongodb://host:27017/
mongodb.schema-collection=_schema

Advanced Usage

Performance Tuning

-- Session-level tuning
SET SESSION query_max_run_time = '2h';
SET SESSION join_distribution_type = 'PARTITIONED';
SET SESSION task_concurrency = 16;

-- Cost-based optimizer hints
SET SESSION join_reordering_strategy = 'AUTOMATIC';
SET SESSION enable_stats_calculator = true;

-- Analyze tables for optimizer statistics
ANALYZE hive.warehouse.orders;

-- Partition pruning (use partition columns in WHERE)
SELECT * FROM hive.warehouse.orders
WHERE order_date = DATE '2024-06-15'  -- partition column
  AND status = 'completed';

Materialized Views

-- Create materialized view (Iceberg connector)
CREATE MATERIALIZED VIEW iceberg.analytics.daily_summary AS
SELECT
    order_date,
    region,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue
FROM iceberg.warehouse.orders
GROUP BY order_date, region;

-- Refresh materialized view
REFRESH MATERIALIZED VIEW iceberg.analytics.daily_summary;

Table Functions and Dynamic Filtering

-- Query function (read raw files)
SELECT * FROM TABLE(
    hive.system.sync_partition_metadata(
        schema_name => 'warehouse',
        table_name => 'orders',
        mode => 'FULL'
    )
);

-- Prepared statements
PREPARE my_query FROM
    SELECT * FROM orders WHERE region = ? AND amount > ?;
EXECUTE my_query USING 'us-east', 100.0;

Security Configuration

# etc/config.properties (add to coordinator)
http-server.authentication.type=PASSWORD
http-server.https.enabled=true
http-server.https.port=8443
http-server.https.keystore.path=/etc/trino/keystore.jks
http-server.https.keystore.key=keystorepass

# etc/password-authenticator.properties
password-authenticator.name=ldap
ldap.url=ldaps://ldap:636
ldap.user-bind-pattern=uid=${USER},ou=people,dc=example,dc=com

Troubleshooting

Issue	Solution
Query killed: exceeded memory	Increase `query.max-memory-per-node`. Optimize query to reduce memory usage
Slow queries on S3	Enable `hive.s3.streaming` and increase `hive.s3.max-connections`. Use Parquet format
Coordinator OOM	Increase coordinator heap. Reduce concurrent queries with `query.max-queued-queries`
Worker disconnection	Check network stability. Increase `discovery.uri` timeout settings
Catalog connection failures	Verify JDBC URLs and credentials. Check firewall rules
Partition listing slow	Use `hive.metastore-cache-ttl`. Switch to Iceberg for large partition counts
Join performance poor	Use `EXPLAIN` to check join strategy. Force broadcast with session properties
Query planning timeout	Simplify query or break into CTEs. Increase `query.max-planning-time`
Type mismatch in federated queries	Use explicit CAST() between different catalog types
Skewed data causing stragglers	Use `SET SESSION scale_writers = true` for write operations