Trino Cheat Sheet
Overview
Trino (formerly PrestoSQL) is a distributed SQL query engine designed for fast, interactive analytic queries against data of any size. It was created at Facebook as Presto and later forked by the original creators to continue development as Trino. It can query data from multiple heterogeneous sources in a single query, including data lakes (S3/HDFS), relational databases (PostgreSQL, MySQL), NoSQL stores (Cassandra, MongoDB), and cloud data warehouses.
Trino uses a massively parallel processing (MPP) architecture with a coordinator node that parses queries and plans execution, and worker nodes that process data. It does not store data itself but instead connects to data sources through connectors. Trino excels at interactive ad-hoc analytics, data lake querying with Hive/Iceberg/Delta Lake table formats, and federated queries that join data across different systems without ETL.
Installation
Standalone Deployment
# Download Trino server
wget https://repo1.maven.org/maven2/io/trino/trino-server/440/trino-server-440.tar.gz
tar -xzf trino-server-440.tar.gz
cd trino-server-440
# Create required config directories
mkdir -p etc/catalog
# Configure node properties
cat > etc/node.properties << 'EOF'
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/trino/data
EOF
# Configure JVM
cat > etc/jvm.config << 'EOF'
-server
-Xmx16G
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true
EOF
# Start Trino
bin/launcher start
Docker Deployment
docker run -d \
--name trino \
-p 8080:8080 \
-v $(pwd)/etc:/etc/trino \
trinodb/trino:440
Trino CLI
# Download CLI
wget https://repo1.maven.org/maven2/io/trino/trino-cli/440/trino-cli-440-executable.jar
chmod +x trino-cli-440-executable.jar
mv trino-cli-440-executable.jar /usr/local/bin/trino
# Connect to Trino
trino --server http://localhost:8080 --catalog hive --schema default
Core SQL Commands
| Command | Description |
|---|---|
SHOW CATALOGS | List available catalogs |
SHOW SCHEMAS FROM catalog | List schemas in a catalog |
SHOW TABLES FROM catalog.schema | List tables in a schema |
DESCRIBE catalog.schema.table | Show table columns and types |
SHOW COLUMNS FROM table | Show column details |
SHOW CREATE TABLE table | Show DDL for a table |
SHOW FUNCTIONS | List available functions |
SHOW SESSION | Show current session properties |
SET SESSION property = value | Set session property |
EXPLAIN query | Show query execution plan |
EXPLAIN ANALYZE query | Show plan with runtime stats |
USE catalog.schema | Set default catalog and schema |
Query Examples
-- Cross-catalog federated query
SELECT
o.order_id,
c.customer_name,
p.product_name
FROM hive.warehouse.orders o
JOIN postgresql.public.customers c ON o.customer_id = c.id
JOIN mysql.inventory.products p ON o.product_id = p.id
WHERE o.order_date >= DATE '2024-01-01';
-- Window functions
SELECT
region,
month,
revenue,
SUM(revenue) OVER (PARTITION BY region ORDER BY month) AS running_total,
LAG(revenue) OVER (PARTITION BY region ORDER BY month) AS prev_month,
RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rank
FROM hive.analytics.monthly_revenue;
-- Array and map operations
SELECT
id,
TRANSFORM(tags, t -> UPPER(t)) AS upper_tags,
CARDINALITY(items) AS item_count,
element_at(metadata, 'source') AS data_source
FROM hive.raw.events;
-- Approximate aggregations
SELECT
region,
approx_distinct(user_id) AS unique_users,
approx_percentile(latency, 0.99) AS p99_latency
FROM hive.logs.requests
GROUP BY region;
Configuration
Coordinator Config (etc/config.properties)
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=8GB
query.max-total-memory-per-node=10GB
discovery.uri=http://coordinator:8080
Worker Config
coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=8GB
discovery.uri=http://coordinator:8080
Catalog Configurations
# etc/catalog/hive.properties (Hive/S3 data lake)
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.aws-access-key=${ENV:AWS_ACCESS_KEY_ID}
hive.s3.aws-secret-key=${ENV:AWS_SECRET_ACCESS_KEY}
hive.s3.endpoint=https://s3.amazonaws.com
hive.allow-drop-table=true
hive.non-managed-table-writes-enabled=true
# etc/catalog/iceberg.properties
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://hive-metastore:9083
iceberg.file-format=PARQUET
# etc/catalog/postgresql.properties
connector.name=postgresql
connection-url=jdbc:postgresql://host:5432/mydb
connection-user=trino_user
connection-password=trino_pass
# etc/catalog/mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://host:3306
connection-user=trino_user
connection-password=trino_pass
# etc/catalog/mongodb.properties
connector.name=mongodb
mongodb.connection-url=mongodb://host:27017/
mongodb.schema-collection=_schema
Advanced Usage
Performance Tuning
-- Session-level tuning
SET SESSION query_max_run_time = '2h';
SET SESSION join_distribution_type = 'PARTITIONED';
SET SESSION task_concurrency = 16;
-- Cost-based optimizer hints
SET SESSION join_reordering_strategy = 'AUTOMATIC';
SET SESSION enable_stats_calculator = true;
-- Analyze tables for optimizer statistics
ANALYZE hive.warehouse.orders;
-- Partition pruning (use partition columns in WHERE)
SELECT * FROM hive.warehouse.orders
WHERE order_date = DATE '2024-06-15' -- partition column
AND status = 'completed';
Materialized Views
-- Create materialized view (Iceberg connector)
CREATE MATERIALIZED VIEW iceberg.analytics.daily_summary AS
SELECT
order_date,
region,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue
FROM iceberg.warehouse.orders
GROUP BY order_date, region;
-- Refresh materialized view
REFRESH MATERIALIZED VIEW iceberg.analytics.daily_summary;
Table Functions and Dynamic Filtering
-- Query function (read raw files)
SELECT * FROM TABLE(
hive.system.sync_partition_metadata(
schema_name => 'warehouse',
table_name => 'orders',
mode => 'FULL'
)
);
-- Prepared statements
PREPARE my_query FROM
SELECT * FROM orders WHERE region = ? AND amount > ?;
EXECUTE my_query USING 'us-east', 100.0;
Security Configuration
# etc/config.properties (add to coordinator)
http-server.authentication.type=PASSWORD
http-server.https.enabled=true
http-server.https.port=8443
http-server.https.keystore.path=/etc/trino/keystore.jks
http-server.https.keystore.key=keystorepass
# etc/password-authenticator.properties
password-authenticator.name=ldap
ldap.url=ldaps://ldap:636
ldap.user-bind-pattern=uid=${USER},ou=people,dc=example,dc=com
Troubleshooting
| Issue | Solution |
|---|---|
| Query killed: exceeded memory | Increase query.max-memory-per-node. Optimize query to reduce memory usage |
| Slow queries on S3 | Enable hive.s3.streaming and increase hive.s3.max-connections. Use Parquet format |
| Coordinator OOM | Increase coordinator heap. Reduce concurrent queries with query.max-queued-queries |
| Worker disconnection | Check network stability. Increase discovery.uri timeout settings |
| Catalog connection failures | Verify JDBC URLs and credentials. Check firewall rules |
| Partition listing slow | Use hive.metastore-cache-ttl. Switch to Iceberg for large partition counts |
| Join performance poor | Use EXPLAIN to check join strategy. Force broadcast with session properties |
| Query planning timeout | Simplify query or break into CTEs. Increase query.max-planning-time |
| Type mismatch in federated queries | Use explicit CAST() between different catalog types |
| Skewed data causing stragglers | Use SET SESSION scale_writers = true for write operations |