Meltano Cheat Sheet

Overview

Meltano is an open-source platform for the full data lifecycle: extract, load, transform, and orchestrate. Originally built by GitLab's data team and now maintained by the Meltano community, it uses the Singer protocol for data integration (taps extract, targets load) and integrates with dbt for transformations and Airflow for orchestration. A Meltano project is defined declaratively in a YAML file (meltano.yml) and driven from the command line.

Meltano has a plugin-based architecture with access to hundreds of Singer taps and targets from the MeltanoHub registry. It manages each plugin's virtual environment, configuration, and secrets, and it executes pipelines. A project is a directory of plain files, so it can be version-controlled and data infrastructure can be managed as code. Meltano supports environments (dev, staging, prod) and scheduled pipelines, and it can be deployed on Docker, Kubernetes, or cloud platforms.
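
The tap/target split can be pictured as a stream of JSON lines. The sketch below is a toy tap in Python, not a real Singer tap: the stream name orders, its fields, and the helper names emit and run_toy_tap are made up for illustration. It writes the three core Singer message types (SCHEMA, RECORD, STATE) that a target would read from standard input.

```python
import json
import sys


def emit(message, out=sys.stdout):
    """Write one Singer message as a single JSON line."""
    out.write(json.dumps(message) + "\n")


def run_toy_tap(rows, out=sys.stdout):
    """Emit a SCHEMA message, one RECORD per row, then a STATE bookmark."""
    emit(
        {
            "type": "SCHEMA",
            "stream": "orders",  # hypothetical stream name
            "schema": {
                "type": "object",
                "properties": {
                    "id": {"type": "integer"},
                    "updated_at": {"type": "string"},
                },
            },
            "key_properties": ["id"],
        },
        out,
    )
    for row in rows:
        emit({"type": "RECORD", "stream": "orders", "record": row}, out)
    if rows:
        # The bookmark layout is illustrative; each tap chooses its own state shape.
        latest = max(row["updated_at"] for row in rows)
        emit(
            {
                "type": "STATE",
                "value": {"bookmarks": {"orders": {"replication_key_value": latest}}},
            },
            out,
        )


if __name__ == "__main__":
    run_toy_tap(
        [
            {"id": 1, "updated_at": "2024-06-01"},
            {"id": 2, "updated_at": "2024-06-02"},
        ]
    )
```

A target consumes these lines in order and, after loading the records, passes the last STATE value back so that the next run can resume from the bookmark.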

Installation

# Install Meltano
pip install meltano

# Or with pipx (recommended for isolation)
pipx install meltano

# Initialize a new project
meltano init my_data_project
cd my_data_project

# Verify installation
meltano --version

# Start the web UI (optional; Meltano 2.x and earlier only, the UI was removed in 3.0)
meltano ui

Docker Installation

# Use official Docker image (quote the path so directories with spaces work)
docker run -v "$(pwd)":/project -w /project meltano/meltano init my_project

# Run with Docker Compose
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
  meltano:
    image: meltano/meltano:latest
    volumes:
      - .:/project
    working_dir: /project
    ports:
      - "5000:5000"
    environment:
      - MELTANO_DATABASE_URI=postgresql://meltano:meltano@db:5432/meltano
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: meltano
      POSTGRES_PASSWORD: meltano
      POSTGRES_DB: meltano
EOF

docker compose up -d

Project Structure

my_data_project/
├── meltano.yml              # Main configuration file
├── .env                     # Environment variables (secrets)
├── requirements.txt         # Python dependencies
├── analyze/                 # Analysis files
├── extract/                 # Custom extractor configs
├── load/                    # Custom loader configs
├── transform/               # dbt project
│   ├── dbt_project.yml
│   ├── models/
│   └── macros/
├── orchestrate/             # Airflow DAGs
│   └── dags/
├── plugins/                 # Plugin lock files
├── output/                  # Pipeline output
└── .meltano/                # Internal state (gitignored)

CLI Commands

Command                                                     Description
meltano init <name>                                         Create a new project
meltano add extractor <tap>                                 Add a Singer tap
meltano add loader <target>                                 Add a Singer target
meltano add transformer dbt-postgres                        Add a dbt transformer (adapter-specific plugin)
meltano add orchestrator airflow                            Add the Airflow orchestrator
meltano add utility <name>                                  Add a utility plugin
meltano install                                             Install all plugins
meltano install extractor <tap>                             Install a specific extractor
meltano config <plugin> set <key> <value>                   Set a plugin setting
meltano config <plugin> list                                List plugin settings
meltano config <plugin> test                                Test plugin configuration
meltano select <tap> <entity> <attr>                        Select entities/attributes to extract
meltano select <tap> --list                                 List selected entities
meltano run <tap> <target>                                  Run an EL pipeline
meltano run <tap> <target> dbt-postgres:run                 Run an ELT pipeline
meltano elt <tap> <target> --transform=run                  Legacy ELT command
meltano invoke <plugin> <args>                              Run a plugin command directly
meltano schedule add <name> --job <job> --interval <cron>   Add a schedule for a job
meltano schedule list                                       List schedules
meltano job add <name> --tasks "<tap> <target>"             Define a job
meltano job list                                            List jobs
meltano environment add <name>                              Add an environment
meltano test                                                Run plugin-defined tests
meltano discover extractors                                 Browse available extractors
meltano discover loaders                                    Browse available loaders
meltano lock --update --all                                 Update all plugin lock files

Configuration

meltano.yml

version: 1
project_id: my-data-project
default_environment: dev

environments:
  - name: dev
  - name: staging
  - name: prod

plugins:
  extractors:
    - name: tap-postgres
      variant: meltanolabs
      pip_url: meltanolabs-tap-postgres
      config:
        host: $PG_HOST
        port: 5432
        database: $PG_DATABASE
        user: $PG_USER
        password: $PG_PASSWORD
        filter_schemas:
          - public
          - analytics
        default_replication_method: INCREMENTAL
      select:
        - public-orders.*
        - public-customers.*
        - public-products.*
        - analytics-events.*

    - name: tap-github
      variant: meltanolabs
      pip_url: meltanolabs-tap-github
      config:
        auth_token: $GITHUB_TOKEN
        repositories:
          - org/repo-1
          - org/repo-2
        start_date: "2024-01-01T00:00:00Z"

    - name: tap-salesforce
      variant: meltanolabs
      pip_url: git+https://github.com/MeltanoLabs/tap-salesforce.git
      config:
        client_id: $SALESFORCE_CLIENT_ID
        client_secret: $SALESFORCE_CLIENT_SECRET
        refresh_token: $SALESFORCE_REFRESH_TOKEN
        api_type: BULK

  loaders:
    - name: target-snowflake
      variant: meltanolabs
      pip_url: meltanolabs-target-snowflake
      config:
        account: $SNOWFLAKE_ACCOUNT
        user: $SNOWFLAKE_USER
        password: $SNOWFLAKE_PASSWORD
        database: RAW
        warehouse: LOADING_WH
        role: LOADER
        default_target_schema: $MELTANO_EXTRACT__LOAD_SCHEMA

    - name: target-postgres
      variant: meltanolabs
      pip_url: meltanolabs-target-postgres
      config:
        host: $TARGET_PG_HOST
        port: 5432
        database: warehouse
        user: $TARGET_PG_USER
        password: $TARGET_PG_PASSWORD
        default_target_schema: raw

  transformers:
    - name: dbt-snowflake
      variant: dbt-labs
      pip_url: dbt-core~=1.7.0 dbt-snowflake~=1.7.0
      config:
        account: $SNOWFLAKE_ACCOUNT
        user: $SNOWFLAKE_USER
        password: $SNOWFLAKE_PASSWORD
        database: ANALYTICS
        warehouse: TRANSFORM_WH
        role: TRANSFORMER
        schema: PROD

  utilities:
    - name: great_expectations
      variant: great-expectations
      pip_url: great_expectations

schedules:
  - name: daily-postgres-sync
    job: postgres-to-snowflake
    interval: "@daily"

  - name: hourly-github-sync
    job: github-to-snowflake
    interval: "0 * * * *"

jobs:
  - name: postgres-to-snowflake
    tasks:
      - tap-postgres target-snowflake
      - dbt-snowflake:run
      - dbt-snowflake:test

  - name: github-to-snowflake
    tasks:
      - tap-github target-snowflake

Environment-Specific Config

environments:
  - name: dev
    config:
      plugins:
        extractors:
          - name: tap-postgres
            config:
              host: localhost
              database: dev_db
        loaders:
          - name: target-postgres
            config:
              host: localhost
              database: dev_warehouse

  - name: prod
    config:
      plugins:
        extractors:
          - name: tap-postgres
            config:
              host: prod-db.example.com
              database: production
        loaders:
          - name: target-snowflake
            config:
              warehouse: PROD_LOADING_WH
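
The effect of an environment block can be pictured as a per-plugin dictionary overlay. The Python sketch below is only an illustration of that layering, assuming environment-level keys simply replace same-named keys from the base plugin config. The function name effective_config is hypothetical, and Meltano's real resolution also considers environment variables and defaults, which are left out here.

```python
def effective_config(base, environment_override):
    """Return the base plugin config with the active environment's keys layered on top."""
    merged = dict(base)  # copy so the base config is left untouched
    merged.update(environment_override)
    return merged


# Values mirror the tap-postgres examples in this section; $VARS are left unexpanded.
BASE = {"host": "$PG_HOST", "port": 5432, "database": "$PG_DATABASE"}
OVERRIDES = {
    "dev": {"host": "localhost", "database": "dev_db"},
    "prod": {"host": "prod-db.example.com", "database": "production"},
}

if __name__ == "__main__":
    for env_name, override in OVERRIDES.items():
        print(env_name, effective_config(BASE, override))
```

Keys that an environment does not mention, such as port here, fall through from the base plugin definition.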

Running Pipelines

Basic EL Pipeline

# Run extraction and loading
meltano run tap-postgres target-snowflake

# Run with specific environment
meltano --environment=prod run tap-postgres target-snowflake

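# Incremental state is saved and reused automatically between runs;
# add --full-refresh to ignore saved state and re-extract everything
meltano run --full-refresh tap-postgres target-snowflake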

# Run full pipeline (ELT)
meltano run tap-postgres target-snowflake dbt-snowflake:run dbt-snowflake:test

# Run with debug logging
meltano --log-level=debug run tap-postgres target-snowflake

Entity Selection

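# Select specific tables/streams
meltano select tap-postgres public-orders "*"

# Select individual attributes (one pattern per command)
meltano select tap-postgres public-customers id
meltano select tap-postgres public-customers name
meltano select tap-postgres public-customers email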

# Exclude entities
meltano select tap-postgres --exclude public-audit_logs "*"

# List current selection
meltano select tap-postgres --list

# Select with replication method
meltano config tap-postgres set _metadata public-orders replication-method INCREMENTAL
meltano config tap-postgres set _metadata public-orders replication-key updated_at
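
Selection rules are glob patterns of the form entity.attribute, with exclusions taking precedence over inclusions. The Python sketch below approximates that matching with the standard library's fnmatch module. It is an illustration only: the function name is_selected and the leading "!" notation for exclusions (as written in the select list of meltano.yml) are the assumptions used here, and Meltano's own matcher may handle more cases, such as nested properties.

```python
from fnmatch import fnmatchcase


def is_selected(entity, attribute, rules):
    """Return True if entity.attribute matches an include rule and no '!' exclude rule."""
    name = f"{entity}.{attribute}"
    included = False
    for rule in rules:
        if rule.startswith("!"):
            if fnmatchcase(name, rule[1:]):
                return False  # exclusions win over inclusions
        elif fnmatchcase(name, rule):
            included = True
    return included


RULES = [
    "public-orders.*",
    "public-customers.id",
    "public-customers.email",
    "!public-audit_logs.*",
]

if __name__ == "__main__":
    for entity, attribute in [
        ("public-orders", "total"),
        ("public-customers", "phone"),
        ("public-audit_logs", "id"),
    ]:
        print(entity, attribute, is_selected(entity, attribute, RULES))
```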

Advanced Usage

Custom Extractors

# Create custom extractor from SDK
pip install cookiecutter
cookiecutter https://github.com/meltano/sdk --directory="cookiecutter/tap-template"

# tap_custom_api/tap.py
import requests
from singer_sdk import Stream, Tap
from singer_sdk.typing import (
    DateTimeType, IntegerType, PropertiesList, Property, StringType,
)


class CustomAPIStream(Stream):
    """Stream of records from a hypothetical JSON API."""

    name = "records"
    primary_keys = ["id"]
    replication_key = "updated_at"
    schema = PropertiesList(
        Property("id", IntegerType, required=True),
        Property("name", StringType),
        Property("status", StringType),
        Property("updated_at", DateTimeType),  # the replication key must be in the schema
    ).to_dict()

    def get_records(self, context):
        # The base Stream class has no HTTP session, so call the API directly.
        # The Bearer header and the "data" envelope are assumptions about the API.
        response = requests.get(
            f"{self.config['base_url']}/records",
            headers={"Authorization": f"Bearer {self.config['api_key']}"},
            timeout=30,
        )
        response.raise_for_status()
        yield from response.json()["data"]


class TapCustomAPI(Tap):
    name = "tap-custom-api"
    config_jsonschema = PropertiesList(
        Property("api_key", StringType, required=True),
        Property("base_url", StringType, required=True),
    ).to_dict()

    def discover_streams(self):
        return [CustomAPIStream(self)]

Orchestration with Airflow

# Add Airflow orchestrator
meltano add orchestrator airflow
meltano invoke airflow:initialize

# Start Airflow scheduler
meltano invoke airflow scheduler &
meltano invoke airflow webserver &

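# Define the job, then schedule it
meltano job add full-elt --tasks "tap-postgres target-snowflake dbt-snowflake:run"
meltano schedule add daily-sync --job full-elt --interval "@daily"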

# Airflow DAG is auto-generated from schedules

State Management

# View pipeline state
meltano state list
meltano state get dev:tap-postgres-to-target-snowflake

# Clear state (force full resync)
meltano state clear dev:tap-postgres-to-target-snowflake

# Set custom state (the payload must be wrapped in a top-level "singer_state" key)
meltano state set dev:tap-postgres-to-target-snowflake '{"singer_state": {"bookmarks": {}}}'

# Merge state
meltano state merge dev:tap-postgres-to-target-snowflake '{"singer_state": {"bookmarks": {"public-orders": {"replication_key_value": "2024-06-01"}}}}'
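
The difference between setting and merging state can be sketched in Python. This assumes a merge behaves like a recursive dictionary merge, where keys in the update overwrite matching keys and everything else is kept, while a set replaces the stored value wholesale. The function name merge_state and the bookmark payloads are hypothetical, and Meltano's actual merge semantics may differ in detail.

```python
import copy


def merge_state(current, update):
    """Recursively merge update into a copy of current, leaving both inputs unchanged."""
    result = copy.deepcopy(current)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_state(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical stored state with bookmarks for two streams.
CURRENT = {
    "bookmarks": {
        "orders": {"replication_key_value": "2024-05-01"},
        "customers": {"replication_key_value": "2024-05-15"},
    }
}
UPDATE = {"bookmarks": {"orders": {"replication_key_value": "2024-06-01"}}}

if __name__ == "__main__":
    print("merge:", merge_state(CURRENT, UPDATE))  # customers bookmark is kept
    print("set:  ", UPDATE)  # a set would store only the new payload
```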

Troubleshooting

Issue                               Solution
Plugin installation fails           Check Python version compatibility. Try meltano install --clean
Connection refused                  Verify credentials in .env. Test with meltano config <plugin> test
No data extracted                   Check entity selection with meltano select <tap> --list. Verify source has data
State not persisting                Check MELTANO_DATABASE_URI is set. Default uses SQLite in .meltano/
Incremental sync not working        Verify replication_key is set. Check state with meltano state get
Schema conflicts at target          Set default_target_schema differently per tap. Use add_record_metadata
dbt transform errors                Run meltano invoke dbt-snowflake debug. Check model SQL syntax
Environment variables not loading   Verify .env file exists. Use $VAR_NAME syntax in meltano.yml
Schedule not running                Ensure Airflow is initialized and scheduler is running. Check DAG parsing
Lock file conflicts                 Run meltano lock --update --all to regenerate lock files
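
For the environment-variable case, a quick check is to list which $VAR_NAME references in meltano.yml have no value. The Python sketch below is a hypothetical helper, not part of Meltano: the name missing_variables and the regular expression are assumptions, and it only recognises upper-case names written as $NAME or ${NAME}.

```python
import re

# Matches $NAME and ${NAME}; upper-case names only, as in the examples in this document.
VAR_PATTERN = re.compile(r"\$\{?([A-Z_][A-Z0-9_]*)\}?")


def missing_variables(yaml_text, env):
    """Return the sorted variable names referenced in yaml_text but absent from env."""
    referenced = {match.group(1) for match in VAR_PATTERN.finditer(yaml_text)}
    return sorted(name for name in referenced if name not in env)


if __name__ == "__main__":
    import os
    from pathlib import Path

    config_path = Path("meltano.yml")
    if config_path.exists():
        # os.environ does not include values that live only in .env;
        # pass a dict parsed from that file to cover them as well.
        print(missing_variables(config_path.read_text(), os.environ))
```

Variables that Meltano itself supplies at run time, such as MELTANO_EXTRACT__LOAD_SCHEMA, will also be reported by this check and can be ignored.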