Modal

Installation

# Install the Modal client
pip install modal

# Authenticate (creates token in ~/.modal.toml)
modal token new

# Or set environment variables
export MODAL_TOKEN_ID=...
export MODAL_TOKEN_SECRET=...

# Verify
modal --version
modal profile current

Configuration

`~/.modal.toml` (auto-generated by `modal token new`)

[default]
token_id = "ak-xxxxxxxxxxxx"
token_secret = "as-xxxxxxxxxxxx"

App Definition

# app.py — every Modal project starts with an App
import modal

app = modal.App("my-ml-app")

# Custom container image
image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "torch==2.3.0",
        "transformers==4.40.0",
        "accelerate",
        "datasets",
        "pillow",
    )
    .apt_install("libgl1")         # system package; apt_install refreshes the package index first
)

Core Commands

CLI

Command	Description
`modal run app.py`	Run the app once as an ephemeral app (functions execute remotely on Modal)
`modal run app.py::my_function`	Run a specific function
`modal deploy app.py`	Deploy the app (persistent)
`modal serve app.py`	Serve with hot-reload (dev mode)
`modal shell app.py::my_function`	Open an interactive shell in a container built from that function's image
`modal shell --cmd bash`	Start bash in the default image
`modal app list`	List deployed apps
`modal app stop my-app`	Stop a deployed app
`modal app logs my-app`	Stream app logs
`modal container list`	List running containers
`modal container exec <id> bash`	Exec into a running container
`modal volume create my-vol`	Create a volume
`modal volume list`	List volumes
`modal volume put my-vol ./data /data`	Upload files to volume
`modal volume get my-vol /data ./local`	Download files from volume
`modal volume ls my-vol /`	List volume contents
`modal secret create MY_SECRET key=value`	Create a secret
`modal secret list`	List secrets
`modal profile list`	List auth profiles
`modal profile activate myprofile`	Switch profile
`modal token new`	Generate new token
`modal nfs create my-nfs`	Create network filesystem

Advanced Usage

Functions and GPU Scheduling

import modal

app = modal.App("gpu-training")

image = modal.Image.debian_slim().pip_install("torch", "torchvision")

@app.function(
    image=image,
    gpu="A10G",                    # A10G | A100 | A100-80GB | H100 | T4 | L4 | any
    cpu=4,                         # CPU cores
    memory=32768,                  # Memory in MB (32GB)
    timeout=3600,                  # Seconds (default 300, max 86400)
    retries=3,                     # Auto-retry on failure
    concurrency_limit=10,          # Max parallel instances
)
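def train_model(config: dict) -> dict:
    import torch
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # get_device_name() raises without CUDA, so only query it when a GPU is present
    name = torch.cuda.get_device_name(0) if device.type == "cuda" else "CPU"
    print(f"Using: {device} ({name})")
    # ... training code ...
    return {"loss": 0.05, "accuracy": 0.97}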


# Run locally (executes on Modal's cloud)
if __name__ == "__main__":
    with app.run():
        result = train_model.remote({"lr": 0.001, "epochs": 10})
        print(result)

Classes (Warm Containers)

import modal

app = modal.App("inference-service")

image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

@app.cls(
    image=image,
    gpu="A10G",
    container_idle_timeout=300,    # Keep warm for 5 min after last request
    allow_concurrent_inputs=10,    # Handle 10 requests per container
)
class TextClassifier:
    @modal.enter()                 # Runs once on container start
    def load_model(self):
        from transformers import pipeline
        self.classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0
        )

    @modal.method()
    def classify(self, text: str) -> dict:
        return self.classifier(text)[0]

    @modal.exit()                  # Runs on container shutdown
    def cleanup(self):
        del self.classifier

# Call the class
@app.local_entrypoint()
def main():
    classifier = TextClassifier()
    result = classifier.classify.remote("Modal is incredible!")
    print(result)

Volumes (Persistent Storage)

import modal

app = modal.App("training-with-storage")

# Create or reference a volume
volume = modal.Volume.from_name("training-data", create_if_missing=True)

image = modal.Image.debian_slim().pip_install("torch", "datasets")

@app.function(
    image=image,
    gpu="A10G",
    volumes={"/data": volume},     # Mount at /data inside container
)
def download_and_train():
    import os
    from datasets import load_dataset, load_from_disk

    # save_to_disk() writes a directory, so use a directory-style path
    dataset_dir = "/data/imdb_train"

    # Check if data already exists (persisted from a previous run)
    if not os.path.exists(dataset_dir):
        ds = load_dataset("imdb", split="train")
        ds.save_to_disk(dataset_dir)
        volume.commit()            # Flush writes to the volume

    # Load from volume
    ds = load_from_disk(dataset_dir)
    model = train(ds)              # train() is a placeholder for your training code

    # Save model checkpoint to volume
    model.save_pretrained("/data/checkpoints/epoch_1")
    volume.commit()

Secrets

import modal

# Create secrets via CLI:
# modal secret create openai-secret OPENAI_API_KEY=sk-...
# modal secret create aws-creds AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=...

openai_secret = modal.Secret.from_name("openai-secret")
aws_secret = modal.Secret.from_name("aws-creds")

# Secrets built from local values (handy during development)
local_secret = modal.Secret.from_dict({"MY_KEY": "my-value"})
env_secret = modal.Secret.from_dotenv(filename=".env.production")

app = modal.App("secret-demo")

@app.function(secrets=[openai_secret, aws_secret])
def call_openai():
    import os
    import openai
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    # ...
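Since secrets reach the function as plain environment variables, a missing or misnamed secret only shows up as a `KeyError` deep inside the function. A small guard at the top of the function can fail earlier with a clearer message. This is a sketch using only the standard library. `require_env` and `DEMO_API_KEY` are hypothetical names, not part of Modal.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the requested variables, or raise listing every one that is missing."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


# Demo only: simulate a secret that has been injected into the environment
os.environ["DEMO_API_KEY"] = "sk-placeholder"
creds = require_env("DEMO_API_KEY")
print(sorted(creds))
```

In `call_openai` above, the equivalent call would be `require_env("OPENAI_API_KEY")` before constructing the client.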

Web Endpoints

import modal
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = modal.App("web-api")

# Simple web endpoint
@app.function(gpu="T4")
@modal.web_endpoint(method="POST")
def generate(item: dict) -> dict:
    text = item["text"]
    result = run_model(text)
    return {"result": result}

# Full FastAPI app
web_app = FastAPI()

@app.function(
    image=modal.Image.debian_slim().pip_install("fastapi", "torch", "transformers"),
    gpu="A10G",
    container_idle_timeout=300,
)
@modal.asgi_app()
def fastapi_app():
    return web_app

@web_app.get("/health")
async def health():
    return {"status": "ok"}

@web_app.post("/classify")
async def classify(request: Request):
    body = await request.json()
    result = run_inference(body["text"])
    return JSONResponse({"label": result})

Parallel Map

import modal

app = modal.App("parallel-processing")

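@app.function(cpu=2, memory=4096, timeout=600)
def process_item(item: dict) -> dict:
    # CPU-bound processing
    return {"id": item["id"], "result": heavy_computation(item)}


@app.function(cpu=2, memory=4096, timeout=600)
def process_with_config(item: dict, config: dict) -> dict:
    # Two-argument variant for starmap; heavy_computation is a placeholder
    return {"id": item["id"], "result": heavy_computation(item, config)}


@app.local_entrypoint()
def main():
    items = load_large_dataset()   # 10,000 items
    config = {"mode": "fast"}      # illustrative settings shared by every call

    # map: returns results in input order
    results = list(process_item.map(items))

    # starmap: each tuple is unpacked into the function's positional arguments
    pairs = [(item, config) for item in items]
    results = list(process_with_config.starmap(pairs))

    # for_each: run for side effects; outputs are ignored, but the call
    # still waits for all inputs to finish
    process_item.for_each(items)

    # map with error handling
    for result in process_item.map(items, return_exceptions=True):
        if isinstance(result, Exception):
            print(f"Error: {result}")
        else:
            handle(result)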
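With `return_exceptions=True`, successes and failures arrive interleaved in one stream. The sketch below shows one way to separate them while remembering which input each came from, so that failed inputs can be retried or reported. `split_results` is a hypothetical helper, and the list of outcomes stands in for the output of `process_item.map(...)`.

```python
def split_results(results):
    """Separate successful results from exceptions, keeping input positions."""
    successes, failures = [], []
    for index, result in enumerate(results):
        if isinstance(result, Exception):
            failures.append((index, result))
        else:
            successes.append((index, result))
    return successes, failures


# Stand-in for list(process_item.map(items, return_exceptions=True))
outcomes = [{"id": 0}, ValueError("bad input"), {"id": 2}]
ok, failed = split_results(outcomes)

print(f"{len(ok)} succeeded, {len(failed)} failed")
```

The indices in `failed` point back into the original `items` list, which makes a targeted retry straightforward.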

Cron Jobs and Schedules

import modal
from datetime import datetime

app = modal.App("scheduled-jobs")

@app.function(schedule=modal.Period(hours=6))     # Every 6 hours
def refresh_cache():
    print(f"Refreshing at {datetime.now()}")
    fetch_and_store()

@app.function(schedule=modal.Cron("0 9 * * MON-FRI"))  # 9am weekdays UTC
def send_daily_report():
    report = generate_report()
    send_email(report)

@app.function(schedule=modal.Period(days=1))
def daily_retraining():
    train_model()
    deploy_model()
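To make the cron expression above concrete, here is a plain-Python check of what `0 9 * * MON-FRI` matches: minute 0, hour 9, Monday through Friday, evaluated in UTC. This only illustrates the schedule's meaning and is not how Modal evaluates it. `fires_at` is a hypothetical helper.

```python
from datetime import datetime, timezone


def fires_at(moment: datetime) -> bool:
    """True if a timezone-aware `moment` matches 0 9 * * MON-FRI in UTC."""
    utc = moment.astimezone(timezone.utc)
    # weekday(): Monday is 0, Sunday is 6
    return utc.minute == 0 and utc.hour == 9 and utc.weekday() < 5


monday_9am = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)    # a Monday
saturday_9am = datetime(2024, 1, 6, 9, 0, tzinfo=timezone.utc)  # a Saturday

print(fires_at(monday_9am), fires_at(saturday_9am))
```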

Custom Container Images

import modal

# Build from Dockerfile
dockerfile_image = modal.Image.from_dockerfile("./Dockerfile")

# Debian slim — most common
image = (
    modal.Image.debian_slim(python_version="3.12")
    .apt_install("libgl1", "libglib2.0-0")
    .pip_install("opencv-python-headless", "torch", "ultralytics")
    .copy_local_file("./config.yaml", "/app/config.yaml")
    .env({"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128"})
    .run_commands("python -c 'import torch; print(torch.__version__)'")
)

# From a public Docker image
custom_image = modal.Image.from_registry("nvcr.io/nvidia/pytorch:24.03-py3")

# Micromamba (conda-compatible)
conda_image = (
    modal.Image.micromamba(python_version="3.11")
    .micromamba_install("cudatoolkit=11.8", channels=["conda-forge", "nvidia"])
    .pip_install("torch", "transformers")
)

Fine-Tuning on Demand

import modal

app = modal.App("llm-finetuning")

volume = modal.Volume.from_name("finetuning-outputs", create_if_missing=True)

image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("transformers", "peft", "trl", "bitsandbytes", "accelerate", "datasets")
)

@app.function(
    image=image,
    gpu="A100-80GB",
    timeout=7200,                  # 2 hours
    volumes={"/outputs": volume},
    secrets=[modal.Secret.from_name("huggingface-token")],
)
def finetune_llm(
    base_model: str = "meta-llama/Meta-Llama-3-8B-Instruct",
    dataset_name: str = "my-org/my-dataset",
    num_epochs: int = 3,
):
    import os
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, TrainingArguments
    from peft import LoraConfig, get_peft_model
    from trl import SFTTrainer

    # Assumes the "huggingface-token" secret defines HUGGING_FACE_HUB_TOKEN
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        token=os.environ["HUGGING_FACE_HUB_TOKEN"],
        load_in_4bit=True,
        device_map="auto"
    )

    lora_config = LoraConfig(
        r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, lora_config)

    # Load the training split named by dataset_name
    dataset = load_dataset(dataset_name, split="train")

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            output_dir="/outputs/checkpoints",
            num_train_epochs=num_epochs,
        )
    )
    trainer.train()
    model.save_pretrained("/outputs/final-model")
    volume.commit()                # Persist the adapter weights to the volume

    return {"status": "complete", "output": "/outputs/final-model"}

Network File System (NFS)

import modal

app = modal.App("nfs-demo")        # app name is illustrative

# For shared read-write access across concurrent functions
nfs = modal.NetworkFileSystem.from_name("shared-nfs", create_if_missing=True)

@app.function(network_file_systems={"/shared": nfs})
def write_results(data):
    with open("/shared/results.json", "a") as f:
        f.write(data)

Common Workflows

Local Development → Cloud Deployment

# Dev: run with hot reload
modal serve app.py

# Test a single function
modal run app.py::train_model

# Interactive debugging (shell in a container built from the function's image)
modal shell app.py::train_model

# Deploy to production
modal deploy app.py

# Check deployment
modal app list
modal app logs my-ml-app

Batch Inference Pipeline

@app.function(gpu="T4", timeout=300)
def run_inference(batch: list[str]) -> list[dict]:
    model = load_model()
    return [model.predict(text) for text in batch]

@app.local_entrypoint()
def main():
    texts = load_texts()                    # 100,000 texts
    batch_size = 64
    batches = [texts[i:i+batch_size] for i in range(0, len(texts), batch_size)]

    all_results = []
    for batch_result in run_inference.map(batches):
        all_results.extend(batch_result)

    save_results(all_results)
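The batching step above builds every batch up front with a list comprehension. For large inputs, a generator yields one batch at a time instead. This is a minimal sketch using only the standard library. `chunked` is a hypothetical helper name.

```python
from collections.abc import Iterator, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Ten items in batches of four: the last batch is shorter
batches = list(chunked([f"text-{i}" for i in range(10)], batch_size=4))
print([len(batch) for batch in batches])
```

The generator can be passed straight to `run_inference.map(chunked(texts, 64))` in place of the precomputed `batches` list.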

Reproducible Experiments

@app.function(
    image=image,
    gpu="A100",
    volumes={"/data": data_volume, "/outputs": output_volume},
    secrets=[modal.Secret.from_name("wandb-secret")],
)
def experiment(config: dict):
    import wandb
    import os

    wandb.init(project="my-project", config=config)
    results = train(**config)
    wandb.log(results)

    # Save checkpoint
    save_checkpoint(results["model"], f"/outputs/{wandb.run.id}")
    output_volume.commit()

    return results

Tips and Best Practices

Define images at module level: Modal caches image layers and rebuilds only when pip_install arguments or commands change, which avoids unnecessary image builds.
Use @app.cls with container_idle_timeout for inference services: containers stay warm between requests, so real-time APIs skip the cold start as long as traffic keeps arriving within the timeout.
Call volume.commit() explicitly after important writes: other containers only see committed changes, and relying on implicit syncing risks losing data.
GPU selection matters for cost: T4 is the cheapest option for inference, A10G is a good balance for training, and A100-80GB suits large models; use gpu="any" for flexibility.
timeout defaults to 300s: always set a longer timeout for training jobs; the maximum is 86400s (24 hours).
Keep secrets out of images and logs: they are injected as environment variables at runtime, so never bake them into the image with env(), and avoid printing them.
modal shell app.py::my_function opens an interactive container built from that function's image: invaluable for debugging dependency issues.
return_exceptions=True in .map() prevents one failed item from killing the entire batch: handle errors item by item.
Use modal serve for web endpoint development: it hot-reloads on file changes so you can iterate without redeploying.
Volumes are regional: ensure your volume and functions are in the same region to avoid cross-region transfer costs and latency.
allow_concurrent_inputs on @app.cls lets a single container handle multiple requests at once: critical for GPU cost efficiency on inference workloads.
Prefer @app.local_entrypoint() over if __name__ == "__main__" for Modal apps: it integrates with the Modal CLI and runs inside the Modal context automatically.
