BentoML Cheat Sheet
Overview
BentoML is an open-source framework for building, shipping, and scaling AI applications. It provides a standardized way to package trained ML models with their serving logic, dependencies, and configurations into portable artifacts called Bentos. These Bentos can be deployed as microservices, containerized with Docker, or deployed to cloud platforms including BentoCloud, AWS, GCP, and Kubernetes clusters.
BentoML supports all major ML frameworks including PyTorch, TensorFlow, Scikit-learn, XGBoost, LightGBM, HuggingFace Transformers, ONNX, and custom models. It provides features like adaptive batching for throughput optimization, built-in model management with versioning, parallel inference, GPU support, and a runner architecture that separates API logic from model computation. The framework handles the entire lifecycle from model development to production deployment.
Installation
# Install BentoML
pip install bentoml
# Install with specific framework support
pip install "bentoml[pytorch]"
pip install "bentoml[tensorflow]"
pip install "bentoml[transformers]"
# Verify installation
bentoml --version
# Check available models in local store
bentoml models list
Core Workflow
Save a Model
import bentoml
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Train model
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)
# Save to BentoML model store
saved_model = bentoml.sklearn.save_model(
"iris_classifier",
clf,
labels={"team": "data-science", "stage": "production"},
metadata={"accuracy": 0.96, "dataset": "iris"},
signatures={"predict": {"batchable": True}},
)
print(f"Model saved: {saved_model}")
Define a Service
# service.py
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray, JSON
from pydantic import BaseModel
from typing import List
class IrisFeatures(BaseModel):
sepal_length: float
sepal_width: float
petal_length: float
petal_width: float
class PredictionResponse(BaseModel):
prediction: int
probability: List[float]
# Load model reference
iris_model = bentoml.sklearn.get("iris_classifier:latest")
# Create a runner
iris_runner = iris_model.to_runner()
# Define the service
svc = bentoml.Service("iris_service", runners=[iris_runner])
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def predict(input_array: np.ndarray) -> np.ndarray:
return await iris_runner.predict.async_run(input_array)
@svc.api(input=JSON(pydantic_model=IrisFeatures), output=JSON(pydantic_model=PredictionResponse))
async def classify(features: IrisFeatures) -> PredictionResponse:
input_array = np.array([[
features.sepal_length,
features.sepal_width,
features.petal_length,
features.petal_width
]])
prediction = await iris_runner.predict.async_run(input_array)
probabilities = await iris_runner.predict_proba.async_run(input_array)
return PredictionResponse(
prediction=int(prediction[0]),
probability=probabilities[0].tolist()
)
Serve Locally
# Start development server
bentoml serve service:svc --reload --port 3000
# Or specify the service file
bentoml serve service.py:svc
Build and Deploy
# Build a Bento
bentoml build
# List built Bentos
bentoml list
# Containerize as Docker image
bentoml containerize iris_service:latest
# Run Docker container
docker run -p 3000:3000 iris_service:latest
CLI Commands
| Command | Description |
|---|---|
bentoml serve <service> | Start development server |
bentoml build | Build a Bento from service |
bentoml list | List all built Bentos |
bentoml get <bento> | Get Bento details |
bentoml delete <bento> | Delete a Bento |
bentoml export <bento> <path> | Export Bento to file |
bentoml import <path> | Import Bento from file |
bentoml containerize <bento> | Build Docker image |
bentoml push <bento> | Push to BentoCloud |
bentoml pull <bento> | Pull from BentoCloud |
bentoml models list | List saved models |
bentoml models get <model> | Get model details |
bentoml models delete <model> | Delete a model |
bentoml models export <model> <path> | Export model |
bentoml models import <path> | Import model |
bentoml deploy <bento> | Deploy to BentoCloud |
bentoml deployment list | List deployments |
Configuration
bentofile.yaml
service: "service:svc"
labels:
team: ml-platform
project: iris-classifier
include:
- "*.py"
- "config/*.yaml"
exclude:
- "tests/"
- "*.ipynb"
python:
packages:
- scikit-learn==1.4.0
- numpy>=1.24
- pydantic>=2.0
lock_packages: true
envs:
- name: MODEL_TIMEOUT
value: "60"
- name: BENTOML_CONFIG
value: "production"
docker:
distro: debian
python_version: "3.11"
cuda_version: null
system_packages:
- libgomp1
setup_script: "./setup.sh"
dockerfile_template: null
Service Configuration
# Configuration via decorators
@svc.api(
input=NumpyNdarray(dtype="float64", shape=(-1, 4)),
output=NumpyNdarray(dtype="int64"),
route="/v1/predict",
)
async def predict(input_array: np.ndarray) -> np.ndarray:
return await iris_runner.predict.async_run(input_array)
# Runner configuration
iris_runner = iris_model.to_runner(
name="iris_runner",
max_batch_size=100,
max_latency_ms=500,
)
Adaptive Batching
# Configure batching at model save time
bentoml.sklearn.save_model(
"iris_classifier",
clf,
signatures={
"predict": {
"batchable": True,
"batch_dim": 0,
}
},
)
# Or configure at runner creation
runner = model.to_runner(
max_batch_size=64,
max_latency_ms=300,
)
Advanced Usage
Multi-Model Service
import bentoml
from bentoml.io import JSON, NumpyNdarray
preprocessor = bentoml.sklearn.get("preprocessor:latest").to_runner()
classifier = bentoml.sklearn.get("classifier:latest").to_runner()
postprocessor = bentoml.sklearn.get("postprocessor:latest").to_runner()
svc = bentoml.Service(
"pipeline_service",
runners=[preprocessor, classifier, postprocessor]
)
@svc.api(input=JSON(), output=JSON())
async def predict(input_data: dict) -> dict:
features = await preprocessor.transform.async_run(input_data["features"])
predictions = await classifier.predict.async_run(features)
result = await postprocessor.format.async_run(predictions)
return {"result": result}
GPU Configuration
import bentoml
# Save with GPU resource requirements
runner = model.to_runner(
name="gpu_runner",
runnable_init_params={"device": "cuda"},
)
# In bentofile.yaml
# docker:
# cuda_version: "11.8"
Custom Runner
import bentoml
class MyCustomRunnable(bentoml.Runnable):
SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
SUPPORTS_CPU_MULTI_THREADING = True
def __init__(self):
import torch
self.model = torch.load("model.pt")
self.model.eval()
@bentoml.Runnable.method(batchable=True, batch_dim=0)
def predict(self, input_tensor):
import torch
with torch.no_grad():
return self.model(input_tensor)
custom_runner = bentoml.Runner(
MyCustomRunnable,
name="custom_runner",
max_batch_size=32,
)
svc = bentoml.Service("custom_service", runners=[custom_runner])
Monitoring and Metrics
# BentoML exposes Prometheus metrics at /metrics
# Key metrics:
# - bentoml_api_request_duration_seconds
# - bentoml_api_request_total
# - bentoml_runner_request_duration_seconds
# - bentoml_runner_request_total
# - bentoml_runner_adaptive_batch_size
# Access via: curl http://localhost:3000/metrics
Troubleshooting
| Issue | Solution |
|---|---|
| Model not found | Check bentoml models list. Verify model name and version tag |
| Build fails | Check bentofile.yaml syntax. Verify all included files exist |
| Container build fails | Check Docker daemon is running. Verify base image compatibility |
| Slow inference | Enable adaptive batching. Check runner resource allocation |
| GPU not detected | Verify CUDA drivers installed. Set cuda_version in bentofile.yaml |
| Import errors in container | Add missing packages to bentofile.yaml python.packages list |
| Port already in use | Change port with --port flag. Kill existing process on the port |
| Runner timeout | Increase timeout configuration. Check model loading time |
| Batch dimension mismatch | Verify batch_dim in signatures matches input array shape |
| Memory leak in service | Check for retained references in service code. Monitor with /metrics |