MLServer Cheat Sheet
Overview
MLServer is an open-source inference server for machine learning models, developed by Seldon. It implements the V2 Inference Protocol (also known as the Open Inference Protocol), providing a standardized REST and gRPC API for serving ML models. MLServer supports multiple ML frameworks out of the box including Scikit-learn, XGBoost, LightGBM, Spark MLlib, HuggingFace Transformers, and custom Python models.
MLServer targets production deployments. It offers adaptive batching, multi-model serving on a single server instance, parallel inference across models, reloading of models at runtime through the repository API, and built-in metrics for monitoring and observability. It can run as the inference runtime inside Kubernetes-based platforms such as Seldon Core and KServe, or be deployed standalone or in containers. The server also supports model repositories for managing multiple models and model versions.
Installation
# Install the MLServer core package (runtimes are installed separately)
pip install mlserver
# Install specific runtimes
pip install mlserver-sklearn # Scikit-learn
pip install mlserver-xgboost # XGBoost
pip install mlserver-lightgbm # LightGBM
pip install mlserver-mlflow # MLflow models
pip install mlserver-huggingface # HuggingFace Transformers
pip install mlserver-alibi-explain # Model explanations
pip install mlserver-alibi-detect # Drift detection
# Verify installation
mlserver --version
Quick Start
Train and Save a Model
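# train_model.py
import os
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
# Train a small classifier on the Iris dataset
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)
# Create the output directory first; joblib.dump does not create it
os.makedirs("model", exist_ok=True)
joblib.dump(model, "model/model.joblib")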
Model Settings
// model/model-settings.json
{
  "name": "iris-classifier",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "uri": "./model.joblib",
    "version": "v1.0"
  }
}
Server Settings
// settings.json
{
  "debug": true,
  "host": "0.0.0.0",
  "http_port": 8080,
  "grpc_port": 8081,
  "metrics_port": 8082
}
Start the Server
# Start MLServer from the current directory (reads settings.json and any model-settings.json beneath it)
mlserver start .
# Or point it at a specific model directory
mlserver start model/
Make Predictions
# REST API inference request
curl -X POST http://localhost:8080/v2/models/iris-classifier/infer \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "predict",
"shape": [1, 4],
"datatype": "FP64",
"data": [5.1, 3.5, 1.4, 0.2]
}
]
}'
# Check model readiness
curl http://localhost:8080/v2/models/iris-classifier/ready
# Get model metadata
curl http://localhost:8080/v2/models/iris-classifier
# Server health
curl http://localhost:8080/v2/health/ready
curl http://localhost:8080/v2/health/live
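The request body used in the curl call can also be assembled programmatically. The following is a minimal sketch, not part of MLServer itself: `build_v2_payload` is a hypothetical helper that infers `shape` from a list of equal-length rows and flattens the values in row-major order, which is the layout the `data` field above uses.

```python
import json


def build_v2_payload(rows, name="predict", datatype="FP64"):
    """Build a V2 inference request body from a list of equal-length rows.

    Hypothetical helper for illustration; the default input name and datatype
    mirror the curl example above.
    """
    if not rows:
        raise ValueError("rows must contain at least one sample")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of features")
    # Tensor data is sent flattened in row-major order, with the shape alongside it.
    flat = [value for row in rows for value in row]
    return {
        "inputs": [
            {
                "name": name,
                "shape": [len(rows), width],
                "datatype": datatype,
                "data": flat,
            }
        ]
    }


payload = build_v2_payload([[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]])
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed to `curl -d` or to any HTTP client as the request body.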
V2 Inference Protocol API
| Endpoint | Method | Description |
|---|---|---|
| /v2/health/live | GET | Server liveness check |
| /v2/health/ready | GET | Server readiness check |
| /v2 | GET | Server metadata |
| /v2/models/{model}/ready | GET | Model readiness |
| /v2/models/{model} | GET | Model metadata |
| /v2/models/{model}/infer | POST | Run inference |
| /v2/models/{model}/versions/{ver}/infer | POST | Versioned inference |
| /v2/repository/index | POST | List models in repository |
| /v2/repository/models/{model}/load | POST | Load a model |
| /v2/repository/models/{model}/unload | POST | Unload a model |
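The readiness endpoints in the table are useful for gating traffic until a model is loaded. Below is a rough sketch using only the Python standard library. The helper names `http_ok` and `wait_until_ready`, the retry count, and the delay are assumptions chosen for illustration, not MLServer APIs.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx answers all count as "not ready".
        return False


def wait_until_ready(base_url, model, check=http_ok, attempts=30, delay=1.0):
    """Poll server and model readiness; return True once both checks pass."""
    urls = [
        f"{base_url}/v2/health/ready",
        f"{base_url}/v2/models/{model}/ready",
    ]
    for _ in range(attempts):
        if all(check(url) for url in urls):
            return True
        time.sleep(delay)
    return False


# Example (requires a running server):
# wait_until_ready("http://localhost:8080", "iris-classifier")
```

The `check` parameter is injectable so the polling logic can be exercised without a live server.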
Runtime Configurations
XGBoost
{
  "name": "xgboost-model",
  "implementation": "mlserver_xgboost.XGBoostModel",
  "parameters": {
    "uri": "./model.bst",
    "version": "v1.0"
  }
}
LightGBM
{
  "name": "lightgbm-model",
  "implementation": "mlserver_lightgbm.LightGBMModel",
  "parameters": {
    "uri": "./model.bst",
    "version": "v1.0"
  }
}
HuggingFace Transformers
{
  "name": "sentiment-analysis",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "text-classification",
      "pretrained_model": "distilbert-base-uncased-finetuned-sst-2-english"
    }
  }
}
MLflow Model
{
  "name": "mlflow-model",
  "implementation": "mlserver_mlflow.MLflowRuntime",
  "parameters": {
    "uri": "models:/my_model/Production"
  }
}
Configuration
Custom Runtime
# custom_runtime.py
import numpy as np
from mlserver import MLModel
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
class CustomModel(MLModel):
    async def load(self) -> bool:
        """Load model artifacts."""
        model_uri = self.settings.parameters.uri
        self._model = self._load_model(model_uri)
        self.ready = True
        return self.ready
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        """Run inference."""
        # Decode the first input tensor; fall back to the NumPy codec when
        # the request does not declare a content type
        input_data = self.decode(payload.inputs[0], default_codec=NumpyCodec)
        # Cast to float64 so the values match the declared FP64 datatype
        predictions = np.asarray(self._model.predict(input_data), dtype=np.float64)
        return InferenceResponse(
            model_name=self.name,
            outputs=[
                ResponseOutput(
                    name="predictions",
                    shape=list(predictions.shape),
                    datatype="FP64",
                    data=predictions.flatten().tolist(),
                )
            ],
        )
    def _load_model(self, uri):
        import joblib
        return joblib.load(uri)
// model-settings.json for custom runtime
{
  "name": "custom-model",
  "implementation": "custom_runtime.CustomModel",
  "parameters": {
    "uri": "./model.joblib"
  }
}
Adaptive Batching
{
  "name": "batch-model",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "uri": "./model.joblib"
  },
  "max_batch_size": 32,
  "max_batch_time": 0.5
}
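To build intuition for how `max_batch_size` and `max_batch_time` interact, here is a simplified simulation. It is a mental model only and does not reproduce MLServer's internal scheduler: it assumes a batch is dispatched either when it reaches `max_batch_size` requests or when `max_batch_time` seconds have passed since its first request arrived. The function name `group_into_batches` is hypothetical.

```python
def group_into_batches(arrival_times, max_batch_size, max_batch_time):
    """Group sorted arrival timestamps (in seconds) into batches.

    Simplified model: a batch closes when it is full, or when the next
    request arrives after the batch's time window has expired.
    """
    batches = []
    current = []
    for t in arrival_times:
        # The open batch has waited long enough: dispatch it before adding t.
        if current and t - current[0] >= max_batch_time:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: dispatch it immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0.00, 0.10, 0.20, 0.70, 0.75, 2.00]
batches = group_into_batches(arrivals, max_batch_size=32, max_batch_time=0.5)
print(batches)  # [[0.0, 0.1, 0.2], [0.7, 0.75], [2.0]]
```

With the settings shown above, bursts of requests that arrive within half a second share one model call, while an isolated request waits at most roughly `max_batch_time` before it is processed. A larger window trades latency for throughput.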
Multi-Model Serving
repository/
├── settings.json # Server settings
├── model-a/
│ ├── model-settings.json
│ └── model.joblib
├── model-b/
│ ├── model-settings.json
│ └── model.bst
└── model-c/
    ├── model-settings.json
    └── model.pkl
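Before starting the server against a layout like this, it can help to check which models a directory walk would find. The sketch below is a standalone illustration, not MLServer's loader: `discover_models` is a hypothetical helper, and falling back to the folder name when `"name"` is missing is an assumption of this example.

```python
import json
from pathlib import Path


def discover_models(repository):
    """Return {model name: settings dict} for each model-settings.json found."""
    models = {}
    for settings_path in sorted(Path(repository).rglob("model-settings.json")):
        settings = json.loads(settings_path.read_text())
        # Assumption of this sketch: use the folder name if "name" is omitted.
        name = settings.get("name", settings_path.parent.name)
        models[name] = settings
    return models


# Example:
# for name, settings in discover_models("repository").items():
#     print(name, settings.get("implementation"))
```

Because it parses each file with `json.loads`, this also catches malformed settings files (for example, ones that still contain comment lines) before the server tries to load them.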
Advanced Usage
Docker Deployment
FROM python:3.11-slim
RUN pip install mlserver mlserver-sklearn mlserver-xgboost
COPY . /opt/mlserver
WORKDIR /opt/mlserver
EXPOSE 8080 8081 8082
CMD ["mlserver", "start", "."]
# Build the image, then run it with the HTTP, gRPC, and metrics ports published
docker build -t my-mlserver .
docker run -p 8080:8080 -p 8081:8081 -p 8082:8082 my-mlserver
Python Client
import requests
import numpy as np
# Prepare a request with 3 samples of 4 features each
payload = {
    "inputs": [
        {
            "name": "predict",
            "shape": [3, 4],
            "datatype": "FP64",
            "data": np.random.rand(3, 4).flatten().tolist(),
        }
    ]
}
# Send request
response = requests.post(
    "http://localhost:8080/v2/models/iris-classifier/infer",
    json=payload,
)
# Fail fast on HTTP errors instead of parsing an error body
response.raise_for_status()
result = response.json()
predictions = result["outputs"][0]["data"]
print(f"Predictions: {predictions}")
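Each output in a V2 response carries a flat `data` list together with a `shape`, so multi-dimensional results need to be rebuilt on the client. The following is a small sketch in pure Python. `reshape` and `decode_outputs` are hypothetical helpers, and `sample_response` is made-up data in the protocol's layout rather than output captured from a real server.

```python
def reshape(data, shape):
    """Rebuild a nested list of the given shape from row-major flat data.

    Assumes every dimension in `shape` is greater than zero.
    """
    expected = 1
    for dim in shape:
        expected *= dim
    if len(data) != expected:
        raise ValueError(f"shape {shape} needs {expected} values, got {len(data)}")
    # Chunk from the innermost dimension outwards.
    for dim in reversed(shape[1:]):
        data = [data[i:i + dim] for i in range(0, len(data), dim)]
    return data


def decode_outputs(response_json):
    """Map each output name to its data, nested according to its shape."""
    return {
        output["name"]: reshape(output["data"], output["shape"])
        for output in response_json["outputs"]
    }


# Made-up response body for illustration only.
sample_response = {
    "model_name": "iris-classifier",
    "outputs": [
        {"name": "predict", "shape": [3, 1], "datatype": "INT64", "data": [0, 2, 1]}
    ],
}
decoded = decode_outputs(sample_response)
print(decoded)  # {'predict': [[0], [2], [1]]}
```

When NumPy is available, `np.array(output["data"]).reshape(output["shape"])` achieves the same result more directly.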
gRPC Client
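import grpc
import mlserver.grpc.converters as converters
import mlserver.types as types
from mlserver.grpc import dataplane_pb2_grpc
# Build the request with MLServer's Python types, then convert it to protobuf
inference_request = types.InferenceRequest(
    inputs=[
        types.RequestInput(
            name="predict", shape=[1, 4], datatype="FP64", data=[5.1, 3.5, 1.4, 0.2]
        )
    ]
)
request = converters.ModelInferRequestConverter.from_types(
    inference_request, model_name="iris-classifier", model_version=None
)
channel = grpc.insecure_channel("localhost:8081")
stub = dataplane_pb2_grpc.GRPCInferenceServiceStub(channel)
response = stub.ModelInfer(request)
print(response)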
Troubleshooting
| Issue | Solution |
|---|---|
| Model fails to load | Check uri path in model-settings.json. Verify model file exists |
| Wrong predictions/shape | Verify input shape and datatype match training data format |
| Server won’t start | Check port conflicts. Verify all required packages are installed |
| gRPC connection refused | Ensure grpc_port is set and not blocked by firewall |
| Adaptive batching not working | Set both max_batch_size (greater than 1) and max_batch_time (greater than 0) in model settings |
| High latency | Enable adaptive batching. Check model complexity and input sizes |
| Multi-model memory issues | Reduce loaded models. Use model repository with on-demand loading |
| Custom runtime import error | Ensure module is in Python path. Check class name in model-settings.json |
| Hot-reload not triggering | Re-send POST /v2/repository/models/{model}/load after updating the model files. Check model directory permissions |
| Metrics not available | Ensure metrics_port is configured. Access /metrics endpoint on metrics port |