Kubeflow Cheat Sheet
Overview
Kubeflow is an open-source machine learning toolkit for Kubernetes that makes deploying and managing end-to-end ML workflows simple, portable, and scalable. It provides a comprehensive suite of tools covering the entire ML lifecycle: Kubeflow Pipelines for orchestrating ML workflows, Katib for hyperparameter tuning, KServe for model serving, Training Operators for distributed training, and Notebooks for interactive development environments.
Kubeflow relies on Kubernetes for resource management, scaling, and multi-tenancy, which makes it a good fit for organizations that want to standardize their ML infrastructure. It supports major ML frameworks such as TensorFlow, PyTorch, XGBoost, and MXNet, and integrates with popular tools in the ML ecosystem. Kubeflow Pipelines is one of its most widely adopted components, enabling teams to build reproducible, versioned ML pipelines with a visual UI for tracking experiments and runs.
Installation
Kubeflow on Kubernetes
# Install using kustomize (full Kubeflow platform)
git clone https://github.com/kubeflow/manifests.git
cd manifests
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying to apply resources"
  sleep 10
done
# Verify installation
kubectl get pods -n kubeflow
# Access Kubeflow Dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
# Open http://localhost:8080
Kubeflow Pipelines Only
# Install standalone Kubeflow Pipelines
export PIPELINE_VERSION=2.1.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"
# Access Pipelines UI
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
KFP SDK
# Install Kubeflow Pipelines SDK
pip install kfp==2.7.0
# Install additional components
pip install kfp-kubernetes # Kubernetes-specific features
pip install kfp-server-api # Pipeline server API client
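Since SDK and backend version mismatches are a common source of compilation errors, it can help to confirm what is actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names taken from the pip commands above.
    for name in ("kfp", "kfp-kubernetes", "kfp-server-api"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Running it inside the environment used to compile pipelines shows whether the pinned `kfp` version is the one being imported.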
Kubeflow Pipelines
Define a Pipeline
from kfp import dsl, compiler
from kfp.dsl import Input, Output, Dataset, Model, Metrics


@dsl.component(
    base_image="python:3.11-slim",
    packages_to_install=["pandas", "scikit-learn"],
)
def load_data(dataset: Output[Dataset]):
    import pandas as pd
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True, as_frame=True)
    df = pd.concat([X, y.rename("target")], axis=1)
    df.to_csv(dataset.path, index=False)


@dsl.component(
    base_image="python:3.11-slim",
    packages_to_install=["pandas", "scikit-learn", "joblib"],
)
def train_model(
    dataset: Input[Dataset],
    model: Output[Model],
    metrics: Output[Metrics],
    n_estimators: int = 100,
):
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    import joblib

    df = pd.read_csv(dataset.path)
    X = df.drop("target", axis=1)
    y = df["target"]
    clf = RandomForestClassifier(n_estimators=n_estimators)
    scores = cross_val_score(clf, X, y, cv=5)
    clf.fit(X, y)
    joblib.dump(clf, model.path)
    metrics.log_metric("accuracy_mean", float(scores.mean()))
    metrics.log_metric("accuracy_std", float(scores.std()))


@dsl.component(
    base_image="python:3.11-slim",
    packages_to_install=["pandas", "scikit-learn", "joblib"],
)
def evaluate_model(
    model: Input[Model],
    dataset: Input[Dataset],
    metrics: Output[Metrics],
) -> float:
    import pandas as pd
    from sklearn.metrics import accuracy_score
    import joblib

    df = pd.read_csv(dataset.path)
    X = df.drop("target", axis=1)
    y = df["target"]
    clf = joblib.load(model.path)
    predictions = clf.predict(X)
    accuracy = accuracy_score(y, predictions)
    metrics.log_metric("test_accuracy", accuracy)
    return accuracy


@dsl.pipeline(
    name="ML Training Pipeline",
    description="End-to-end ML training with evaluation",
)
def ml_pipeline(n_estimators: int = 100):
    load_task = load_data()
    train_task = train_model(
        dataset=load_task.outputs["dataset"],
        n_estimators=n_estimators,
    )
    eval_task = evaluate_model(
        model=train_task.outputs["model"],
        dataset=load_task.outputs["dataset"],
    )


# Compile pipeline
compiler.Compiler().compile(ml_pipeline, "pipeline.yaml")
Submit Pipeline Run
import kfp

# Connect to Kubeflow Pipelines
client = kfp.Client(host="http://localhost:8080")

# Upload pipeline
pipeline = client.upload_pipeline(
    pipeline_package_path="pipeline.yaml",
    pipeline_name="ML Training Pipeline",
    description="Training pipeline with evaluation",
)

# Create a run directly from the compiled package
run = client.create_run_from_pipeline_package(
    pipeline_file="pipeline.yaml",
    arguments={"n_estimators": 200},
    run_name="training-run-001",
    experiment_name="iris-experiments",
)

# Get (or create) the experiment used by the runs below
experiment = client.create_experiment(name="iris-experiments")

# Or create run from uploaded pipeline
# (KFP SDK 2.x exposes the IDs as experiment_id and pipeline_id;
# depending on the SDK release you may also need to pass version_id)
run = client.run_pipeline(
    experiment_id=experiment.experiment_id,
    job_name="training-run-002",
    pipeline_id=pipeline.pipeline_id,
    params={"n_estimators": 300},
)

# Create recurring run
recurring_run = client.create_recurring_run(
    experiment_id=experiment.experiment_id,
    job_name="daily-training",
    pipeline_id=pipeline.pipeline_id,
    cron_expression="0 0 6 * * *",  # six fields, seconds first: daily at 06:00
    params={"n_estimators": 100},
)
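KFP documents `cron_expression` as a six-field expression with a leading seconds field, so a standard five-field crontab string needs a `0` prepended before it is passed to `create_recurring_run`. A minimal sketch of that conversion follows; the helper name `to_kfp_cron` is hypothetical, and it only checks the field count, not the contents of each field.

```python
def to_kfp_cron(expression: str) -> str:
    """Normalize a cron string to the six-field (seconds-first) form.

    Five-field crontab expressions get "0" prepended as the seconds field;
    six-field expressions are returned with whitespace normalized.
    """
    fields = expression.split()
    if len(fields) == 6:
        return " ".join(fields)
    if len(fields) == 5:
        return " ".join(["0", *fields])
    raise ValueError(
        f"expected 5 or 6 cron fields, got {len(fields)}: {expression!r}"
    )


if __name__ == "__main__":
    # Daily at 06:00, written crontab-style and converted.
    print(to_kfp_cron("0 6 * * *"))
```

The converted string can then be passed as `cron_expression=to_kfp_cron("0 6 * * *")`.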
CLI Commands
| Command | Description |
|---|---|
| kfp pipeline create -p <name> pipeline.yaml | Upload a pipeline |
| kfp pipeline list | List pipelines |
| kfp run create -e exp -r name -f pipeline.yaml | Submit a run |
| kfp run list -e <experiment_id> | List runs in experiment |
| kfp run get <run_id> | Get run details |
| kfp component build <dir> | Build container images for the components in a directory |
| kfp dsl compile --py pipeline.py --output pipeline.yaml | Compile pipeline |
Katib (Hyperparameter Tuning)
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: n_estimators
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "500"
    - name: max_depth
      parameterType: int
      feasibleSpace:
        min: "3"
        max: "20"
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: n_estimators
        reference: n_estimators
      - name: max_depth
        reference: max_depth
      - name: learning_rate
        reference: learning_rate
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: my-registry/training:latest
                command:
                  - python
                  - train.py
                  - --n_estimators=${trialParameters.n_estimators}
                  - --max_depth=${trialParameters.max_depth}
                  - --learning_rate=${trialParameters.learning_rate}
            restartPolicy: Never
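The trial template above calls a `train.py` that is not shown. A minimal sketch of what its entry point might look like is given here, assuming the experiment uses Katib's default StdOut metrics collector, which reads `name=value` lines from the container's standard output. The `train_and_score` function is a placeholder, not real training code, and the argument defaults are arbitrary.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the trial template passes to the container."""
    parser = argparse.ArgumentParser(description="Katib trial entry point (sketch)")
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--learning_rate", type=float, default=0.01)
    # parse_known_args tolerates extra flags instead of exiting.
    args, _unknown = parser.parse_known_args(argv)
    return args


def format_metric(name: str, value: float) -> str:
    """Format a metric as a name=value line for the StdOut collector."""
    return f"{name}={value:.4f}"


def train_and_score(args) -> float:
    """Placeholder: replace with real training that returns an accuracy."""
    return 0.0


def main(argv=None) -> None:
    args = parse_args(argv)
    accuracy = train_and_score(args)
    # The metric name must match objectiveMetricName in the Experiment.
    print(format_metric("accuracy", accuracy))


if __name__ == "__main__":
    main()
```

The key design point is that the printed metric name (`accuracy`) matches `objectiveMetricName`, otherwise the trial has nothing to report against the objective.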
Training Operators
PyTorch Distributed Training
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-training
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
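Inside each replica, the training operator sets the usual PyTorch rendezvous variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`). The sketch below shows one way the training script might read them; the `DistEnv` class and the fallback values for local single-process runs are assumptions for illustration, not part of the operator.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    master_addr: str
    master_port: int
    world_size: int
    rank: int

    @property
    def is_master(self) -> bool:
        # Rank 0 is conventionally the coordinating process.
        return self.rank == 0


def read_dist_env(environ: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read rendezvous settings, falling back to single-process defaults."""
    env = os.environ if environ is None else environ
    return DistEnv(
        master_addr=env.get("MASTER_ADDR", "localhost"),
        master_port=int(env.get("MASTER_PORT", "29500")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        rank=int(env.get("RANK", "0")),
    )


if __name__ == "__main__":
    print(read_dist_env())
```

With one Master and three Workers as in the manifest above, a script would expect `world_size` to be 4. The resulting values are what a call such as `torch.distributed.init_process_group` picks up when using the `env://` init method.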
KServe (Model Serving)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: kubeflow
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://models/sklearn/iris
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
  transformer:
    containers:
      - image: my-registry/transformer:latest
        name: transformer
  explainer:
    containers:
      - image: my-registry/explainer:latest
        name: explainer
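Once the InferenceService is ready, it can be queried over KServe's V1 inference protocol, which expects a POST to `/v1/models/<name>:predict` with an `instances` array. The sketch below builds such a request with the standard library. The base URL, the optional `Host` header value, and the sample feature vector are assumptions; the actual hostname depends on how the ingress is configured.

```python
import json
import urllib.request
from typing import Any, List, Optional


def predict_url(base_url: str, model_name: str) -> str:
    """Build the V1 protocol predict endpoint for a model."""
    return f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"


def build_request(
    base_url: str,
    model_name: str,
    instances: List[Any],
    host_header: Optional[str] = None,
) -> urllib.request.Request:
    """Create a POST request carrying {"instances": [...]} as JSON."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if host_header:
        # Often needed when routing through a shared ingress gateway.
        headers["Host"] = host_header
    return urllib.request.Request(
        predict_url(base_url, model_name),
        data=body,
        headers=headers,
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # One iris sample: sepal length/width, petal length/width.
    req = build_request("http://localhost:8080", "sklearn-iris", [[5.1, 3.5, 1.4, 0.2]])
    print(req.full_url)
```

Calling `send(req)` requires a reachable endpoint, for example through a port-forward to the ingress gateway.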
Configuration
Kubeflow Profile (Multi-Tenancy)
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: team-data-science
spec:
  owner:
    kind: User
    name: user@example.com
  resourceQuotaSpec:
    hard:
      requests.cpu: "16"
      requests.memory: "64Gi"
      limits.cpu: "32"
      limits.memory: "128Gi"
      requests.nvidia.com/gpu: "4"
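The quota caps the sum of requests across all pods in the profile's namespace, which is a frequent reason for pods staying in Pending. The following is a rough sketch of that arithmetic for CPU and memory requests. The helper names are hypothetical, and the parser handles only plain numbers, millicores (`m`), and binary memory suffixes (`Ki`, `Mi`, `Gi`, `Ti`), not the full Kubernetes quantity syntax.

```python
from typing import Dict, List

_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "2" or "500m" to cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "64Gi" to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count; other suffixes raise ValueError


def fits_quota(pod_requests: List[Dict[str, str]], quota: Dict[str, str]) -> bool:
    """Check whether the summed pod requests stay within the quota."""
    total_cpu = sum(parse_cpu(p["cpu"]) for p in pod_requests)
    total_mem = sum(parse_memory(p["memory"]) for p in pod_requests)
    return (
        total_cpu <= parse_cpu(quota["requests.cpu"])
        and total_mem <= parse_memory(quota["requests.memory"])
    )


if __name__ == "__main__":
    quota = {"requests.cpu": "16", "requests.memory": "64Gi"}
    pods = [{"cpu": "4", "memory": "16Gi"}] * 4
    print(fits_quota(pods, quota))
```

Four pods requesting 4 CPUs and 16Gi each exactly fill the quota above; a fifth identical pod would be rejected by the namespace's ResourceQuota.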
Troubleshooting
| Issue | Solution |
|---|---|
| Pipeline pod stuck in Pending | Check node resources. Verify resource requests fit available nodes |
| Component image pull failed | Verify image exists in registry. Check imagePullSecrets configuration |
| Pipeline compilation error | Check KFP SDK version compatibility. Verify component signatures |
| Katib experiment not starting | Check Katib controller logs. Verify trial template is valid |
| Training job OOM killed | Increase memory limits. Reduce batch size in training code |
| KServe model load fails | Check storage URI and credentials. Verify model format compatibility |
| Dashboard login fails | Check Dex/OIDC configuration. Verify user profile exists |
| Pipeline artifacts missing | Check MinIO/S3 storage. Verify artifact store configuration |
| Notebook server won’t start | Check PVC availability. Verify GPU resources if GPU notebook requested |
| Istio gateway errors | Check Istio configuration. Verify VirtualService and Gateway resources |