Kubeflow Cheat Sheet

Overview

Kubeflow is an open-source machine learning toolkit for Kubernetes that makes deploying and managing end-to-end ML workflows simple, portable, and scalable. It provides a comprehensive suite of tools covering the entire ML lifecycle: Kubeflow Pipelines for orchestrating ML workflows, Katib for hyperparameter tuning, KServe for model serving, Training Operators for distributed training, and Notebooks for interactive development environments.

Kubeflow relies on Kubernetes for resource management, scaling, and multi-tenancy, which makes it a fit for organizations that want to standardize their ML infrastructure. It supports major ML frameworks, including TensorFlow, PyTorch, XGBoost, and MXNet, and integrates with popular tools in the ML ecosystem. Kubeflow Pipelines is among its most widely adopted components; it lets teams build reproducible, versioned ML pipelines and track experiments and runs in a visual UI.

Installation

Kubeflow on Kubernetes

# Install using kustomize (full Kubeflow platform)
git clone https://github.com/kubeflow/manifests.git
cd manifests
while ! kustomize build example | kubectl apply -f -; do
    echo "Retrying to apply resources"
    sleep 10
done

# Verify installation
kubectl get pods -n kubeflow

# Access Kubeflow Dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
# Open http://localhost:8080

Kubeflow Pipelines Only

# Install standalone Kubeflow Pipelines
export PIPELINE_VERSION=2.1.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"

# Access Pipelines UI
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80

KFP SDK

# Install Kubeflow Pipelines SDK
pip install kfp==2.7.0

# Install additional components
pip install kfp-kubernetes   # Kubernetes-specific features
pip install kfp-server-api   # Pipeline server API client
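
After installing, it can help to confirm which SDK packages and versions are actually present in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the package list simply mirrors the pip commands above.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report each SDK package named in the install commands above.
    for name in ("kfp", "kfp-kubernetes", "kfp-server-api"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```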

Kubeflow Pipelines

Define a Pipeline

from kfp import dsl, compiler
from kfp.dsl import Input, Output, Dataset, Model, Metrics

@dsl.component(
    base_image="python:3.11-slim",
    packages_to_install=["pandas", "scikit-learn"],
)
def load_data(dataset: Output[Dataset]):
    import pandas as pd
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True, as_frame=True)
    df = pd.concat([X, y.rename("target")], axis=1)
    df.to_csv(dataset.path, index=False)

@dsl.component(
    base_image="python:3.11-slim",
    packages_to_install=["pandas", "scikit-learn", "joblib"],
)
def train_model(
    dataset: Input[Dataset],
    model: Output[Model],
    metrics: Output[Metrics],
    n_estimators: int = 100,
):
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    import joblib

    df = pd.read_csv(dataset.path)
    X = df.drop("target", axis=1)
    y = df["target"]

    clf = RandomForestClassifier(n_estimators=n_estimators)
    scores = cross_val_score(clf, X, y, cv=5)

    clf.fit(X, y)
    joblib.dump(clf, model.path)

    metrics.log_metric("accuracy_mean", float(scores.mean()))
    metrics.log_metric("accuracy_std", float(scores.std()))

@dsl.component(
    base_image="python:3.11-slim",
    packages_to_install=["pandas", "scikit-learn", "joblib"],
)
def evaluate_model(
    model: Input[Model],
    dataset: Input[Dataset],
    metrics: Output[Metrics],
) -> float:
    import pandas as pd
    from sklearn.metrics import accuracy_score
    import joblib

    df = pd.read_csv(dataset.path)
    X = df.drop("target", axis=1)
    y = df["target"]

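    clf = joblib.load(model.path)
    predictions = clf.predict(X)
    # NOTE: this scores the model on the same data it was trained on, so the
    # value is optimistic; use a held-out split for a true test metric.
    accuracy = float(accuracy_score(y, predictions))

    metrics.log_metric("test_accuracy", accuracy)
    return accuracy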

@dsl.pipeline(
    name="ML Training Pipeline",
    description="End-to-end ML training with evaluation",
)
def ml_pipeline(n_estimators: int = 100):
    load_task = load_data()
    train_task = train_model(
        dataset=load_task.outputs["dataset"],
        n_estimators=n_estimators,
    )
    eval_task = evaluate_model(
        model=train_task.outputs["model"],
        dataset=load_task.outputs["dataset"],
    )

# Compile pipeline
compiler.Compiler().compile(ml_pipeline, "pipeline.yaml")

Submit Pipeline Run

import kfp

# Connect to Kubeflow Pipelines
client = kfp.Client(host="http://localhost:8080")

# Upload pipeline
pipeline = client.upload_pipeline(
    pipeline_package_path="pipeline.yaml",
    pipeline_name="ML Training Pipeline",
    description="Training pipeline with evaluation",
)

# Create a run
run = client.create_run_from_pipeline_package(
    pipeline_file="pipeline.yaml",
    arguments={"n_estimators": 200},
    run_name="training-run-001",
    experiment_name="iris-experiments",
)

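# Get (or create) the experiment used by the calls below
experiment = client.create_experiment(name="iris-experiments")

# Or create run from uploaded pipeline
# (KFP v2 objects expose experiment_id / pipeline_id rather than id;
# some deployments may also need version_id for the pipeline version)
run = client.run_pipeline(
    experiment_id=experiment.experiment_id,
    job_name="training-run-002",
    pipeline_id=pipeline.pipeline_id,
    params={"n_estimators": 300},
)

# Create recurring run
recurring_run = client.create_recurring_run(
    experiment_id=experiment.experiment_id,
    job_name="daily-training",
    pipeline_id=pipeline.pipeline_id,
    cron_expression="0 0 6 * * *",  # six fields, seconds first: daily at 06:00
    params={"n_estimators": 100},
)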
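
The KFP SDK documents `cron_expression` as a six-field expression with a leading seconds field, unlike the five-field form used by crontab. Below is a minimal sketch of a converter, assuming the input is a standard five-field crontab string; `to_kfp_cron` is a hypothetical helper, not part of the SDK.

```python
def to_kfp_cron(expr: str, second: str = "0") -> str:
    """Convert a five-field crontab expression to a six-field one (seconds first).

    Six-field input is returned unchanged; anything else raises ValueError.
    """
    fields = expr.split()
    if len(fields) == 6:
        return " ".join(fields)
    if len(fields) != 5:
        raise ValueError(f"expected 5 or 6 cron fields, got {len(fields)}: {expr!r}")
    # Prepend the seconds field so the schedule fires once per matching minute.
    return " ".join([second, *fields])


if __name__ == "__main__":
    print(to_kfp_cron("0 6 * * *"))  # prints: 0 0 6 * * *
```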

CLI Commands

Command	Description
`kfp pipeline create -p <name> pipeline.yaml`	Upload a pipeline (`-p` sets the pipeline name)
`kfp pipeline list`	List pipelines
`kfp run submit -e exp -r name -f pipeline.yaml`	Submit a run
`kfp run list -e <experiment_id>`	List runs in an experiment (by experiment ID)
`kfp run get <run_id>`	Get run details
`kfp component build <dir>`	Build a component
`kfp dsl compile --py pipeline.py --output pipeline.yaml`	Compile pipeline

Katib (Hyperparameter Tuning)

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: n_estimators
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "500"
    - name: max_depth
      parameterType: int
      feasibleSpace:
        min: "3"
        max: "20"
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: n_estimators
        reference: n_estimators
      - name: max_depth
        reference: max_depth
      - name: learning_rate
        reference: learning_rate
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: my-registry/training:latest
                command:
                  - python
                  - train.py
                  - --n_estimators=${trialParameters.n_estimators}
                  - --max_depth=${trialParameters.max_depth}
                  - --learning_rate=${trialParameters.learning_rate}
            restartPolicy: Never
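
The trial template passes each hyperparameter to `train.py` as a command-line flag, and Katib's default StdOut metrics collector reads metrics from lines of the form `name=value` in the container log. A minimal sketch of the script's entry point follows. The training step is a placeholder (`run_training` is hypothetical and returns a dummy score), the defaults are arbitrary, and only the flag names and the `accuracy` metric name come from the manifest above.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the Katib trial template passes to the container."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--learning_rate", type=float, default=0.01)
    return parser.parse_args(argv)


def run_training(args) -> float:
    """Placeholder for real training; returns a dummy accuracy."""
    return 0.0


def format_metric(name: str, value: float) -> str:
    """Format a metric as `name=value`, the shape the StdOut collector parses."""
    return f"{name}={value:.4f}"


if __name__ == "__main__":
    args = parse_args()
    accuracy = run_training(args)
    # The metric name must match objectiveMetricName in the Experiment.
    print(format_metric("accuracy", accuracy))
```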

Training Operators

PyTorch Distributed Training

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-training
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
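
Inside each replica, the training operator sets the environment variables that `torch.distributed` uses for rendezvous (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`). With one Master and three Workers, the world size is 4. The sketch below reads those variables without importing PyTorch, so it runs anywhere; `DistEnv` and `read_dist_env` are hypothetical names, and the single-process defaults are an assumption for local runs.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    master_addr: str
    master_port: int
    world_size: int
    rank: int

    @property
    def is_master(self) -> bool:
        # Rank 0 is conventionally the coordinating process.
        return self.rank == 0


def read_dist_env(environ=None) -> DistEnv:
    """Read rendezvous settings, falling back to single-process defaults."""
    env = os.environ if environ is None else environ
    return DistEnv(
        master_addr=env.get("MASTER_ADDR", "localhost"),
        master_port=int(env.get("MASTER_PORT", "29500")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        rank=int(env.get("RANK", "0")),
    )


if __name__ == "__main__":
    print(read_dist_env())
```

In a real training script, `torch.distributed.init_process_group` with its default `env://` init method reads these same variables itself; a helper like this is mainly useful for logging and for rank-dependent steps such as saving checkpoints only on rank 0.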

KServe (Model Serving)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: kubeflow
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://models/sklearn/iris
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
  transformer:
    containers:
      - image: my-registry/transformer:latest
        name: transformer
  explainer:
    containers:
      - image: my-registry/explainer:latest
        name: explainer
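
Once the InferenceService is ready, predictions can be requested over KServe's V1 inference protocol, which accepts a JSON body with an `instances` list at `/v1/models/<name>:predict`. The sketch below builds such a request with the standard library but does not send it. The base URL, the `Host` header value, and the sample iris rows are assumptions that depend on your ingress setup.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, instances, host_header=None):
    """Build a POST request for the KServe V1 predict endpoint."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if host_header:
        # Often needed when routing through a shared ingress gateway by hostname.
        headers["Host"] = host_header
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_predict_request(
        base_url="http://localhost:8080",  # assumed port-forward to the gateway
        model_name="sklearn-iris",
        instances=[[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]],
        host_header="sklearn-iris.kubeflow.example.com",  # hypothetical hostname
    )
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req).read()
```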

Configuration

Kubeflow Profile (Multi-Tenancy)

apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: team-data-science
spec:
  owner:
    kind: User
    name: user@example.com
  resourceQuotaSpec:
    hard:
      requests.cpu: "16"
      requests.memory: "64Gi"
      limits.cpu: "32"
      limits.memory: "128Gi"
      requests.nvidia.com/gpu: "4"
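
When several teams need profiles, the manifest above can be generated rather than hand-written. This is a minimal sketch that builds the same structure as a Python dict and prints it as JSON, which `kubectl apply -f` also accepts. `make_profile` is a hypothetical helper, and the team name, owner, and quota values are illustrative only.

```python
import json


def make_profile(name, owner_email, cpu_requests="16", memory_requests="64Gi", gpus=None):
    """Build a Kubeflow Profile manifest as a plain dict."""
    hard = {"requests.cpu": cpu_requests, "requests.memory": memory_requests}
    if gpus is not None:
        # Quota values are strings in the manifest, so convert the count.
        hard["requests.nvidia.com/gpu"] = str(gpus)
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": name},
        "spec": {
            "owner": {"kind": "User", "name": owner_email},
            "resourceQuotaSpec": {"hard": hard},
        },
    }


if __name__ == "__main__":
    profile = make_profile("team-data-science", "user@example.com", gpus=4)
    print(json.dumps(profile, indent=2))
```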

Troubleshooting

Issue	Solution
Pipeline pod stuck in Pending	Check node resources. Verify resource requests fit available nodes
Component image pull failed	Verify image exists in registry. Check imagePullSecrets configuration
Pipeline compilation error	Check KFP SDK version compatibility. Verify component signatures
Katib experiment not starting	Check Katib controller logs. Verify trial template is valid
Training job OOM killed	Increase memory limits. Reduce batch size in training code
KServe model load fails	Check storage URI and credentials. Verify model format compatibility
Dashboard login fails	Check Dex/OIDC configuration. Verify user profile exists
Pipeline artifacts missing	Check MinIO/S3 storage. Verify artifact store configuration
Notebook server won’t start	Check PVC availability. Verify GPU resources if GPU notebook requested
Istio gateway errors	Check Istio configuration. Verify VirtualService and Gateway resources