Feast Cheat Sheet

Overview

Feast (Feature Store) is an open-source feature store that helps ML teams manage and serve features for both model training and real-time inference. It bridges the gap between data engineering and ML by providing a central registry of feature definitions, consistent access to historical features for training and real-time features for serving, and tools for feature transformation, validation, and monitoring.

Feast decouples feature engineering from feature consumption by maintaining both an offline store (for historical training data) and an online store (for low-latency serving). It supports multiple backends including BigQuery, Snowflake, Redshift, PostgreSQL, DynamoDB, Redis, and SQLite. Feast handles point-in-time correct feature retrieval, preventing data leakage in training, and can materialize features from offline to online stores for production serving.
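
To make "point-in-time correct" concrete, here is a minimal plain-Python sketch of the idea. It is not Feast's implementation: the helper name `point_in_time_lookup`, the sample rows, and the inclusive TTL boundary are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def point_in_time_lookup(feature_rows, entity_key, as_of, ttl):
    """Return the newest row for entity_key whose event_timestamp lies in
    [as_of - ttl, as_of], or None if no row qualifies (hypothetical helper)."""
    candidates = [
        row for row in feature_rows
        if row["driver_id"] == entity_key
        and as_of - ttl <= row["event_timestamp"] <= as_of
    ]
    return max(candidates, key=lambda r: r["event_timestamp"], default=None)

# Made-up feature history for one driver.
history = [
    {"driver_id": 1001, "event_timestamp": datetime(2024, 5, 30), "conv_rate": 0.40},
    {"driver_id": 1001, "event_timestamp": datetime(2024, 6, 10), "conv_rate": 0.55},
]

row = point_in_time_lookup(history, 1001, datetime(2024, 6, 1), timedelta(days=365))
print(row["conv_rate"])  # 0.4: the June 10 value lies in the future relative to June 1
```

A training row labelled June 1 never sees the June 10 value, which is the leakage the join is meant to prevent. A short TTL has the opposite effect: the May 30 row would be treated as too old and the lookup would return nothing.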

Installation

# Install Feast
pip install feast

# Install with specific backends
pip install 'feast[redis]'
pip install 'feast[aws]'
pip install 'feast[gcp]'
pip install 'feast[snowflake]'
pip install 'feast[postgres]'

# Initialize a new feature repository
feast init my_feature_repo
cd my_feature_repo

# Apply feature definitions
feast apply

# Verify setup
feast version

Core Concepts

Feature Repository Structure

my_feature_repo/
├── feature_store.yaml      # Project configuration
├── data/                   # Sample data (for local dev)
│   └── driver_stats.parquet
├── features/
│   ├── __init__.py
│   ├── driver_features.py  # Feature definitions
│   └── order_features.py
└── tests/

Feature Definitions

# features/driver_features.py
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource, PushSource
from feast.types import Float32, Int64, String

# Define entities (join keys)
driver = Entity(
    name="driver",
    join_keys=["driver_id"],
    description="Driver entity",
)

# Define data source
driver_stats_source = FileSource(
    name="driver_stats_source",
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Define feature view
driver_stats_fv = FeatureView(
    name="driver_stats",
    entities=[driver],
    ttl=timedelta(days=365),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
        Field(name="driver_rating", dtype=Float32),
    ],
    source=driver_stats_source,
    online=True,
    tags={"team": "driver_experience"},
)

# Push source for real-time streaming
driver_stats_push = PushSource(
    name="driver_stats_push",
    batch_source=driver_stats_source,
)

# Feature view with push source
driver_stats_realtime = FeatureView(
    name="driver_stats_realtime",
    entities=[driver],
    ttl=timedelta(hours=1),
    schema=[
        Field(name="current_trips", dtype=Int64),
        Field(name="current_rating", dtype=Float32),
    ],
    source=driver_stats_push,
    online=True,
)

On-Demand Feature Views (Transformations)

from feast import on_demand_feature_view, Field
from feast.types import Float32, Int64
import pandas as pd

# driver_stats_fv is the feature view defined in features/driver_features.py

@on_demand_feature_view(
    sources=[driver_stats_fv],
    schema=[
        Field(name="trips_per_conversion", dtype=Float32),
        Field(name="is_high_performer", dtype=Int64),
    ],
)
def driver_performance(inputs: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df["trips_per_conversion"] = inputs["avg_daily_trips"] / (inputs["conv_rate"] + 0.001)
    df["is_high_performer"] = (inputs["driver_rating"] >= 4.5).astype(int)
    return df
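
As a quick sanity check of the arithmetic, the same two derived columns can be computed for a single row in plain Python. The function name `driver_performance_row` and the input values are hypothetical; this only mirrors the formulas above and does not involve Feast.

```python
def driver_performance_row(avg_daily_trips, conv_rate, driver_rating):
    """Plain-Python mirror of the two derived columns for one row."""
    return {
        # The 0.001 offset keeps the division defined when conv_rate is 0.
        "trips_per_conversion": avg_daily_trips / (conv_rate + 0.001),
        # The comparison is inclusive, so a rating of exactly 4.5 counts.
        "is_high_performer": int(driver_rating >= 4.5),
    }

example = driver_performance_row(avg_daily_trips=20, conv_rate=0.0, driver_rating=4.5)
print(example)
```

With a conversion rate of zero the ratio comes out at roughly 20000 instead of raising a division error, which is worth keeping in mind when such values reach a model.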

CLI Commands

| Command | Description |
|---|---|
| feast init <name> | Create a new feature repository |
| feast apply | Register or update feature definitions |
| feast materialize <start> <end> | Materialize features for a time range into the online store |
| feast materialize-incremental <end> | Materialize features added since the last run, up to <end> |
| feast entities list | List registered entities |
| feast feature-views list | List feature views |
| feast feature-services list | List feature services |
| feast registry-dump | Dump the feature registry |
| feast teardown | Remove all Feast infrastructure |
| feast serve | Start the feature server (HTTP/gRPC) |
| feast ui | Launch the Feast Web UI |
| feast version | Show the Feast version |
| feast validate | Validate logged features from a feature service against a reference (run feast validate --help for the arguments your version expects) |
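
When these commands run from a scheduler rather than a shell, it can help to build the argument list in Python. The sketch below assumes the timestamp layout used by the materialize examples in this sheet; the helper name `materialize_incremental_argv` is hypothetical.

```python
import shlex
from datetime import datetime, timezone

def materialize_incremental_argv(end=None):
    """Build argv for `feast materialize-incremental <end>`.
    `end` defaults to the current UTC time (hypothetical helper)."""
    end = end or datetime.now(timezone.utc)
    # Timestamp layout: YYYY-MM-DDTHH:MM:SS
    return ["feast", "materialize-incremental", end.strftime("%Y-%m-%dT%H:%M:%S")]

argv = materialize_incremental_argv(datetime(2024, 6, 30, tzinfo=timezone.utc))
print(shlex.join(argv))  # feast materialize-incremental 2024-06-30T00:00:00
```

The resulting list can be handed to `subprocess.run(argv, check=True)` with the feature repository as the working directory.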

Configuration

feature_store.yaml

project: my_project
provider: local
registry: data/registry.db
online_store:
  type: sqlite
  path: data/online_store.db
offline_store:
  type: file
entity_key_serialization_version: 2

AWS Configuration

project: my_project
provider: aws
registry: s3://feast-bucket/registry.pb
online_store:
  type: dynamodb
  region: us-east-1
  table_name_prefix: feast_
offline_store:
  type: redshift
  cluster_id: my-redshift-cluster
  region: us-east-1
  database: feast
  user: feast_user
  s3_staging_location: s3://feast-bucket/staging/

GCP Configuration

project: my_project
provider: gcp
registry: gs://feast-bucket/registry.pb
online_store:
  type: datastore
  project_id: my-gcp-project
  namespace: feast
offline_store:
  type: bigquery
  project_id: my-gcp-project
  dataset: feast_offline

PostgreSQL Configuration

project: my_project
provider: local
registry:
  registry_type: sql
  path: postgresql://user:pass@host:5432/feast
online_store:
  type: postgres
  host: localhost
  port: 5432
  database: feast_online
  user: feast
  password: feast
offline_store:
  type: postgres
  host: localhost
  port: 5432
  database: feast_offline
  user: feast
  password: feast

Feature Retrieval

Training Data (Offline)

from feast import FeatureStore
from datetime import datetime
import pandas as pd

store = FeatureStore(repo_path=".")

# Define entity DataFrame with timestamps
entity_df = pd.DataFrame({
    "driver_id": [1001, 1002, 1003, 1004],
    "event_timestamp": [
        datetime(2024, 6, 1),
        datetime(2024, 6, 1),
        datetime(2024, 6, 15),
        datetime(2024, 6, 15),
    ],
})

# Get historical features (point-in-time correct)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_stats:conv_rate",
        "driver_stats:acc_rate",
        "driver_stats:avg_daily_trips",
        "driver_performance:is_high_performer",
    ],
).to_df()

print(training_df.head())

Online Serving

# Get online features for real-time inference
online_features = store.get_online_features(
    features=[
        "driver_stats:conv_rate",
        "driver_stats:acc_rate",
        "driver_stats:avg_daily_trips",
    ],
    entity_rows=[
        {"driver_id": 1001},
        {"driver_id": 1002},
    ],
).to_dict()

print(online_features)
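
The dictionary printed above is column-oriented: each key maps to a list with one value per entity row. If per-entity records are easier to work with, a small helper can pivot it. The helper name `columns_to_rows` and the sample values are made up for illustration.

```python
def columns_to_rows(columns):
    """Turn {"name": [v0, v1, ...], ...} into [{"name": v0, ...}, {"name": v1, ...}]."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]

# Made-up values shaped like a column-oriented response.
sample = {
    "driver_id": [1001, 1002],
    "conv_rate": [0.52, None],       # None: no value found for that entity
    "avg_daily_trips": [14, None],
}

rows = columns_to_rows(sample)
print(rows[0])  # {'driver_id': 1001, 'conv_rate': 0.52, 'avg_daily_trips': 14}
```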

Feature Server (HTTP API)

# Start feature server
feast serve --host 0.0.0.0 --port 6566

# Query features via HTTP
curl -X POST http://localhost:6566/get-online-features \
    -H "Content-Type: application/json" \
    -d '{
        "features": ["driver_stats:conv_rate", "driver_stats:avg_daily_trips"],
        "entities": {"driver_id": [1001, 1002]}
    }'
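
The same request can be issued from Python with only the standard library. This is a sketch under the assumption that the server accepts the JSON body shown in the curl call; the function names `build_online_features_request` and `fetch_online_features` are hypothetical.

```python
import json
import urllib.request

def build_online_features_request(base_url, features, entities):
    """Build (but do not send) a POST request for /get-online-features."""
    body = json.dumps({"features": features, "entities": entities}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/get-online-features",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fetch_online_features(request, timeout=5.0):
    """Send the request and decode the JSON reply; needs a running feature server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

request = build_online_features_request(
    "http://localhost:6566",
    ["driver_stats:conv_rate", "driver_stats:avg_daily_trips"],
    {"driver_id": [1001, 1002]},
)
print(request.get_method(), request.full_url)  # POST http://localhost:6566/get-online-features
```

Calling `fetch_online_features(request)` performs the network round trip. Keeping it separate from request construction makes the payload easy to inspect without a server.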

Advanced Usage

Feature Services

from feast import FeatureService

driver_v1 = FeatureService(
    name="driver_activity_v1",
    features=[
        driver_stats_fv[["conv_rate", "avg_daily_trips"]],
        driver_performance,
    ],
    tags={"release": "production"},
)

# Use in retrieval
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=driver_v1,
).to_df()

Materialization and Scheduling

# Materialize features from offline to online store
feast materialize 2024-01-01T00:00:00 2024-06-30T00:00:00

# Incremental materialization (from last run to now)
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")

# Push features to online store (streaming)
from datetime import datetime

from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path=".")

event_df = pd.DataFrame({
    "driver_id": [1001],
    "current_trips": [15],
    "current_rating": [4.8],
    "event_timestamp": [datetime.utcnow()],
})

store.push("driver_stats_push", event_df)
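
The materialize commands at the top of this section can be pictured as copying the newest value per entity from the offline store into a key-value online store. The sketch below is a conceptual model only, with a plain dict standing in for the online store; the function name `materialize` and the rows are hypothetical, and Feast's actual behavior depends on the configured stores.

```python
from datetime import datetime

def materialize(offline_rows, online_store, start, end):
    """Copy the newest row per driver_id with start <= event_timestamp <= end
    into online_store (a dict standing in for a key-value store)."""
    for row in sorted(offline_rows, key=lambda r: r["event_timestamp"]):
        if start <= row["event_timestamp"] <= end:
            # Later rows overwrite earlier ones, so only the latest value survives.
            online_store[row["driver_id"]] = row
    return online_store

# Made-up offline rows.
offline_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2024, 1, 5), "avg_daily_trips": 10},
    {"driver_id": 1001, "event_timestamp": datetime(2024, 3, 1), "avg_daily_trips": 14},
    {"driver_id": 1002, "event_timestamp": datetime(2024, 7, 9), "avg_daily_trips": 7},
]

online = materialize(offline_rows, {}, datetime(2024, 1, 1), datetime(2024, 6, 30))
print(online[1001]["avg_daily_trips"])  # 14
print(1002 in online)                   # False: its only row is outside the window
```

This also shows why online lookups return nothing for an entity whose data falls outside every materialized window.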

Troubleshooting

| Issue | Solution |
|---|---|
| feast apply fails | Check Python syntax in feature definitions. Validate feature_store.yaml |
| Features returning None | Ensure materialization is up to date. Check that entity key values match |
| Point-in-time join incorrect | Verify the event_timestamp column in the entity DataFrame. Check TTL settings |
| Online store connection error | Verify credentials and network access. Check online store configuration |
| Slow historical retrieval | Use partitioned data sources. Reduce the date range in the entity DataFrame |
| Registry sync issues | Use a SQL registry for multi-user access. Avoid a file-based registry in production |
| Feature server timeout | Increase the server timeout. Check online store latency |
| Push source not updating | Verify the push source name matches. Check that the feature view has online=True |
| Schema mismatch | Ensure feature types match data source types. Run feast apply after changes |
| Materialization OOM | Reduce the date range per materialization run. Increase worker memory |
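
For the out-of-memory case, one way to reduce the date range per run is to split a long backfill into consecutive windows and materialize them one at a time. A minimal sketch, assuming a fixed window size; the helper name `materialization_windows` and the 30-day step are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

def materialization_windows(start, end, step):
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end

windows = list(
    materialization_windows(datetime(2024, 1, 1), datetime(2024, 3, 1), timedelta(days=30))
)
for window_start, window_end in windows:
    # Each printed line is one smaller materialize invocation.
    print("feast materialize", window_start.isoformat(), window_end.isoformat())
```

Each window ends where the next begins, so no part of the range is skipped; the last window is shortened to stop exactly at the requested end.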