Feast Cheat Sheet
Overview
Feast (Feature Store) is an open-source feature store that helps ML teams manage and serve features for both model training and real-time inference. It bridges the gap between data engineering and ML by providing a central registry of feature definitions, consistent access to historical features for training and real-time features for serving, and tools for feature transformation, validation, and monitoring.
Feast decouples feature engineering from feature consumption by maintaining both an offline store (for historical training data) and an online store (for low-latency serving). It supports multiple backends including BigQuery, Snowflake, Redshift, PostgreSQL, DynamoDB, Redis, and SQLite. Feast handles point-in-time correct feature retrieval, preventing data leakage in training, and can materialize features from offline to online stores for production serving.
Installation
# Install Feast
pip install feast
# Install with specific backends
pip install 'feast[redis]'
pip install 'feast[aws]'
pip install 'feast[gcp]'
pip install 'feast[snowflake]'
pip install 'feast[postgres]'
# Initialize a new feature repository
feast init my_feature_repo
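cd my_feature_repo  # recent Feast releases scaffold a feature_repo/ subdirectory; cd into whichever directory holds feature_store.yaml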
# Apply feature definitions
feast apply
# Verify setup
feast version
Core Concepts
Feature Repository Structure
my_feature_repo/
├── feature_store.yaml        # Project configuration
├── data/                     # Sample data (for local dev)
│   └── driver_stats.parquet
├── features/
│   ├── __init__.py
│   ├── driver_features.py    # Feature definitions
│   └── order_features.py
└── tests/
Feature Definitions
# features/driver_features.py
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource, PushSource
from feast.types import Float32, Int64, String
# Define entities (join keys)
driver = Entity(
    name="driver",
    join_keys=["driver_id"],
    description="Driver entity",
)
# Define data source
driver_stats_source = FileSource(
    name="driver_stats_source",
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)
# Define feature view
driver_stats_fv = FeatureView(
    name="driver_stats",
    entities=[driver],
    ttl=timedelta(days=365),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
        Field(name="driver_rating", dtype=Float32),
    ],
    source=driver_stats_source,
    online=True,
    tags={"team": "driver_experience"},
)
# Push source for real-time streaming
driver_stats_push = PushSource(
    name="driver_stats_push",
    batch_source=driver_stats_source,
)
# Feature view with push source
driver_stats_realtime = FeatureView(
    name="driver_stats_realtime",
    entities=[driver],
    ttl=timedelta(hours=1),
    schema=[
        Field(name="current_trips", dtype=Int64),
        Field(name="current_rating", dtype=Float32),
    ],
    source=driver_stats_push,
    online=True,
)
On-Demand Feature Views (Transformations)
from feast import on_demand_feature_view, Field
from feast.types import Float32, Int64
import pandas as pd

@on_demand_feature_view(
    sources=[driver_stats_fv],
    schema=[
        Field(name="trips_per_conversion", dtype=Float32),
        Field(name="is_high_performer", dtype=Int64),
    ],
)
def driver_performance(inputs: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    # The small offset avoids division by zero when conv_rate is 0
    df["trips_per_conversion"] = inputs["avg_daily_trips"] / (inputs["conv_rate"] + 0.001)
    df["is_high_performer"] = (inputs["driver_rating"] >= 4.5).astype(int)
    return df
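Because the transformation is ordinary pandas code, its arithmetic can be checked on a hand-built frame before it is registered. The sketch below repeats the logic of driver_performance in an undecorated function (driver_performance_logic is a hypothetical name, and the input values are made up); it does not call Feast at all.

```python
import pandas as pd


def driver_performance_logic(inputs: pd.DataFrame) -> pd.DataFrame:
    """Same arithmetic as driver_performance, without the Feast decorator."""
    df = pd.DataFrame()
    # The 0.001 offset keeps the division finite when conv_rate is 0.
    df["trips_per_conversion"] = inputs["avg_daily_trips"] / (inputs["conv_rate"] + 0.001)
    df["is_high_performer"] = (inputs["driver_rating"] >= 4.5).astype(int)
    return df


# Made-up sample rows, including a zero conversion rate as an edge case.
sample = pd.DataFrame({
    "avg_daily_trips": [10, 20],
    "conv_rate": [0.0, 0.5],
    "driver_rating": [4.7, 4.2],
})
result = driver_performance_logic(sample)
print(result)
```

Keeping the arithmetic in a plain function like this also makes it easy to cover with unit tests, with the decorated function delegating to it.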
CLI Commands
| Command | Description |
|---|---|
| feast init <name> | Create new feature repository |
| feast apply | Register/update feature definitions |
| feast materialize <start> <end> | Materialize features to online store |
| feast materialize-incremental <end> | Incrementally materialize new features |
| feast entities list | List registered entities |
| feast feature-views list | List feature views |
| feast feature-services list | List feature services |
| feast registry-dump | Dump the feature registry |
| feast teardown | Remove all Feast infrastructure |
| feast serve | Start the feature server (HTTP/gRPC) |
| feast ui | Launch the Feast Web UI |
| feast version | Show Feast version |
| feast validate | Validate logged features of a feature service against a reference |
Configuration
feature_store.yaml
project: my_project
provider: local
registry: data/registry.db
online_store:
    type: sqlite
    path: data/online_store.db
offline_store:
    type: file
entity_key_serialization_version: 2
AWS Configuration
project: my_project
provider: aws
registry: s3://feast-bucket/registry.pb
online_store:
    type: dynamodb
    region: us-east-1
    table_name_prefix: feast_
offline_store:
    type: redshift
    cluster_id: my-redshift-cluster
    region: us-east-1
    database: feast
    user: feast_user
    s3_staging_location: s3://feast-bucket/staging/
GCP Configuration
project: my_project
provider: gcp
registry: gs://feast-bucket/registry.pb
online_store:
    type: datastore
    project_id: my-gcp-project
    namespace: feast
offline_store:
    type: bigquery
    project_id: my-gcp-project
    dataset: feast_offline
PostgreSQL Configuration
project: my_project
provider: local
registry:
    registry_type: sql
    path: postgresql://user:pass@host:5432/feast
online_store:
    type: postgres
    host: localhost
    port: 5432
    database: feast_online
    user: feast
    password: feast
offline_store:
    type: postgres
    host: localhost
    port: 5432
    database: feast_offline
    user: feast
    password: feast
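The PostgreSQL example keeps passwords in the file itself. A common alternative is a ${VAR} placeholder filled from the environment. To my knowledge Feast expands environment variables when it loads feature_store.yaml, but treat that as an assumption and confirm it against the version you have installed. The sketch below only shows what such an expansion does to a config fragment, using Python's os.path.expandvars; FEAST_PG_PASSWORD is a hypothetical variable name.

```python
import os

# Hypothetical variable name; set here only so the sketch runs on its own.
os.environ["FEAST_PG_PASSWORD"] = "example-secret"

config_fragment = """\
online_store:
    type: postgres
    host: localhost
    port: 5432
    database: feast_online
    user: feast
    password: ${FEAST_PG_PASSWORD}
"""

# os.path.expandvars substitutes ${NAME} with the environment value
# and leaves placeholders for unset variables untouched.
expanded = os.path.expandvars(config_fragment)
print(expanded)
```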
Feature Retrieval
Training Data (Offline)
from feast import FeatureStore
from datetime import datetime
import pandas as pd
store = FeatureStore(repo_path=".")
# Define entity DataFrame with timestamps
entity_df = pd.DataFrame({
    "driver_id": [1001, 1002, 1003, 1004],
    "event_timestamp": [
        datetime(2024, 6, 1),
        datetime(2024, 6, 1),
        datetime(2024, 6, 15),
        datetime(2024, 6, 15),
    ],
})
# Get historical features (point-in-time correct)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_stats:conv_rate",
        "driver_stats:acc_rate",
        "driver_stats:avg_daily_trips",
        "driver_performance:is_high_performer",
    ],
).to_df()
print(training_df.head())
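To make "point-in-time correct" concrete: for each entity row, the join picks the most recent feature value whose timestamp is at or before the row's event_timestamp, and ignores values older than the feature view's TTL. The sketch below illustrates that rule in plain Python. It is not Feast's implementation; point_in_time_lookup and the history values are made up for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, entity_ts, ttl):
    """Return the latest feature value at or before entity_ts, within ttl.

    feature_rows: list of (event_timestamp, value) sorted by timestamp.
    Returns None when no row falls inside [entity_ts - ttl, entity_ts].
    """
    timestamps = [ts for ts, _ in feature_rows]
    # Index of the first row strictly after entity_ts; the row before it
    # is the most recent one that was already known at entity_ts.
    idx = bisect_right(timestamps, entity_ts)
    if idx == 0:
        return None  # every row lies in the future relative to entity_ts
    ts, value = feature_rows[idx - 1]
    if entity_ts - ts > ttl:
        return None  # the newest usable row is older than the TTL allows
    return value


# Made-up conv_rate history for a single driver.
history = [
    (datetime(2024, 5, 20), 0.41),
    (datetime(2024, 6, 10), 0.47),
    (datetime(2024, 6, 20), 0.52),
]

as_of_june_1 = point_in_time_lookup(history, datetime(2024, 6, 1), timedelta(days=365))
as_of_june_15 = point_in_time_lookup(history, datetime(2024, 6, 15), timedelta(days=365))
expired = point_in_time_lookup(history, datetime(2024, 6, 1), timedelta(days=7))
print(as_of_june_1, as_of_june_15, expired)  # 0.41 0.47 None
```

The June 20 value is never returned for either entity timestamp, which is how the join avoids leaking future data into training rows.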
Online Serving
# Get online features for real-time inference
online_features = store.get_online_features(
    features=[
        "driver_stats:conv_rate",
        "driver_stats:acc_rate",
        "driver_stats:avg_daily_trips",
    ],
    entity_rows=[
        {"driver_id": 1001},
        {"driver_id": 1002},
    ],
).to_dict()
print(online_features)
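The dictionary returned by to_dict() is column-oriented: each key maps to a list with one entry per entity row, and features that could not be found come back as None. The sketch below pivots such a dictionary into one record per entity and lists the missing features. The helper names (to_rows, missing_features) are hypothetical and the sample values are made up.

```python
def to_rows(columnar):
    """Pivot {"name": [v0, v1, ...]} into [{"name": v0, ...}, {"name": v1, ...}]."""
    keys = list(columnar)
    return [dict(zip(keys, values)) for values in zip(*(columnar[k] for k in keys))]


def missing_features(row, entity_keys=("driver_id",)):
    """Names of the features that came back as None for one entity row."""
    return [k for k, v in row.items() if k not in entity_keys and v is None]


# Made-up values in the shape of the online_features dictionary.
sample = {
    "driver_id": [1001, 1002],
    "conv_rate": [0.47, None],
    "acc_rate": [0.91, None],
    "avg_daily_trips": [312, None],
}

rows = to_rows(sample)
for row in rows:
    print(row["driver_id"], missing_features(row))
```

A row whose features are all None usually means the entity was never materialized or its values have expired, so a check like this is a reasonable guard before passing features to a model.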
Feature Server (HTTP API)
# Start feature server
feast serve --host 0.0.0.0 --port 6566
# Query features via HTTP
curl -X POST http://localhost:6566/get-online-features \
    -H "Content-Type: application/json" \
    -d '{
        "features": ["driver_stats:conv_rate", "driver_stats:avg_daily_trips"],
        "entities": {"driver_id": [1001, 1002]}
    }'
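The same request can be issued from Python with only the standard library. This is a minimal sketch that mirrors the curl call: the endpoint and payload shape are taken from the example above, while the function names and the 5-second timeout are assumptions. The network call itself is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_request(features, entities, base_url="http://localhost:6566"):
    """Build the POST request for the feature server's get-online-features endpoint."""
    body = json.dumps({"features": features, "entities": entities}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/get-online-features",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_online_features(features, entities, base_url="http://localhost:6566", timeout=5.0):
    """Send the request and return the decoded JSON response."""
    request = build_request(features, entities, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request(
    ["driver_stats:conv_rate", "driver_stats:avg_daily_trips"],
    {"driver_id": [1001, 1002]},
)
print(request.get_method(), request.full_url)

# With `feast serve` running, uncomment to send the request:
# print(fetch_online_features(["driver_stats:conv_rate"], {"driver_id": [1001]}))
```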
Advanced Usage
Feature Services
from feast import FeatureService
driver_v1 = FeatureService(
    name="driver_activity_v1",
    features=[
        driver_stats_fv[["conv_rate", "avg_daily_trips"]],
        driver_performance,
    ],
    tags={"release": "production"},
)
# Use in retrieval
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=driver_v1,
).to_df()
Materialization and Scheduling
# Materialize features from offline to online store
feast materialize 2024-01-01T00:00:00 2024-06-30T00:00:00
# Incremental materialization (from last run to now)
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
# Push features to the online store from Python (streaming)
from datetime import datetime
from feast import FeatureStore
import pandas as pd
store = FeatureStore(repo_path=".")
event_df = pd.DataFrame({
    "driver_id": [1001],
    "current_trips": [15],
    "current_rating": [4.8],
    "event_timestamp": [datetime.utcnow()],
})
store.push("driver_stats_push", event_df)
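For scheduled runs, the incremental command has a Python SDK counterpart, FeatureStore.materialize_incremental(end_date=...), which a cron job or workflow task can call. The sketch below wraps that call in a small function. run_incremental_materialization and StubStore are hypothetical names, and the stub exists only so the example runs without a Feast repository.

```python
from datetime import datetime, timezone


def run_incremental_materialization(store, now=None):
    """Materialize everything up to `now` (UTC) and return the cutoff used.

    `store` is expected to be a feast.FeatureStore; only its
    materialize_incremental(end_date=...) method is called.
    """
    end_date = now or datetime.now(timezone.utc)
    store.materialize_incremental(end_date=end_date)
    return end_date


class StubStore:
    """Stand-in for FeatureStore so the sketch runs without a Feast repo."""

    def __init__(self):
        self.calls = []

    def materialize_incremental(self, end_date):
        self.calls.append(end_date)


stub = StubStore()
cutoff = run_incremental_materialization(
    stub, now=datetime(2024, 6, 30, tzinfo=timezone.utc)
)
print(cutoff.isoformat())
```

In a real job, the stub would be replaced by FeatureStore(repo_path=".") and the function invoked on whatever interval matches the freshness the online features need.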
Troubleshooting
| Issue | Solution |
|---|---|
| feast apply fails | Check Python syntax in feature definitions. Validate feature_store.yaml |
| Features returning None | Ensure materialization is up to date. Check that entity key values match and that the values have not aged past the feature view's TTL |
| Point-in-time join incorrect | Verify event_timestamp column in entity DataFrame. Check TTL settings |
| Online store connection error | Verify credentials and network access. Check online store configuration |
| Slow historical retrieval | Use partitioned data sources. Reduce date range in entity DataFrame |
| Registry sync issues | Use SQL registry for multi-user access. Avoid file-based registry in production |
| Feature server timeout | Increase server timeout. Check online store latency |
| Push source not updating | Verify the name passed to store.push matches the PushSource name. Check that the feature view has online=True |
| Schema mismatch | Ensure feature types match data source types. Run feast apply after changes |
| Materialization OOM | Reduce date range per materialization run. Increase worker memory |