Overview
Great Expectations (GX) is an open-source Python framework for validating, documenting, and profiling data. It enables data teams to define expectations (assertions) about their data and automatically validate datasets against those expectations. Each validation run produces detailed results that can be rendered as human-readable data docs, making it easy to catch data quality issues before they propagate through pipelines.
Great Expectations integrates with major data platforms including Pandas, Spark, SQL databases (PostgreSQL, BigQuery, Snowflake, Redshift), and data orchestrators like Airflow, Dagster, and Prefect. The framework supports over 300 built-in expectations covering null checks, uniqueness, value ranges, regex patterns, statistical distributions, and more. Custom expectations can be created for domain-specific validation logic. GX Cloud provides a hosted management layer with a web UI for team collaboration.
Installation
# Install Great Expectations
pip install great_expectations
# Install with specific backends
pip install 'great_expectations[spark]'
pip install 'great_expectations[sqlalchemy]'
pip install 'great_expectations[snowflake]'
pip install 'great_expectations[bigquery]'
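# The great_expectations CLI commands below ship with pre-1.0 releases; GX 1.x removed the CLI in favor of the Python API
# Initialize a project
great_expectations init
# Verify installation
great_expectations --version
# Build Data Docs and open them in the browser
great_expectations docs build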
Core Workflow
Initialize and Configure Context
import great_expectations as gx
# Create (or load) a File Data Context that persists configuration to disk
context = gx.get_context(mode="file")
# Or create an ephemeral context (no filesystem)
context = gx.get_context(mode="ephemeral")
Connect to Data
# The three blocks below are alternatives; each one reassigns datasource and data_asset
# Pandas data source
datasource = context.data_sources.add_pandas("my_pandas_ds")
data_asset = datasource.add_csv_asset(
    name="orders",
    filepath_or_buffer="data/orders.csv",
)
# SQL data source (PostgreSQL)
datasource = context.data_sources.add_postgres(
    name="my_pg_ds",
    connection_string="postgresql+psycopg2://user:pass@host:5432/db",
)
data_asset = datasource.add_table_asset(
    name="orders",
    table_name="orders",
    schema_name="public",
)
# Spark data source
datasource = context.data_sources.add_spark("my_spark_ds")
data_asset = datasource.add_parquet_asset(
    name="events",
    path="s3://bucket/events/",
)
# Create batch definition (add_batch_definition_whole_table is available on SQL table assets)
batch_def = data_asset.add_batch_definition_whole_table("full_table")
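The PostgreSQL connection string above embeds credentials directly. As a minimal sketch, assuming the credentials arrive through environment variables, the URL could be assembled with the standard library so that special characters in the password are percent-encoded. The helper `build_pg_url` and the variable names `PG_USER`, `PG_PASSWORD`, `PG_HOST`, `PG_PORT`, and `PG_DATABASE` are hypothetical and not part of GX.

```python
import os
from urllib.parse import quote_plus


def build_pg_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Assemble a SQLAlchemy-style PostgreSQL URL, percent-encoding the credentials."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical environment variable names; the fallbacks mirror the placeholder
# values used in the connection string above.
connection_string = build_pg_url(
    user=os.environ.get("PG_USER", "user"),
    password=os.environ.get("PG_PASSWORD", "pass"),
    host=os.environ.get("PG_HOST", "host"),
    port=int(os.environ.get("PG_PORT", "5432")),
    database=os.environ.get("PG_DATABASE", "db"),
)
```

The resulting string can then be passed as `connection_string` when adding the PostgreSQL data source.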
Define Expectations
# Create an expectation suite
suite = context.suites.add(
    gx.ExpectationSuite(name="orders_suite")
)
# Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="amount", min_value=0, max_value=100000
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="status",
        value_set=["pending", "completed", "cancelled", "refunded"]
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToMatchRegex(
        column="email", regex=r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
    )
)
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=1000, max_value=10000000
    )
)
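Before putting a regex into a suite, it can help to try it against a few values locally. The sketch below uses Python's `re` module with the same email pattern as the `ExpectColumnValuesToMatchRegex` example above. The sample values are hand-written for illustration, and SQL or Spark backends apply their own regex engines, so results there may differ.

```python
import re

# Same pattern as the ExpectColumnValuesToMatchRegex example above.
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")

# Hand-written sample values, not real data.
samples = [
    "alice@example.com",
    "bob@localhost",
    "not-an-email",
    "carol+tag@mail.example.org",
]

# Map each sample to whether the pattern accepts it.
matches = {value: bool(EMAIL_PATTERN.match(value)) for value in samples}

for value, ok in matches.items():
    print(f"{value!r}: {'match' if ok else 'no match'}")
```

Here `bob@localhost` is rejected because the pattern requires a dot after the domain label, which may or may not be the behavior you want for internal addresses.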
Validate Data
# Create a validation definition
validation_def = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="orders_validation",
        data=batch_def,
        suite=suite,
    )
)
# Run validation
result = validation_def.run()
print(f"Success: {result.success}")
print(f"Statistics: {result.statistics}")
# Create a checkpoint for production runs
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name="orders_checkpoint",
        validation_definitions=[validation_def],
        actions=[
            gx.checkpoint.UpdateDataDocsAction(name="update_docs"),
        ],
    )
)
# Run checkpoint
checkpoint_result = checkpoint.run()
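The `result.statistics` value printed above is a dictionary of counts. As a rough sketch, assuming it carries the keys `evaluated_expectations`, `successful_expectations`, `unsuccessful_expectations`, and `success_percent` (check these against the version you have installed), a one-line summary for logs could be produced as follows. The helper `summarize_statistics` and the sample dictionary are illustrative only.

```python
def summarize_statistics(statistics: dict) -> str:
    """Format validation statistics as a single log-friendly line."""
    evaluated = statistics.get("evaluated_expectations", 0)
    successful = statistics.get("successful_expectations", 0)
    # Fall back to a computed value if the failure count is absent.
    failed = statistics.get("unsuccessful_expectations", evaluated - successful)
    percent = statistics.get("success_percent")
    percent_text = "n/a" if percent is None else f"{percent:.1f}%"
    return f"{successful}/{evaluated} expectations passed ({percent_text}), {failed} failed"


# Hand-written sample standing in for result.statistics.
sample_statistics = {
    "evaluated_expectations": 6,
    "successful_expectations": 5,
    "unsuccessful_expectations": 1,
    "success_percent": 83.33333333333334,
}

summary = summarize_statistics(sample_statistics)
print(summary)
```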
Common Expectations
| Expectation | Description |
|---|---|
| ExpectColumnValuesToNotBeNull | No null values in column |
| ExpectColumnValuesToBeUnique | All values are unique |
| ExpectColumnValuesToBeInSet | Values match allowed set |
| ExpectColumnValuesToBeBetween | Values within a numeric or date range |
| ExpectColumnValuesToMatchRegex | Values match regex pattern |
| ExpectColumnValueLengthsToBeBetween | String length within range |
| ExpectColumnValuesToBeOfType | Column data type check |
| ExpectTableRowCountToBeBetween | Row count within range |
| ExpectTableColumnCountToEqual | Expected number of columns |
| ExpectColumnToExist | Column exists in dataset |
| ExpectCompoundColumnsToBeUnique | Composite uniqueness |
| ExpectColumnPairValuesAToBeGreaterThanB | Column A > Column B |
| ExpectColumnDistinctValuesToBeInSet | Distinct values match expected set |
| ExpectColumnMeanToBeBetween | Mean value within range |
| ExpectColumnStdevToBeBetween | Standard deviation within range |
| ExpectColumnMedianToBeBetween | Median value within range |
| ExpectColumnValuesToBeDateutilParseable | Values parseable as datetime |
| ExpectMulticolumnSumToEqual | Sum of columns equals value |
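The class names in the table also appear in snake_case form in configuration and results, for example `expect_column_values_to_not_be_null`. The sketch below shows the naming relationship with a small regex-based converter. The helper `class_name_to_type` is hypothetical and only illustrates the convention; GX maintains its own registry of expectation types.

```python
import re


def class_name_to_type(class_name: str) -> str:
    """Convert a CamelCase expectation class name to its snake_case type string."""
    # Insert an underscore before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


for name in ("ExpectColumnValuesToNotBeNull", "ExpectColumnPairValuesAToBeGreaterThanB"):
    print(name, "->", class_name_to_type(name))
```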
CLI Commands
The commands below belong to the legacy CLI shipped with GX releases before 1.0. GX 1.x removed the CLI in favor of the Python API.
| Command | Description |
|---|---|
| great_expectations init | Initialize a new GX project |
| great_expectations docs build | Build and open Data Docs |
| great_expectations docs list | List Data Docs sites |
| great_expectations suite list | List expectation suites |
| great_expectations checkpoint list | List checkpoints |
| great_expectations checkpoint run <name> | Run a checkpoint |
| great_expectations datasource list | List configured data sources |
Configuration
great_expectations.yml
config_version: 4
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  s3_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: gx-data-docs
      prefix: docs/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/
  validation_results_store:
    class_name: ValidationResultsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/
Advanced Usage
Custom Expectations
import great_expectations as gx
class ExpectColumnValuesToBeValidEmail(gx.expectations.UnexpectedRowsExpectation):
    # {batch} stands for the batch under validation; every row the query returns counts as a failure.
    # The email column name is written into the query directly.
    unexpected_rows_query: str = """
        SELECT * FROM {batch}
        WHERE email IS NOT NULL
        AND email NOT LIKE '%@%.%'
    """
    description: str = "Validates email format"
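To see which rows a query of this shape would flag, the same `IS NOT NULL` and `NOT LIKE '%@%.%'` filter can be run against an in-memory SQLite table. This is only a sketch: the `orders` table and its rows are made up, SQLite stands in for whatever backend you validate against, and the table name replaces the batch placeholder.

```python
import sqlite3

# In-memory database with a few hand-written rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "alice@example.com"),
        (2, "bad-email"),
        (3, None),
        (4, "bob@localhost"),
    ],
)

# Rows returned here are the "unexpected" ones: non-null emails lacking an "@" followed by a ".".
unexpected_rows = conn.execute(
    """
    SELECT order_id, email FROM orders
    WHERE email IS NOT NULL
      AND email NOT LIKE '%@%.%'
    ORDER BY order_id
    """
).fetchall()
conn.close()

print(unexpected_rows)
```

The `LIKE` pattern is much looser than a full email regex (a value such as `@.` would pass), so it works as a coarse sanity check rather than strict validation.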
Integration with Airflow
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
def validate_data():
    import great_expectations as gx
    context = gx.get_context()
    checkpoint = context.checkpoints.get("orders_checkpoint")
    result = checkpoint.run()
    # Fail the task so downstream tasks do not run on bad data
    if not result.success:
        raise ValueError("Data validation failed!")
# start_date is an illustrative value; Airflow needs one to schedule the DAG
with DAG(
    "data_pipeline",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    validate = PythonOperator(
        task_id="validate_orders",
        python_callable=validate_data,
    )
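The task above raises a generic error when validation fails. A more informative message could list which expectations failed. The sketch below works on a plain dictionary and assumes a shape resembling the JSON form of a validation result, with a `results` list whose items carry `success` and an `expectation_config` holding `type` and `kwargs`. Those key names are assumptions to check against your installed version, and the helper `failed_expectations` and the sample data are hypothetical.

```python
def failed_expectations(validation_result: dict) -> list[str]:
    """Return short descriptions of the expectations that did not pass."""
    failures = []
    for item in validation_result.get("results", []):
        if not item.get("success", False):
            config = item.get("expectation_config", {})
            # Table-level expectations have no column argument.
            column = config.get("kwargs", {}).get("column", "<table>")
            failures.append(f"{config.get('type', 'unknown')} on {column}")
    return failures


# Hand-written sample standing in for a serialized validation result.
sample_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "order_id"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "type": "expect_column_values_to_be_unique",
                "kwargs": {"column": "order_id"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "type": "expect_table_row_count_to_be_between",
                "kwargs": {"min_value": 1000, "max_value": 10000000},
            },
        },
    ],
}

failures = failed_expectations(sample_result)
message = "Data validation failed: " + "; ".join(failures)
print(message)
```

Inside an Airflow task, a message built this way could replace the fixed string passed to `ValueError`.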
Profiling
# Auto-generate expectations from data
# Legacy API: RuleBasedProfiler ships with pre-1.0 GX releases and is not part of the 1.x API used above
from great_expectations.rule_based_profiler import RuleBasedProfiler
profiler = RuleBasedProfiler(
    name="my_profiler",
    config_version=1.0,
    rules={
        "column_ranges": {
            "domain_builder": {"class_name": "ColumnDomainBuilder"},
            "expectation_configuration_builders": [
                {
                    "class_name": "DefaultExpectationConfigurationBuilder",
                    "expectation_type": "expect_column_values_to_not_be_null",
                }
            ],
        }
    },
)
Troubleshooting
| Issue | Solution |
|---|---|
| Data Docs not rendering | Run great_expectations docs build. Check store backend config |
| Connection errors to database | Verify connection string format. Install required SQLAlchemy driver |
| Expectation suite not found | Check suite name spelling. Run suite list to see available suites |
| Checkpoint fails silently | Add UpdateDataDocsAction to see results. Check validation definition |
| Spark backend OOM | Increase Spark driver/executor memory. Reduce batch size |
| Custom expectation not loading | Ensure it is in the plugins/expectations/ directory. Check class naming convention |
| Slow validation on large data | Use sampling in batch definition. Validate on partitions instead of full table |
| S3 permission errors | Check IAM credentials and bucket policy. Verify boto3 is installed |
| Version migration issues | Follow the migration guide for the target version. Legacy (pre-1.0) projects can run great_expectations project upgrade after upgrading |
| Type errors in expectations | Verify column types match expectation parameters. Cast values explicitly |