
Great Expectations Cheat Sheet

Overview

Great Expectations (GX) is an open-source Python framework for validating, documenting, and profiling data. It enables data teams to define expectations (assertions) about their data and automatically validate datasets against those expectations. Each validation run produces detailed results that can be rendered as human-readable data docs, making it easy to catch data quality issues before they propagate through pipelines.

Great Expectations integrates with major data platforms including Pandas, Spark, SQL databases (PostgreSQL, BigQuery, Snowflake, Redshift), and data orchestrators like Airflow, Dagster, and Prefect. The framework supports over 300 built-in expectations covering null checks, uniqueness, value ranges, regex patterns, statistical distributions, and more. Custom expectations can be created for domain-specific validation logic. GX Cloud provides a hosted management layer with a web UI for team collaboration.

Installation

# Install Great Expectations
pip install great_expectations

# Install with specific backends
pip install 'great_expectations[spark]'
pip install 'great_expectations[sqlalchemy]'
pip install 'great_expectations[snowflake]'
pip install 'great_expectations[bigquery]'

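# The commands below use the legacy CLI from 0.x releases.
# GX 1.0 removed the CLI; with 1.x, use the Python API shown under Core Workflow.

# Initialize a project
great_expectations init

# Verify installation
great_expectations --version

# Build Data Docs
great_expectations docs build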
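
The installed version can also be checked from Python, which works whether or not the installed release ships a CLI. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of GX:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "great_expectations") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "great_expectations is not installed")
```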

Core Workflow

Initialize and Configure Context

import great_expectations as gx

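# Get a Data Context: loads the File Data Context of an existing project if one is found
# (pass mode="file" to create one in the current directory)
context = gx.get_context()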

# Or create an ephemeral context (no filesystem)
context = gx.get_context(mode="ephemeral")

Connect to Data

# Pandas data source
datasource = context.data_sources.add_pandas("my_pandas_ds")
data_asset = datasource.add_csv_asset(
    name="orders",
    filepath_or_buffer="data/orders.csv",
)

# SQL data source (PostgreSQL)
datasource = context.data_sources.add_postgres(
    name="my_pg_ds",
    connection_string="postgresql+psycopg2://user:pass@host:5432/db",
)
data_asset = datasource.add_table_asset(
    name="orders",
    table_name="orders",
    schema_name="public",
)

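# Spark data source
# NOTE: the data source and asset methods for file-based Spark data differ between
# GX versions; check the API reference for your release before using this snippet.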
datasource = context.data_sources.add_spark("my_spark_ds")
data_asset = datasource.add_parquet_asset(
    name="events",
    path="s3://bucket/events/",
)

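# Create a batch definition
# (this whole-table variant applies to the SQL table asset above;
# other asset types expose their own add_batch_definition... methods)
batch_def = data_asset.add_batch_definition_whole_table("full_table")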
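
The PostgreSQL example above hardcodes credentials in the connection string. One possible alternative, sketched here under assumptions, is to build the string from environment variables and URL-encode the user and password so that characters such as `@` or `/` do not break the URL. The helper name and the `PG_*` variable names are hypothetical:

```python
import os
from urllib.parse import quote_plus


def postgres_connection_string(
    user: str, password: str, host: str, database: str, port: int = 5432
) -> str:
    """Build a SQLAlchemy-style PostgreSQL URL, escaping special characters."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    # PG_USER, PG_PASSWORD, PG_HOST and PG_DATABASE are hypothetical variable names.
    url = postgres_connection_string(
        user=os.environ.get("PG_USER", "user"),
        password=os.environ.get("PG_PASSWORD", "pass"),
        host=os.environ.get("PG_HOST", "host"),
        database=os.environ.get("PG_DATABASE", "db"),
    )
    # Pass `url` as connection_string=... when adding the data source.
```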

Define Expectations

# Create an expectation suite
suite = context.suites.add(
    gx.ExpectationSuite(name="orders_suite")
)

# Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="amount", min_value=0, max_value=100000
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="status",
        value_set=["pending", "completed", "cancelled", "refunded"]
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToMatchRegex(
        column="email", regex=r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
    )
)
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=1000, max_value=10000000
    )
)
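
Before committing a regex to a suite, it can help to try it against a few sample values. The sketch below exercises the email pattern from the example above with Python's `re` module. It only approximates what GX does: the regex engine actually used depends on the backend (for example a SQL dialect), so treat the results as indicative.

```python
import re

# Same pattern as the ExpectColumnValuesToMatchRegex example above.
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")


def matches_email_pattern(value: str) -> bool:
    """Return True if `value` matches the email pattern."""
    return EMAIL_PATTERN.match(value) is not None


if __name__ == "__main__":
    samples = [
        "alice@example.com",
        "first.last+tag@mail.example.co.uk",
        "user@localhost",
        "missing-at.example.com",
    ]
    for sample in samples:
        print(f"{sample!r}: {matches_email_pattern(sample)}")
```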

Validate Data

# Create a validation definition
validation_def = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="orders_validation",
        data=batch_def,
        suite=suite,
    )
)

# Run validation
result = validation_def.run()
print(f"Success: {result.success}")
print(f"Statistics: {result.statistics}")

# Create a checkpoint for production runs
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name="orders_checkpoint",
        validation_definitions=[validation_def],
        actions=[
            gx.checkpoint.UpdateDataDocsAction(name="update_docs"),
        ],
    )
)

# Run checkpoint
checkpoint_result = checkpoint.run()
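
The `result.statistics` value printed above is a dictionary. A small formatter can turn it into a single log line. This is a sketch that assumes the dictionary has the keys `evaluated_expectations`, `successful_expectations`, `unsuccessful_expectations`, and `success_percent`. Verify the key names against your installed version. `summarize_statistics` is a hypothetical helper.

```python
def summarize_statistics(statistics: dict) -> str:
    """Format validation statistics as a one-line summary."""
    evaluated = statistics["evaluated_expectations"]
    passed = statistics["successful_expectations"]
    failed = statistics["unsuccessful_expectations"]
    percent = statistics["success_percent"]
    # success_percent may be missing a value when nothing was evaluated
    percent_text = "n/a" if percent is None else f"{percent:.1f}%"
    return f"{passed}/{evaluated} expectations passed ({percent_text}), {failed} failed"


if __name__ == "__main__":
    # Illustrative numbers only, not output from a real run.
    example = {
        "evaluated_expectations": 6,
        "successful_expectations": 5,
        "unsuccessful_expectations": 1,
        "success_percent": 500 / 6,
    }
    print(summarize_statistics(example))
```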

Common Expectations

Expectation | Description
ExpectColumnValuesToNotBeNull | No null values in column
ExpectColumnValuesToBeUnique | All values are unique
ExpectColumnValuesToBeInSet | Values match allowed set
ExpectColumnValuesToBeBetween | Values within a numeric or date range
ExpectColumnValuesToMatchRegex | Values match regex pattern
ExpectColumnValueLengthsToBeBetween | String length within range
ExpectColumnValuesToBeOfType | Column data type check
ExpectTableRowCountToBeBetween | Row count within range
ExpectTableColumnCountToEqual | Expected number of columns
ExpectColumnToExist | Column exists in dataset
ExpectCompoundColumnsToBeUnique | Composite uniqueness
ExpectColumnPairValuesAToBeGreaterThanB | Column A > Column B
ExpectColumnDistinctValuesToBeInSet | Distinct values match expected set
ExpectColumnMeanToBeBetween | Mean value within range
ExpectColumnStdevToBeBetween | Standard deviation within range
ExpectColumnMedianToBeBetween | Median value within range
ExpectColumnValuesToBeDateutilParseable | Values parseable as datetime
ExpectMulticolumnSumToEqual | Sum of columns equals value
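
Configuration files and older examples often refer to expectations by a snake_case type name, such as `expect_column_values_to_not_be_null`, rather than by class name. For the class names in the table above, the snake_case form can be derived mechanically. The helper below is a hypothetical convenience, not a GX function, and it would not handle class names that contain acronyms or digits.

```python
import re


def expectation_type(class_name: str) -> str:
    """Convert a CamelCase expectation class name to its snake_case form."""
    # Insert an underscore before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


if __name__ == "__main__":
    print(expectation_type("ExpectColumnValuesToNotBeNull"))
    print(expectation_type("ExpectColumnPairValuesAToBeGreaterThanB"))
```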

CLI Commands

The commands below belong to the legacy CLI shipped with 0.x releases; GX 1.0 removed the CLI in favor of the Python API.

Command | Description
great_expectations init | Initialize a new GX project
great_expectations docs build | Build and open Data Docs
great_expectations docs list | List Data Docs sites
great_expectations suite list | List expectation suites
great_expectations checkpoint list | List checkpoints
great_expectations checkpoint run <name> | Run a checkpoint
great_expectations datasource list | List configured data sources
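
In releases without the CLI, similar listings can be produced from a Data Context in Python. The sketch below assumes the context exposes `suites.all()` and `checkpoints.all()`, as the 1.x-style API used in Core Workflow suggests, and that `build_data_docs()` is available. Confirm these against your installed version. The function names are hypothetical.

```python
def list_project_objects(context) -> dict:
    """Collect suite and checkpoint names, roughly like `suite list` and `checkpoint list`."""
    return {
        "suites": sorted(suite.name for suite in context.suites.all()),
        "checkpoints": sorted(cp.name for cp in context.checkpoints.all()),
    }


def report_on_project() -> None:
    """Print the listing for the project in the current directory and rebuild Data Docs."""
    # Imported here so that list_project_objects has no hard dependency on GX.
    import great_expectations as gx

    context = gx.get_context()
    print(list_project_objects(context))
    context.build_data_docs()  # rough counterpart of `docs build`
```

Calling `report_on_project()` from a project directory prints the names and rebuilds the Data Docs sites.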

Configuration

great_expectations.yml

config_version: 4

data_docs_sites:
  local_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

  s3_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: gx-data-docs
      prefix: docs/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  validation_results_store:
    class_name: ValidationResultsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/
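
A Data Docs site can also be registered from Python instead of editing the YAML by hand. The sketch below builds the same structure as `local_site` above as a dictionary. It assumes the context provides an `add_data_docs_site(site_name=..., site_config=...)` method, which should be verified for your version. Both helper names are hypothetical.

```python
def filesystem_site_config(base_directory: str) -> dict:
    """Return a site configuration mirroring the local_site YAML entry."""
    return {
        "class_name": "SiteBuilder",
        "store_backend": {
            "class_name": "TupleFilesystemStoreBackend",
            "base_directory": base_directory,
        },
        "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
    }


def register_local_site(
    context,
    site_name: str = "local_site",
    base_directory: str = "uncommitted/data_docs/local_site/",
) -> None:
    """Register a filesystem-backed Data Docs site on the given context."""
    # Assumption: add_data_docs_site exists on the context object.
    context.add_data_docs_site(
        site_name=site_name,
        site_config=filesystem_site_config(base_directory),
    )
```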

Advanced Usage

Custom Expectations

import great_expectations as gx

class ExpectColumnValuesToBeValidEmail(gx.expectations.UnexpectedRowsExpectation):
    # Every row returned by the query counts as a failure.
    # {batch} is replaced with the batch under validation; the column name is written out.
    unexpected_rows_query: str = """
        SELECT * FROM {batch}
        WHERE email IS NOT NULL
        AND email NOT LIKE '%@%.%'
    """
    description: str = "Validates email format"
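
To see which values the filter in the query above would flag (`IS NOT NULL` combined with `NOT LIKE '%@%.%'`), the same conditions can be run against a throwaway SQLite table. This is only an illustration of the SQL logic: the table name `orders`, the `email` column, and the sample values are assumptions, and GX itself is not involved.

```python
import sqlite3

# {batch} stands in for the table under test, as in the expectation's query.
UNEXPECTED_ROWS_QUERY = """
    SELECT email FROM {batch}
    WHERE email IS NOT NULL
    AND email NOT LIKE '%@%.%'
    ORDER BY email
"""


def find_unexpected_emails(emails):
    """Return the values that the query would report as unexpected."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (email TEXT)")
        conn.executemany(
            "INSERT INTO orders (email) VALUES (?)", [(e,) for e in emails]
        )
        query = UNEXPECTED_ROWS_QUERY.replace("{batch}", "orders")
        return [row[0] for row in conn.execute(query)]
    finally:
        conn.close()


if __name__ == "__main__":
    sample = ["alice@example.com", "not-an-email", None, "bob@localhost"]
    print(find_unexpected_emails(sample))  # ['bob@localhost', 'not-an-email']
```

The `LIKE` pattern is a coarse check: it only requires an `@` followed somewhere by a `.`, so it is weaker than the regex used in Define Expectations.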

Integration with Airflow

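from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def validate_data():
    # Import inside the task to keep DAG parsing fast
    import great_expectations as gx
    context = gx.get_context()
    checkpoint = context.checkpoints.get("orders_checkpoint")
    result = checkpoint.run()
    if not result.success:
        # Raising marks the task as failed and blocks downstream tasks
        raise ValueError("Data validation failed!")

with DAG(
    "data_pipeline",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),  # example date; scheduled DAGs need a start_date
    catchup=False,  # do not backfill runs between start_date and today
) as dag:
    validate = PythonOperator(
        task_id="validate_orders",
        python_callable=validate_data,
    )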

Profiling

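# Auto-generate expectations from data
# NOTE: RuleBasedProfiler belongs to the legacy 0.x API and is not part of the
# 1.x API used in the rest of this sheet.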
from great_expectations.rule_based_profiler import RuleBasedProfiler

profiler = RuleBasedProfiler(
    name="my_profiler",
    config_version=1.0,
    rules={
        "column_ranges": {
            "domain_builder": {"class_name": "ColumnDomainBuilder"},
            "expectation_configuration_builders": [
                {
                    "class_name": "DefaultExpectationConfigurationBuilder",
                    "expectation_type": "expect_column_values_to_not_be_null",
                }
            ],
        }
    },
)

Troubleshooting

Issue | Solution
Data Docs not rendering | Run great_expectations docs build (legacy CLI) or rebuild the docs from Python. Check store backend config.
Connection errors to database | Verify connection string format. Install required SQLAlchemy driver.
Expectation suite not found | Check suite name spelling. List available suites (great_expectations suite list on the legacy CLI, context.suites.all() in 1.x).
Checkpoint fails silently | Add UpdateDataDocsAction to see results. Check validation definition.
Spark backend OOM | Increase Spark driver/executor memory. Reduce batch size.
Custom expectation not loading | Ensure it’s in plugins/expectations/ dir. Check class naming convention.
Slow validation on large data | Use sampling in batch definition. Validate on partitions instead of full table.
S3 permission errors | Check IAM credentials and bucket policy. Verify boto3 is installed.
Version migration issues | Check the migration guide for the target version (the legacy 0.x CLI offered great_expectations project upgrade).
Type errors in expectations | Verify column types match expectation parameters. Cast values explicitly.
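
Several rows in the table above come down to a missing Python package (a SQLAlchemy driver, boto3). A quick way to check is to ask the import system whether a module can be found without importing it. A minimal sketch for top-level module names; `missing_modules` is a hypothetical helper and the module list is only an example:

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    # Example set: SQLAlchemy, a PostgreSQL driver, and the AWS SDK.
    print(missing_modules(["sqlalchemy", "psycopg2", "boto3"]))
```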