
GraphRAG Cheat Sheet

Overview

GraphRAG is Microsoft Research’s approach to retrieval-augmented generation. It constructs a knowledge graph from source documents, then uses graph-based retrieval to answer queries. Unlike traditional vector-similarity RAG, GraphRAG extracts entities and relationships from text, builds a hierarchical community structure using the Leiden algorithm, and generates community summaries at multiple levels of abstraction. These summaries let it answer global summarization queries and multi-hop questions that similarity search over isolated chunks handles poorly.
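To make the indexing idea concrete, here is a toy sketch in plain Python. The relationship triples are hand-written stand-ins for LLM extraction output, and connected components stand in for the Leiden algorithm, which finds finer-grained, hierarchical communities. None of the names below come from the GraphRAG codebase.

```python
from collections import defaultdict

# Hypothetical extraction output: (source, target, description) triples.
relationships = [
    ("Alice", "Acme Corp", "works at"),
    ("Bob", "Acme Corp", "founded"),
    ("Carol", "Globex", "advises"),
]

def build_adjacency(triples):
    """Turn relationship triples into an undirected adjacency map."""
    adjacency = defaultdict(set)
    for source, target, _description in triples:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency

def group_entities(adjacency):
    """Group entities by connected component (a crude stand-in for Leiden)."""
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups

communities = group_entities(build_adjacency(relationships))
print(communities)
```

In the real pipeline each group would then be summarized by an LLM, and those summaries are what global search reads.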

The system operates in two phases: an indexing phase that builds the knowledge graph with entity extraction, relationship mapping, and community detection; and a query phase that uses either local search (entity-focused) or global search (community summaries) to answer questions. A hybrid method, drift search, is also available from the CLI. GraphRAG is strongest on questions that require synthesis across many documents, where top-k chunk retrieval tends to miss the bigger picture.
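The difference between the two query modes can be sketched in a few lines. This is an illustration under simplifying assumptions, not GraphRAG's implementation: entity matching is done by substring here (the real system uses embeddings), and the map and reduce steps are plain string functions standing in for LLM calls. All data and function names are hypothetical.

```python
# Hypothetical index contents, hand-written for illustration.
entity_descriptions = {
    "Alice": "Engineer at Acme Corp.",
    "Acme Corp": "Robotics company founded by Bob.",
    "Globex": "Rival firm advised by Carol.",
}
community_summaries = [
    "Acme Corp and its staff build warehouse robots.",
    "Globex competes with Acme Corp on logistics contracts.",
]

def local_context(query, descriptions):
    """Local search: collect descriptions of entities named in the query."""
    lowered = query.lower()
    return [text for name, text in descriptions.items() if name.lower() in lowered]

def global_context(summaries, map_fn, reduce_fn):
    """Global search: map over every community summary, then reduce."""
    return reduce_fn([map_fn(summary) for summary in summaries])

local = local_context("Where does Alice work?", entity_descriptions)
# Stand-ins for the LLM map and reduce calls: first four words, then a join.
overview = global_context(
    community_summaries,
    map_fn=lambda s: " ".join(s.split()[:4]),
    reduce_fn=" / ".join,
)
print(local)
print(overview)
```

Local search touches only the context around matched entities, while global search reads every community summary, which is why it suits corpus-wide questions and costs more per query.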

Installation

pip install graphrag

# Verify installation
graphrag --version

# Initialize a new project
mkdir my-graphrag && cd my-graphrag
graphrag init --root .

Project Structure

my-graphrag/
├── settings.yaml          # Main configuration
├── .env                   # API keys
├── input/                 # Source documents
│   ├── document1.txt
│   └── document2.txt
├── output/                # Indexing results
│   └── <timestamp>/
│       ├── artifacts/
│       │   ├── create_final_entities.parquet
│       │   ├── create_final_relationships.parquet
│       │   ├── create_final_communities.parquet
│       │   └── create_final_community_reports.parquet
│       └── stats.json
└── prompts/               # Custom prompts
    ├── entity_extraction.txt
    └── community_report.txt

Core Commands

# Initialize project
graphrag init --root ./my-project

# Run indexing pipeline
graphrag index --root ./my-project

# Run with verbose logging
graphrag index --root ./my-project --verbose

# Resume failed indexing
graphrag index --root ./my-project --resume

# Query with local search (entity-focused)
graphrag query --root ./my-project \
  --method local \
  --query "What are the main characters and their relationships?"

# Query with global search (community summaries)
graphrag query --root ./my-project \
  --method global \
  --query "What are the major themes across all documents?"

# Query with drift search (hybrid)
graphrag query --root ./my-project \
  --method drift \
  --query "How do the events in chapter 1 connect to the conclusion?"

Configuration

settings.yaml

# settings.yaml
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat
  model: gpt-4o
  api_base: https://api.openai.com/v1
  max_tokens: 4000
  temperature: 0.0
  top_p: 1.0
  request_timeout: 180.0
  tokens_per_minute: 80000
  requests_per_minute: 40
  concurrent_requests: 25

parallelization:
  stagger: 0.3
  num_threads: 50

embeddings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding
    model: text-embedding-3-small

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

input:
  type: file
  file_type: text
  base_dir: input
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

entity_extraction:
  max_gleanings: 1
  prompt: prompts/entity_extraction.txt
  entity_types:
    - organization
    - person
    - location
    - event
    - technology

community_reports:
  max_length: 2000
  prompt: prompts/community_report.txt

local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_entities: 10
  top_k_relationships: 10
  max_tokens: 12000

global_search:
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32

storage:
  type: file
  base_dir: output

cache:
  type: file
  base_dir: cache
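The chunks settings above (size 1200, overlap 100) describe a sliding window over the text. A minimal sketch of that idea follows, assuming whitespace-separated words as "tokens" to stay dependency-free. GraphRAG counts model tokens, so real chunk boundaries will differ, and chunk_tokens is a hypothetical helper, not a library function.

```python
def chunk_tokens(tokens, size=1200, overlap=100):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

# 3000 fake tokens stand in for a tokenized document.
tokens = [f"w{i}" for i in range(3000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each chunk repeats the last 100 tokens of the previous one, so an entity mention that straddles a boundary still appears whole in at least one chunk.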

Environment Variables

# .env
GRAPHRAG_API_KEY=sk-...

# For Azure OpenAI (use these instead of the key above)
GRAPHRAG_API_KEY=your-azure-key
GRAPHRAG_API_BASE=https://your-resource.openai.azure.com
GRAPHRAG_API_VERSION=2024-02-15-preview
GRAPHRAG_LLM_DEPLOYMENT=gpt-4o
GRAPHRAG_EMBEDDING_DEPLOYMENT=text-embedding-3-small
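The ${GRAPHRAG_API_KEY} placeholders in settings.yaml are filled from these environment variables. The sketch below shows the substitution idea with Python's string.Template. It is not GraphRAG's own config loader; expand_env and the sample values are hypothetical.

```python
from string import Template

settings_snippet = "llm:\n  api_key: ${GRAPHRAG_API_KEY}\n  model: gpt-4o\n"

def expand_env(text, env):
    """Replace ${NAME} placeholders with values from `env`.

    Raises KeyError for a missing variable so a misconfigured key fails early.
    """
    return Template(text).substitute(env)

# A fake environment; in practice you would pass os.environ.
resolved = expand_env(settings_snippet, {"GRAPHRAG_API_KEY": "sk-example"})
print(resolved)
```

Failing loudly on a missing variable is a deliberate choice in this sketch: an empty API key otherwise tends to surface later as a confusing authentication error.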

Search Methods

Local search

Parameter | Description | Default
text_unit_prop | Weight for text unit context | 0.5
community_prop | Weight for community context | 0.1
top_k_entities | Number of entities to retrieve | 10
top_k_relationships | Number of relationships | 10
conversation_history_max_turns | Conversation memory | 5
max_tokens | Max context tokens | 12000

Global search

Parameter | Description | Default
max_tokens | Max total tokens | 12000
data_max_tokens | Max data tokens per map step | 12000
map_max_tokens | Max tokens per map response | 1000
reduce_max_tokens | Max tokens for reduce step | 2000
concurrency | Parallel map operations | 32
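The two proportion settings are easier to reason about as a token budget. The sketch below assumes that text_unit_prop and community_prop are applied to max_tokens and that whatever remains goes to entity and relationship context. That split is an assumption about how the local context is assembled, and local_context_budget is a hypothetical helper.

```python
def local_context_budget(max_tokens=12000, text_unit_prop=0.5, community_prop=0.1):
    """Split the context window between the three context sources."""
    if text_unit_prop + community_prop > 1:
        raise ValueError("text_unit_prop + community_prop must not exceed 1")
    text_units = int(max_tokens * text_unit_prop)
    community = int(max_tokens * community_prop)
    return {
        "text_units": text_units,
        "community_reports": community,
        "entities_and_relationships": max_tokens - text_units - community,
    }

budget = local_context_budget()
print(budget)
```

With the defaults this gives 6000 tokens to raw text units, 1200 to community reports, and 4800 to entities and relationships, so raising text_unit_prop trades graph context for source text.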

Python API

import asyncio
from graphrag.query.indexer_adapters import (
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.structured_search.local_search.mixed_context import LocalSearchMixedContext
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.query.structured_search.global_search.search import GlobalSearch

import pandas as pd

# Load index artifacts. Per the project structure above, they live under
# output/<timestamp>/artifacts; replace <timestamp> with your run's directory.
artifacts_dir = "output/<timestamp>/artifacts"
entity_df = pd.read_parquet(f"{artifacts_dir}/create_final_entities.parquet")
relationship_df = pd.read_parquet(f"{artifacts_dir}/create_final_relationships.parquet")
report_df = pd.read_parquet(f"{artifacts_dir}/create_final_community_reports.parquet")
text_unit_df = pd.read_parquet(f"{artifacts_dir}/create_final_text_units.parquet")

entities = read_indexer_entities(entity_df, entity_embedding_df=None, community_level=2)
relationships = read_indexer_relationships(relationship_df)
reports = read_indexer_reports(report_df, entity_df, community_level=2)
text_units = read_indexer_text_units(text_unit_df)

# Setup LLM
llm = ChatOpenAI(api_key="sk-...", model="gpt-4o")

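# Local search
# NOTE: entity_text_embeddings=None is a placeholder. Depending on your GraphRAG
# version, the context builder may also require a vector store of entity
# embeddings and a text embedder before it can match a query to entities.
context_builder = LocalSearchMixedContext(
    entities=entities,
    relationships=relationships,
    community_reports=reports,
    text_units=text_units,
    entity_text_embeddings=None
)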

local_search = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=None,
    llm_params={"max_tokens": 2000, "temperature": 0.0}
)

result = asyncio.run(local_search.asearch("Who is the main character?"))
print(result.response)

Advanced Usage

Custom Entity Types

# settings.yaml
entity_extraction:
  entity_types:
    - person
    - organization
    - technology
    - product
    - vulnerability
    - attack_technique
    - mitigation

Custom Prompts

# prompts/entity_extraction.txt
-Goal-
Given a text document, identify all entities and relationships.

-Entity Types-
{entity_types}

-Steps-
1. Identify all entities with their types
2. For each pair of related entities, describe the relationship
3. Return output as JSON:
{{"entities": [{{"name": "...", "type": "...", "description": "..."}}],
  "relationships": [{{"source": "...", "target": "...", "description": "...", "weight": 1.0}}]}}

-Text-
{input_text}
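The doubled braces in the template suggest Python str.format-style substitution, where {{ and }} become literal JSON braces and {entity_types} and {input_text} are filled in. A minimal sketch under that assumption follows, using an abbreviated copy of the template. render_prompt is a hypothetical helper, not part of GraphRAG.

```python
# Abbreviated copy of the template above; doubled braces are literal JSON braces.
PROMPT_TEMPLATE = (
    "-Entity Types-\n{entity_types}\n\n"
    'Return output as JSON:\n{{"entities": [], "relationships": []}}\n\n'
    "-Text-\n{input_text}\n"
)

def render_prompt(template, entity_types, input_text):
    """Fill the two placeholders; str.format turns '{{' and '}}' into '{' and '}'."""
    return template.format(
        entity_types=", ".join(entity_types),
        input_text=input_text,
    )

prompt = render_prompt(
    PROMPT_TEMPLATE,
    ["person", "organization"],
    "Alice joined Acme Corp.",
)
print(prompt)
```

If you write a literal { or } in a custom prompt without doubling it, this style of substitution raises an error or swallows the text, so check any JSON examples in your prompt first.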

Processing Large Document Sets

# Run indexing with verbose output and keep a copy of the log
graphrag index --root ./my-project --verbose 2>&1 | tee index.log

# If indexing fails midway, resume from checkpoint
graphrag index --root ./my-project --resume

# Monitor progress
tail -f output/indexing-engine.log

Inspecting the Graph

import pandas as pd
import networkx as nx

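# Load graph data (replace <timestamp> with your indexing run's directory)
artifacts_dir = "output/<timestamp>/artifacts"
entities = pd.read_parquet(f"{artifacts_dir}/create_final_entities.parquet")
relationships = pd.read_parquet(f"{artifacts_dir}/create_final_relationships.parquet")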

print(f"Entities: {len(entities)}")
print(f"Relationships: {len(relationships)}")
print(f"Entity types: {entities['type'].value_counts()}")

# Build NetworkX graph
G = nx.Graph()
for _, row in entities.iterrows():
    G.add_node(row["name"], type=row["type"])
for _, row in relationships.iterrows():
    G.add_edge(row["source"], row["target"], weight=row["weight"])

print(f"Connected components: {nx.number_connected_components(G)}")
print(f"Most connected: {sorted(G.degree, key=lambda x: x[1], reverse=True)[:10]}")

Troubleshooting

Issue | Solution
Rate limiting during indexing | Reduce requests_per_minute and concurrent_requests
Empty entity extraction | Increase max_gleanings, check chunk size is adequate
Global search returns generic answers | Increase data_max_tokens, check community reports
Local search misses context | Increase top_k_entities and text_unit_prop
Indexing OOM | Reduce chunk overlap, process fewer files at once
Cost too high | Use gpt-4o-mini for entity extraction, gpt-4o for queries
Cache corruption | Delete cache/ directory and re-run indexing
Parquet read errors | Ensure indexing completed; check output/stats.json

# Validate the configuration without running any pipeline steps
graphrag index --root ./my-project --dry-run

# View indexing statistics
cat output/stats.json | python -m json.tool

# Check entity extraction quality
python -c "
import pandas as pd
df = pd.read_parquet('output/<timestamp>/artifacts/create_final_entities.parquet')
print(df[['name','type','description']].head(20))
"
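For a rough sense of indexing cost before a full run, you can estimate the number of chunks and entity-extraction calls from the settings. This is a back-of-the-envelope sketch, not a GraphRAG feature. It assumes one extraction call per chunk plus one per gleaning pass, and it ignores description summarization, community reports, and embeddings, so treat the result as a lower bound. estimate_indexing is a hypothetical helper.

```python
import math

def estimate_indexing(total_tokens, chunk_size=1200, overlap=100,
                      max_gleanings=1, requests_per_minute=40):
    """Back-of-the-envelope count of chunks and entity-extraction LLM calls."""
    step = chunk_size - overlap
    # Number of sliding windows needed to cover the corpus.
    chunks = max(1, math.ceil(max(total_tokens - overlap, 1) / step))
    # One extraction call per chunk plus one per gleaning pass.
    extraction_calls = chunks * (1 + max_gleanings)
    return {
        "chunks": chunks,
        "extraction_calls": extraction_calls,
        "min_minutes_at_rate_limit": extraction_calls / requests_per_minute,
    }

result = estimate_indexing(total_tokens=1_000_000)
print(result)
```

Multiplying extraction_calls by your average prompt and response size and the model's per-token price gives a first cost figure. Lowering max_gleanings or raising the chunk size reduces it directly.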