Overview
GraphRAG is Microsoft Research’s approach to retrieval-augmented generation (RAG): it constructs a knowledge graph from source documents, then uses graph-based retrieval to answer queries. Unlike vector-similarity RAG, which retrieves isolated chunks, GraphRAG extracts entities and relationships from text, builds a hierarchical community structure using the Leiden algorithm, and generates community summaries at multiple levels of abstraction. This design targets global summarization queries and multi-hop reasoning, where the answer is spread across many chunks rather than contained in a few similar ones.
The system operates in two phases. The indexing phase builds the knowledge graph through entity extraction, relationship mapping, and community detection. The query phase answers questions using either local search (entity-focused) or global search (over community summaries). GraphRAG is best suited to questions that require synthesis across many documents, which chunk-level retrieval alone tends to handle poorly.
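The two phases can be pictured with a toy sketch. It is not how GraphRAG is implemented: the relationships are hard-coded instead of LLM-extracted, connected components stand in for the Leiden algorithm, and a joined member list stands in for an LLM-written community summary. All names here are hypothetical.

```python
from collections import defaultdict

# Hypothetical relationships (source, target); in GraphRAG an LLM extracts these.
EDGES = [
    ("Alice", "Acme Corp"),
    ("Bob", "Acme Corp"),
    ("Carol", "Riverside Lab"),
    ("Dave", "Riverside Lab"),
]


def build_graph(edges):
    """Indexing, step 1: turn relationships into an undirected adjacency map."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def detect_communities(graph):
    """Indexing, step 2: group entities (connected components, NOT Leiden)."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


def local_context(graph, entity):
    """Query, local style: the entity plus its direct neighbours."""
    return sorted({entity} | graph[entity])


def global_context(communities):
    """Query, global style: one line per community, standing in for a summary."""
    return [f"Community {i}: {', '.join(m)}" for i, m in enumerate(communities)]


graph = build_graph(EDGES)
communities = detect_communities(graph)
print(local_context(graph, "Alice"))
print(global_context(communities))
```

Local search starts from specific entities and walks outward; global search never looks at individual entities and reads only the per-community summaries.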
Installation
pip install graphrag
# Verify installation
graphrag --version
# Initialize a new project
mkdir my-graphrag && cd my-graphrag
graphrag init --root .
Project Structure
my-graphrag/
├── settings.yaml # Main configuration
├── .env # API keys
├── input/ # Source documents
│ ├── document1.txt
│ └── document2.txt
├── output/ # Indexing results
│ └── <timestamp>/
│ ├── artifacts/
│ │ ├── create_final_entities.parquet
│ │ ├── create_final_relationships.parquet
│ │ ├── create_final_communities.parquet
│ │ └── create_final_community_reports.parquet
│ └── stats.json
└── prompts/ # Custom prompts
├── entity_extraction.txt
└── community_report.txt
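Because each indexing run writes to its own folder in this layout, scripts usually need to locate the most recent one first. A minimal sketch, assuming run folder names sort chronologically as plain strings (for example a date-time stamp); the helper names are hypothetical and not part of GraphRAG.

```python
from pathlib import Path


def latest_artifacts_dir(output_dir):
    """Return the artifacts/ folder of the newest run, or None if there is none.

    Assumes run folder names sort chronologically as strings.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None
    runs = sorted(
        (p for p in root.iterdir() if (p / "artifacts").is_dir()),
        key=lambda p: p.name,
    )
    return runs[-1] / "artifacts" if runs else None


def list_parquet_tables(artifacts_dir):
    """Names (without extension) of the parquet tables in an artifacts folder."""
    return sorted(p.stem for p in Path(artifacts_dir).glob("*.parquet"))
```

Calling `latest_artifacts_dir("output")` from the project root yields a path that can be passed to `pd.read_parquet` for each table.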
Core Commands
# Initialize project
graphrag init --root ./my-project
# Run indexing pipeline
graphrag index --root ./my-project
# Run with verbose logging
graphrag index --root ./my-project --verbose
# Resume failed indexing (some versions expect a run identifier after --resume; check `graphrag index --help`)
graphrag index --root ./my-project --resume
# Query with local search (entity-focused)
graphrag query --root ./my-project \
--method local \
--query "What are the main characters and their relationships?"
# Query with global search (community summaries)
graphrag query --root ./my-project \
--method global \
--query "What are the major themes across all documents?"
# Query with drift search (hybrid)
graphrag query --root ./my-project \
--method drift \
--query "How do the events in chapter 1 connect to the conclusion?"
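When queries are issued from a script, it helps to build the argument list in one place and reject unknown methods early. A minimal sketch that uses only the flags shown above; `build_query_command` is a hypothetical helper, and the command is only printed here, not executed.

```python
import shlex

# The three search methods shown in the commands above.
SEARCH_METHODS = ("local", "global", "drift")


def build_query_command(root, method, query):
    """Build the argv list for `graphrag query` without running it."""
    if method not in SEARCH_METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {SEARCH_METHODS}")
    return ["graphrag", "query", "--root", str(root), "--method", method, "--query", query]


cmd = build_query_command(
    "./my-project", "global", "What are the major themes across all documents?"
)
print(shlex.join(cmd))  # shell-safe rendering of the command
# To execute it: subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with question text that contains quotes or spaces.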
Configuration
settings.yaml
# settings.yaml
llm:
api_key: ${GRAPHRAG_API_KEY}
type: openai_chat
model: gpt-4o
api_base: https://api.openai.com/v1
max_tokens: 4000
temperature: 0.0
top_p: 1.0
request_timeout: 180.0
tokens_per_minute: 80000
requests_per_minute: 40
concurrent_requests: 25
parallelization:
stagger: 0.3
num_threads: 50
embeddings:
llm:
api_key: ${GRAPHRAG_API_KEY}
type: openai_embedding
model: text-embedding-3-small
chunks:
size: 1200
overlap: 100
group_by_columns: [id]
input:
type: file
file_type: text
base_dir: input
file_encoding: utf-8
file_pattern: ".*\\.txt$"
entity_extraction:
max_gleanings: 1
prompt: prompts/entity_extraction.txt
entity_types:
- organization
- person
- location
- event
- technology
community_reports:
max_length: 2000
prompt: prompts/community_report.txt
local_search:
text_unit_prop: 0.5
community_prop: 0.1
conversation_history_max_turns: 5
top_k_entities: 10
top_k_relationships: 10
max_tokens: 12000
global_search:
max_tokens: 12000
data_max_tokens: 12000
map_max_tokens: 1000
reduce_max_tokens: 2000
concurrency: 32
storage:
type: file
base_dir: output
cache:
type: file
base_dir: cache
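The `chunks.size` and `chunks.overlap` settings describe a sliding window over the token stream: each chunk holds up to `size` tokens and shares `overlap` tokens with the previous one. The sketch below illustrates that idea on a plain list of tokens. It is an illustration under that assumption, not GraphRAG's own chunker, and `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens, size=1200, overlap=100):
    """Split a token list into windows of `size` that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# 2500 dummy tokens with the defaults above
chunks = chunk_tokens(list(range(2500)))
print([len(c) for c in chunks])
```

Larger overlap reduces the chance that an entity mention is cut off at a boundary, at the cost of more chunks and therefore more LLM calls during indexing.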
Environment Variables
# .env
GRAPHRAG_API_KEY=sk-...
# For Azure OpenAI (use these in place of the key above; do not define GRAPHRAG_API_KEY twice)
GRAPHRAG_API_KEY=your-azure-key
GRAPHRAG_API_BASE=https://your-resource.openai.azure.com
GRAPHRAG_API_VERSION=2024-02-15-preview
GRAPHRAG_LLM_DEPLOYMENT=gpt-4o
GRAPHRAG_EMBEDDING_DEPLOYMENT=text-embedding-3-small
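The `${GRAPHRAG_API_KEY}` references in settings.yaml are filled from these variables. The sketch below imitates that substitution so a missing or duplicated variable can be spotted before indexing. It is an assumption-laden stand-in, not GraphRAG's loader: `parse_dotenv` and `render_settings` are hypothetical helpers, and only simple `KEY=value` lines are handled.

```python
import re

_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")  # matches ${NAME}


def parse_dotenv(text):
    """Parse simple KEY=value lines; a later duplicate overrides an earlier one."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def render_settings(settings_text, env):
    """Replace each ${NAME} with env[NAME]; raise KeyError if NAME is undefined."""
    def lookup(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"{name} is referenced in settings but not defined")
        return env[name]
    return _PLACEHOLDER.sub(lookup, settings_text)


env = parse_dotenv("# .env\nGRAPHRAG_API_KEY=example-key\n")
print(render_settings("api_key: ${GRAPHRAG_API_KEY}", env))
```

A regular expression is used instead of `string.Template` because settings.yaml contains a bare `$` in `file_pattern`, which `Template.substitute` would reject.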
Search Methods
Local Search
| Parameter | Description | Default |
|---|---|---|
| text_unit_prop | Weight for text unit context | 0.5 |
| community_prop | Weight for community context | 0.1 |
| top_k_entities | Number of entities to retrieve | 10 |
| top_k_relationships | Number of relationships | 10 |
| conversation_history_max_turns | Conversation memory | 5 |
| max_tokens | Max context tokens | 12000 |
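The two `*_prop` weights can be read as shares of the `max_tokens` context budget. The arithmetic below assumes that whatever is not given to text units or community reports is left for entity and relationship records; treat that split, and the helper name `split_local_budget`, as assumptions rather than documented behavior.

```python
def split_local_budget(max_tokens=12000, text_unit_prop=0.5, community_prop=0.1):
    """Split the context budget into three parts (assumed interpretation)."""
    if text_unit_prop < 0 or community_prop < 0 or text_unit_prop + community_prop > 1:
        raise ValueError("proportions must be non-negative and sum to at most 1")
    text_units = int(max_tokens * text_unit_prop)
    community = int(max_tokens * community_prop)
    return {
        "text_units": text_units,
        "community_reports": community,
        # Remainder, computed with integers to avoid float rounding surprises.
        "entities_and_relationships": max_tokens - text_units - community,
    }


print(split_local_budget())
```

Under this reading, raising `text_unit_prop` gives the model more raw source text but leaves less room for graph records, so the two weights trade off against each other.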
Global Search
| Parameter | Description | Default |
|---|---|---|
| max_tokens | Max total tokens | 12000 |
| data_max_tokens | Max data tokens per map step | 12000 |
| map_max_tokens | Max tokens per map response | 1000 |
| reduce_max_tokens | Max tokens for reduce step | 2000 |
| concurrency | Parallel map operations | 32 |
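These parameters describe a map-reduce pattern: community reports are packed into batches of at most `data_max_tokens`, each batch is answered separately (map), and the partial answers are combined (reduce). The sketch below shows only that control flow. Whitespace word counts stand in for a real tokenizer, `map_fn` and `reduce_fn` stand in for LLM calls, and all names are hypothetical.

```python
def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def batch_reports(reports, data_max_tokens=12000):
    """Greedily pack reports into batches that fit the per-map token budget."""
    batches, current, used = [], [], 0
    for report in reports:
        cost = count_tokens(report)
        if current and used + cost > data_max_tokens:
            batches.append(current)  # current batch is full, start a new one
            current, used = [], 0
        current.append(report)
        used += cost
    if current:
        batches.append(current)
    return batches


def global_search(reports, map_fn, reduce_fn, data_max_tokens=12000):
    """Map each batch to a partial answer, then reduce the partial answers."""
    partials = [map_fn(batch) for batch in batch_reports(reports, data_max_tokens)]
    return reduce_fn(partials)


demo_reports = ["a b c", "d e", "f g h i"]
print(batch_reports(demo_reports, data_max_tokens=5))
```

A report larger than the budget still gets its own batch in this sketch; a real implementation would need to truncate or split it.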
Python API
import asyncio
from graphrag.query.indexer_adapters import (
read_indexer_entities,
read_indexer_relationships,
read_indexer_reports,
read_indexer_text_units,
)
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.structured_search.local_search.mixed_context import LocalSearchMixedContext
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.query.structured_search.global_search.search import GlobalSearch
import pandas as pd
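# Load index artifacts (adjust the path to your run folder, e.g. output/<timestamp>/artifacts/ as shown in Project Structure)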
entity_df = pd.read_parquet("output/artifacts/create_final_entities.parquet")
relationship_df = pd.read_parquet("output/artifacts/create_final_relationships.parquet")
report_df = pd.read_parquet("output/artifacts/create_final_community_reports.parquet")
text_unit_df = pd.read_parquet("output/artifacts/create_final_text_units.parquet")
entities = read_indexer_entities(entity_df, entity_embedding_df=None, community_level=2)
relationships = read_indexer_relationships(relationship_df)
reports = read_indexer_reports(report_df, entity_df, community_level=2)
text_units = read_indexer_text_units(text_unit_df)
# Setup LLM
llm = ChatOpenAI(api_key="sk-...", model="gpt-4o")
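# Local search (None values below are placeholders; a working setup also needs an entity embedding store and a text embedder)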
context_builder = LocalSearchMixedContext(
entities=entities,
relationships=relationships,
community_reports=reports,
text_units=text_units,
entity_text_embeddings=None
)
local_search = LocalSearch(
llm=llm,
context_builder=context_builder,
token_encoder=None,
llm_params={"max_tokens": 2000, "temperature": 0.0}
)
result = asyncio.run(local_search.asearch("Who is the main character?"))
print(result.response)
Advanced Usage
Custom Entity Types
# settings.yaml
entity_extraction:
entity_types:
- person
- organization
- technology
- product
- vulnerability
- attack_technique
- mitigation
Custom Prompts
# prompts/entity_extraction.txt
-Goal-
Given a text document, identify all entities and relationships.
-Entity Types-
{entity_types}
-Steps-
1. Identify all entities with their types
2. For each pair of related entities, describe the relationship
3. Return output as JSON:
{{"entities": [{{"name": "...", "type": "...", "description": "..."}}],
"relationships": [{{"source": "...", "target": "...", "description": "...", "weight": 1.0}}]}}
-Text-
{input_text}
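The doubled braces in the JSON example suggest the template is filled with Python `str.format`-style substitution, where `{{` and `}}` produce literal braces and `{name}` is a placeholder. A minimal sketch of that rendering, under that assumption; the shortened template and `render_prompt` are hypothetical.

```python
# Shortened, hypothetical version of the template above.
PROMPT_TEMPLATE = """-Entity Types-
{entity_types}
-Steps-
Return output as JSON:
{{"entities": [{{"name": "...", "type": "..."}}]}}
-Text-
{input_text}"""


def render_prompt(template, entity_types, input_text):
    """Fill the two placeholders; doubled braces collapse to single braces."""
    return template.format(
        entity_types=", ".join(entity_types),
        input_text=input_text,
    )


prompt = render_prompt(PROMPT_TEMPLATE, ["person", "organization"], "Alice works at Acme.")
print(prompt)
```

If a custom prompt uses single braces around literal JSON, `str.format` raises `KeyError` or `ValueError`, which is a common cause of template errors when editing these files.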
Processing Large Document Sets
# Run indexing with verbose logging and keep a copy of the output in index.log
graphrag index --root ./my-project --verbose 2>&1 | tee index.log
# If indexing fails midway, resume the run (some versions expect a run identifier after --resume)
graphrag index --root ./my-project --resume
# Monitor progress
tail -f output/indexing-engine.log
Inspecting the Graph
import pandas as pd
import networkx as nx
# Load graph data (adjust the path to your run folder, e.g. output/<timestamp>/artifacts/)
entities = pd.read_parquet("output/artifacts/create_final_entities.parquet")
relationships = pd.read_parquet("output/artifacts/create_final_relationships.parquet")
print(f"Entities: {len(entities)}")
print(f"Relationships: {len(relationships)}")
print(f"Entity types: {entities['type'].value_counts()}")
# Build NetworkX graph
G = nx.Graph()
for _, row in entities.iterrows():
G.add_node(row["name"], type=row["type"])
for _, row in relationships.iterrows():
G.add_edge(row["source"], row["target"], weight=row["weight"])
print(f"Connected components: {nx.number_connected_components(G)}")
print(f"Most connected: {sorted(G.degree, key=lambda x: x[1], reverse=True)[:10]}")
Troubleshooting
| Issue | Solution |
|---|---|
| Rate limiting during indexing | Reduce requests_per_minute and concurrent_requests |
| Empty entity extraction | Increase max_gleanings, check chunk size is adequate |
| Global search returns generic answers | Increase data_max_tokens, check community reports |
| Local search misses context | Increase top_k_entities and text_unit_prop |
| Indexing OOM | Reduce chunk overlap, process fewer files at once |
| Cost too high | Use gpt-4o-mini for entity extraction, gpt-4o for queries |
| Cache corruption | Delete cache/ directory and re-run indexing |
| Parquet read errors | Ensure indexing completed; check output/stats.json |
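For the cost and rate-limit rows, a back-of-the-envelope count of chunks and extraction calls helps before starting a run. The sketch assumes sliding-window chunking and one extraction pass plus up to `max_gleanings` follow-up passes per chunk. It ignores description summarization, community reports, and embeddings, so treat the result as a partial, rough figure. `estimate_extraction_calls` is a hypothetical helper.

```python
import math


def estimate_extraction_calls(total_tokens, chunk_size=1200, overlap=100, max_gleanings=1):
    """Rough count of chunks and entity-extraction LLM calls (assumed model)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if total_tokens <= 0:
        return {"chunks": 0, "extraction_calls": 0}
    step = chunk_size - overlap
    # Number of sliding windows needed to cover the whole token stream.
    chunks = max(1, math.ceil((total_tokens - overlap) / step))
    return {"chunks": chunks, "extraction_calls": chunks * (1 + max_gleanings)}


print(estimate_extraction_calls(2500))
```

Multiplying the call count by an average prompt size gives a token total that can be compared against `tokens_per_minute` to estimate a lower bound on indexing time.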
# Validate the configuration without executing pipeline steps (this does not estimate cost)
graphrag index --root ./my-project --dry-run
# View indexing statistics
cat output/stats.json | python -m json.tool
# Check entity extraction quality
python -c "
import pandas as pd
df = pd.read_parquet('output/artifacts/create_final_entities.parquet')
print(df[['name','type','description']].head(20))
"