Overview
Firecrawl is a web scraping API designed to convert entire websites into clean, LLM-ready Markdown or structured data. Unlike traditional scrapers, Firecrawl handles JavaScript-rendered pages, manages proxies and rate limiting, bypasses common anti-bot measures, and produces clean text output stripped of navigation elements, ads, and boilerplate. It provides scrape (single page), crawl (entire site), map (discover URLs), and extract (structured data) endpoints.
The tool is purpose-built for AI applications: populating RAG knowledge bases, creating training datasets, monitoring competitor content, and extracting structured information from web pages. Firecrawl offers both a hosted API and a self-hosted open-source version.
Installation
Python SDK
pip install firecrawl-py
Node.js SDK
npm install @mendable/firecrawl-js
Self-Hosted
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
cp .env.example .env
# Edit .env with configuration
docker compose up -d
# API at http://localhost:3002
Core Operations
Scrape (Single Page)
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
# Basic scrape to Markdown
result = app.scrape_url("https://example.com/article")
print(result["markdown"])
# Scrape with options
result = app.scrape_url(
"https://example.com/article",
params={
"formats": ["markdown", "html", "links"],
"onlyMainContent": True,
"waitFor": 2000, # Wait 2s for JS rendering
"timeout": 30000,
"headers": {"Accept-Language": "en-US"},
}
)
print(result["markdown"]) # Clean Markdown
print(result["html"]) # Raw HTML
print(result["metadata"]["title"]) # Page title
print(result["links"]) # Extracted links
Crawl (Entire Site)
# Start crawl job
crawl_result = app.crawl_url(
"https://docs.example.com",
params={
"limit": 100, # Max pages
"maxDepth": 3, # Max link depth
"includePaths": ["/docs/*"], # Only crawl docs
"excludePaths": ["/blog/*"], # Skip blog
"allowBackwardLinks": False,
"allowExternalLinks": False,
},
poll_interval=5 # Check status every 5 seconds
)
# Process results
for page in crawl_result["data"]:
print(f"URL: {page['metadata']['sourceURL']}")
print(f"Title: {page['metadata']['title']}")
print(f"Content: {page['markdown'][:200]}...")
print("---")
Async Crawl
# Start async crawl (returns job ID)
job = app.async_crawl_url(
"https://docs.example.com",
params={"limit": 500, "maxDepth": 5}
)
print(f"Job ID: {job['id']}")
# Check status
status = app.check_crawl_status(job["id"])
print(f"Status: {status['status']}")
print(f"Completed: {status['completed']}/{status['total']}")
# Get results when complete
if status["status"] == "completed":
for page in status["data"]:
print(page["metadata"]["sourceURL"])
Map (Discover URLs)
# Discover all URLs on a site
map_result = app.map_url(
"https://example.com",
params={
"search": "pricing", # Optional: filter by relevance
"limit": 500
}
)
for url in map_result["links"]:
print(url)
# Extract structured data using LLM
result = app.scrape_url(
"https://example.com/product",
params={
"formats": ["extract"],
"extract": {
"schema": {
"type": "object",
"properties": {
"product_name": {"type": "string"},
"price": {"type": "number"},
"description": {"type": "string"},
"features": {
"type": "array",
"items": {"type": "string"}
},
"rating": {"type": "number"}
},
"required": ["product_name", "price"]
}
}
}
)
extracted = result["extract"]
print(f"Product: {extracted['product_name']}")
print(f"Price: ${extracted['price']}")
print(f"Features: {extracted['features']}")
Batch Scrape
# Scrape multiple URLs in parallel
urls = [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
]
batch_result = app.batch_scrape_urls(
urls,
params={"formats": ["markdown"], "onlyMainContent": True}
)
for page in batch_result["data"]:
print(f"{page['metadata']['sourceURL']}: {len(page['markdown'])} chars")
REST API
# Scrape
curl -X POST https://api.firecrawl.dev/v1/scrape \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown"],
"onlyMainContent": true
}'
# Crawl
curl -X POST https://api.firecrawl.dev/v1/crawl \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.example.com",
"limit": 100,
"maxDepth": 3
}'
# Check crawl status
curl https://api.firecrawl.dev/v1/crawl/JOB_ID \
-H "Authorization: Bearer fc-YOUR_API_KEY"
# Map
curl -X POST https://api.firecrawl.dev/v1/map \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Configuration
Scrape Parameters
| Parameter | Type | Description |
|---|
formats | array | Output formats: markdown, html, links, extract, screenshot |
onlyMainContent | bool | Remove nav, headers, footers |
includeTags | array | CSS selectors to include |
excludeTags | array | CSS selectors to exclude |
waitFor | int | Milliseconds to wait for JS |
timeout | int | Request timeout in ms |
headers | object | Custom HTTP headers |
actions | array | Browser actions before scraping |
Crawl Parameters
| Parameter | Type | Description |
|---|
limit | int | Max pages to crawl |
maxDepth | int | Max link depth from start |
includePaths | array | URL patterns to include |
excludePaths | array | URL patterns to exclude |
allowBackwardLinks | bool | Allow links to parent pages |
allowExternalLinks | bool | Follow external links |
ignoreSitemap | bool | Skip sitemap discovery |
Browser Actions
# Interact with page before scraping
result = app.scrape_url(
"https://example.com",
params={
"formats": ["markdown"],
"actions": [
{"type": "click", "selector": "#load-more"},
{"type": "wait", "milliseconds": 2000},
{"type": "scroll", "direction": "down", "amount": 3},
{"type": "screenshot"},
]
}
)
Advanced Usage
Integration with RAG
from firecrawl import FirecrawlApp
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
app = FirecrawlApp(api_key="fc-YOUR_KEY")
# Crawl documentation site
crawl_result = app.crawl_url(
"https://docs.example.com",
params={"limit": 200, "maxDepth": 3, "includePaths": ["/docs/*"]}
)
# Chunk and index
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_chunks = []
for page in crawl_result["data"]:
chunks = splitter.split_text(page["markdown"])
for chunk in chunks:
all_chunks.append({
"text": chunk,
"source": page["metadata"]["sourceURL"],
"title": page["metadata"].get("title", "")
})
# Store in vector DB
texts = [c["text"] for c in all_chunks]
metadatas = [{"source": c["source"], "title": c["title"]} for c in all_chunks]
vectorstore = Chroma.from_texts(
texts=texts,
metadatas=metadatas,
embedding=OpenAIEmbeddings()
)
print(f"Indexed {len(all_chunks)} chunks from {len(crawl_result['data'])} pages")
Monitoring and Change Detection
import hashlib
import json
from datetime import datetime
def monitor_pages(urls, app, state_file="page_state.json"):
try:
with open(state_file) as f:
state = json.load(f)
except FileNotFoundError:
state = {}
changes = []
for url in urls:
result = app.scrape_url(url, params={"formats": ["markdown"]})
content = result["markdown"]
content_hash = hashlib.md5(content.encode()).hexdigest()
if url in state and state[url]["hash"] != content_hash:
changes.append({
"url": url,
"changed_at": datetime.now().isoformat(),
"old_hash": state[url]["hash"],
"new_hash": content_hash
})
state[url] = {"hash": content_hash, "checked": datetime.now().isoformat()}
with open(state_file, "w") as f:
json.dump(state, f)
return changes
Self-Hosted Configuration
# .env for self-hosted
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
SUPABASE_URL=http://supabase:54321
USE_DB_AUTHENTICATION=false
SCRAPING_BEE_API_KEY=optional
OPENAI_API_KEY=sk-... # For extract endpoint
Troubleshooting
| Issue | Solution |
|---|
| 401 Unauthorized | Check API key is valid and has credits |
| Page returns empty | Site may block scrapers; try waitFor param |
| JavaScript not rendering | Increase waitFor timeout (3000-5000ms) |
| Crawl missing pages | Check includePaths, increase maxDepth |
| Rate limited | Reduce concurrent requests, add delays |
| Timeout errors | Increase timeout param, check site availability |
| Extract returns null | Verify JSON schema matches page content |
| Self-hosted Redis error | Check Redis connection in .env |
# Test API key
curl https://api.firecrawl.dev/v1/scrape \
-H "Authorization: Bearer fc-YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["markdown"]}'
# Check self-hosted health
curl http://localhost:3002/health
# View crawl logs
docker compose logs -f api