Firecrawl Cheat Sheet

Overview

Firecrawl is a web scraping API designed to convert entire websites into clean, LLM-ready Markdown or structured data. Unlike traditional scrapers, Firecrawl handles JavaScript-rendered pages, manages proxies and rate limiting, bypasses common anti-bot measures, and produces clean text output stripped of navigation elements, ads, and boilerplate. It provides scrape (single page), crawl (entire site), map (discover URLs), and extract (structured data) endpoints.

The tool is purpose-built for AI applications: populating RAG knowledge bases, creating training datasets, monitoring competitor content, and extracting structured information from web pages. Firecrawl offers both a hosted API and a self-hosted open-source version.

Installation

Python SDK

pip install firecrawl-py

Node.js SDK

npm install @mendable/firecrawl-js

Self-Hosted

git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
cp .env.example .env
# Edit .env with configuration

docker compose up -d
# API at http://localhost:3002

Core Operations

Scrape (Single Page)

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Basic scrape to Markdown
result = app.scrape_url("https://example.com/article")
print(result["markdown"])

# Scrape with options
result = app.scrape_url(
    "https://example.com/article",
    params={
        "formats": ["markdown", "html", "links"],
        "onlyMainContent": True,
        "waitFor": 2000,  # Wait 2s for JS rendering
        "timeout": 30000,
        "headers": {"Accept-Language": "en-US"},
    }
)

print(result["markdown"])         # Clean Markdown
print(result["html"])             # Raw HTML
print(result["metadata"]["title"])  # Page title
print(result["links"])            # Extracted links

Crawl (Entire Site)

# Start crawl job
crawl_result = app.crawl_url(
    "https://docs.example.com",
    params={
        "limit": 100,                    # Max pages
        "maxDepth": 3,                   # Max link depth
        "includePaths": ["/docs/*"],     # Only crawl docs
        "excludePaths": ["/blog/*"],     # Skip blog
        "allowBackwardLinks": False,
        "allowExternalLinks": False,
    },
    poll_interval=5  # Check status every 5 seconds
)

# Process results
for page in crawl_result["data"]:
    print(f"URL: {page['metadata']['sourceURL']}")
    print(f"Title: {page['metadata']['title']}")
    print(f"Content: {page['markdown'][:200]}...")
    print("---")

Async Crawl

# Start async crawl (returns job ID)
job = app.async_crawl_url(
    "https://docs.example.com",
    params={"limit": 500, "maxDepth": 5}
)
print(f"Job ID: {job['id']}")

# Check status
status = app.check_crawl_status(job["id"])
print(f"Status: {status['status']}")
print(f"Completed: {status['completed']}/{status['total']}")

# Get results when complete
if status["status"] == "completed":
    for page in status["data"]:
        print(page["metadata"]["sourceURL"])

Map (Discover URLs)

# Discover all URLs on a site
map_result = app.map_url(
    "https://example.com",
    params={
        "search": "pricing",  # Optional: filter by relevance
        "limit": 500
    }
)

for url in map_result["links"]:
    print(url)

Extract (Structured Data)

# Extract structured data using LLM
result = app.scrape_url(
    "https://example.com/product",
    params={
        "formats": ["extract"],
        "extract": {
            "schema": {
                "type": "object",
                "properties": {
                    "product_name": {"type": "string"},
                    "price": {"type": "number"},
                    "description": {"type": "string"},
                    "features": {
                        "type": "array",
                        "items": {"type": "string"}
                    },
                    "rating": {"type": "number"}
                },
                "required": ["product_name", "price"]
            }
        }
    }
)

extracted = result["extract"]
print(f"Product: {extracted['product_name']}")
print(f"Price: ${extracted['price']}")
print(f"Features: {extracted['features']}")

Batch Scrape

# Scrape multiple URLs in parallel
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

batch_result = app.batch_scrape_urls(
    urls,
    params={"formats": ["markdown"], "onlyMainContent": True}
)

for page in batch_result["data"]:
    print(f"{page['metadata']['sourceURL']}: {len(page['markdown'])} chars")

REST API

# Scrape
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "onlyMainContent": true
  }'

# Crawl
curl -X POST https://api.firecrawl.dev/v1/crawl \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 100,
    "maxDepth": 3
  }'

# Check crawl status
curl https://api.firecrawl.dev/v1/crawl/JOB_ID \
  -H "Authorization: Bearer fc-YOUR_API_KEY"

# Map
curl -X POST https://api.firecrawl.dev/v1/map \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Configuration

Scrape Parameters

Parameter	Type	Description
`formats`	array	Output formats: `markdown`, `html`, `links`, `extract`, `screenshot`
`onlyMainContent`	bool	Remove nav, headers, footers
`includeTags`	array	CSS selectors to include
`excludeTags`	array	CSS selectors to exclude
`waitFor`	int	Milliseconds to wait for JS
`timeout`	int	Request timeout in ms
`headers`	object	Custom HTTP headers
`actions`	array	Browser actions before scraping

Crawl Parameters

Parameter	Type	Description
`limit`	int	Max pages to crawl
`maxDepth`	int	Max link depth from start
`includePaths`	array	URL patterns to include
`excludePaths`	array	URL patterns to exclude
`allowBackwardLinks`	bool	Allow links to parent pages
`allowExternalLinks`	bool	Follow external links
`ignoreSitemap`	bool	Skip sitemap discovery

Browser Actions

# Interact with page before scraping
result = app.scrape_url(
    "https://example.com",
    params={
        "formats": ["markdown"],
        "actions": [
            {"type": "click", "selector": "#load-more"},
            {"type": "wait", "milliseconds": 2000},
            {"type": "scroll", "direction": "down", "amount": 3},
            {"type": "screenshot"},
        ]
    }
)

Advanced Usage

Integration with RAG

from firecrawl import FirecrawlApp
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

app = FirecrawlApp(api_key="fc-YOUR_KEY")

# Crawl documentation site
crawl_result = app.crawl_url(
    "https://docs.example.com",
    params={"limit": 200, "maxDepth": 3, "includePaths": ["/docs/*"]}
)

# Chunk and index
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_chunks = []

for page in crawl_result["data"]:
    chunks = splitter.split_text(page["markdown"])
    for chunk in chunks:
        all_chunks.append({
            "text": chunk,
            "source": page["metadata"]["sourceURL"],
            "title": page["metadata"].get("title", "")
        })

# Store in vector DB
texts = [c["text"] for c in all_chunks]
metadatas = [{"source": c["source"], "title": c["title"]} for c in all_chunks]

vectorstore = Chroma.from_texts(
    texts=texts,
    metadatas=metadatas,
    embedding=OpenAIEmbeddings()
)

print(f"Indexed {len(all_chunks)} chunks from {len(crawl_result['data'])} pages")

Monitoring and Change Detection

import hashlib
import json
from datetime import datetime

def monitor_pages(urls, app, state_file="page_state.json"):
    try:
        with open(state_file) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {}

    changes = []
    for url in urls:
        result = app.scrape_url(url, params={"formats": ["markdown"]})
        content = result["markdown"]
        content_hash = hashlib.md5(content.encode()).hexdigest()

        if url in state and state[url]["hash"] != content_hash:
            changes.append({
                "url": url,
                "changed_at": datetime.now().isoformat(),
                "old_hash": state[url]["hash"],
                "new_hash": content_hash
            })

        state[url] = {"hash": content_hash, "checked": datetime.now().isoformat()}

    with open(state_file, "w") as f:
        json.dump(state, f)

    return changes

Self-Hosted Configuration

# .env for self-hosted
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
SUPABASE_URL=http://supabase:54321
USE_DB_AUTHENTICATION=false
SCRAPING_BEE_API_KEY=optional
OPENAI_API_KEY=sk-...  # For extract endpoint

Troubleshooting

Issue	Solution
401 Unauthorized	Check API key is valid and has credits
Page returns empty	Site may block scrapers; try `waitFor` param
JavaScript not rendering	Increase `waitFor` timeout (3000-5000ms)
Crawl missing pages	Check `includePaths`, increase `maxDepth`
Rate limited	Reduce concurrent requests, add delays
Timeout errors	Increase `timeout` param, check site availability
Extract returns null	Verify JSON schema matches page content
Self-hosted Redis error	Check Redis connection in `.env`

# Test API key
curl https://api.firecrawl.dev/v1/scrape \
  -H "Authorization: Bearer fc-YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'

# Check self-hosted health
curl http://localhost:3002/health

# View crawl logs
docker compose logs -f api