Zum Inhalt springen

Firecrawl Cheat Sheet

Overview

Firecrawl is a web scraping API designed to convert entire websites into clean, LLM-ready Markdown or structured data. Unlike traditional scrapers, Firecrawl handles JavaScript-rendered pages, manages proxies and rate limiting, bypasses common anti-bot measures, and produces clean text output stripped of navigation elements, ads, and boilerplate. It provides scrape (single page), crawl (entire site), map (discover URLs), and extract (structured data) endpoints.

The tool is purpose-built for AI applications: populating RAG knowledge bases, creating training datasets, monitoring competitor content, and extracting structured information from web pages. Firecrawl offers both a hosted API and a self-hosted open-source version.

Installation

Python SDK

pip install firecrawl-py

Node.js SDK

npm install @mendable/firecrawl-js

Self-Hosted

git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
cp .env.example .env
# Edit .env with configuration

docker compose up -d
# API at http://localhost:3002

Core Operations

Scrape (Single Page)

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Basic scrape to Markdown
result = app.scrape_url("https://example.com/article")
print(result["markdown"])

# Scrape with options
result = app.scrape_url(
    "https://example.com/article",
    params={
        "formats": ["markdown", "html", "links"],
        "onlyMainContent": True,
        "waitFor": 2000,  # Wait 2s for JS rendering
        "timeout": 30000,
        "headers": {"Accept-Language": "en-US"},
    }
)

print(result["markdown"])         # Clean Markdown
print(result["html"])             # Raw HTML
print(result["metadata"]["title"])  # Page title
print(result["links"])            # Extracted links

Crawl (Entire Site)

# Start crawl job
crawl_result = app.crawl_url(
    "https://docs.example.com",
    params={
        "limit": 100,                    # Max pages
        "maxDepth": 3,                   # Max link depth
        "includePaths": ["/docs/*"],     # Only crawl docs
        "excludePaths": ["/blog/*"],     # Skip blog
        "allowBackwardLinks": False,
        "allowExternalLinks": False,
    },
    poll_interval=5  # Check status every 5 seconds
)

# Process results
for page in crawl_result["data"]:
    print(f"URL: {page['metadata']['sourceURL']}")
    print(f"Title: {page['metadata']['title']}")
    print(f"Content: {page['markdown'][:200]}...")
    print("---")

Async Crawl

# Start async crawl (returns job ID)
job = app.async_crawl_url(
    "https://docs.example.com",
    params={"limit": 500, "maxDepth": 5}
)
print(f"Job ID: {job['id']}")

# Check status
status = app.check_crawl_status(job["id"])
print(f"Status: {status['status']}")
print(f"Completed: {status['completed']}/{status['total']}")

# Get results when complete
if status["status"] == "completed":
    for page in status["data"]:
        print(page["metadata"]["sourceURL"])

Map (Discover URLs)

# Discover all URLs on a site
map_result = app.map_url(
    "https://example.com",
    params={
        "search": "pricing",  # Optional: filter by relevance
        "limit": 500
    }
)

for url in map_result["links"]:
    print(url)

Extract (Structured Data)

# Extract structured data using LLM
result = app.scrape_url(
    "https://example.com/product",
    params={
        "formats": ["extract"],
        "extract": {
            "schema": {
                "type": "object",
                "properties": {
                    "product_name": {"type": "string"},
                    "price": {"type": "number"},
                    "description": {"type": "string"},
                    "features": {
                        "type": "array",
                        "items": {"type": "string"}
                    },
                    "rating": {"type": "number"}
                },
                "required": ["product_name", "price"]
            }
        }
    }
)

extracted = result["extract"]
print(f"Product: {extracted['product_name']}")
print(f"Price: ${extracted['price']}")
print(f"Features: {extracted['features']}")

Batch Scrape

# Scrape multiple URLs in parallel
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

batch_result = app.batch_scrape_urls(
    urls,
    params={"formats": ["markdown"], "onlyMainContent": True}
)

for page in batch_result["data"]:
    print(f"{page['metadata']['sourceURL']}: {len(page['markdown'])} chars")

REST API

# Scrape
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "onlyMainContent": true
  }'

# Crawl
curl -X POST https://api.firecrawl.dev/v1/crawl \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 100,
    "maxDepth": 3
  }'

# Check crawl status
curl https://api.firecrawl.dev/v1/crawl/JOB_ID \
  -H "Authorization: Bearer fc-YOUR_API_KEY"

# Map
curl -X POST https://api.firecrawl.dev/v1/map \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Configuration

Scrape Parameters

ParameterTypeDescription
formatsarrayOutput formats: markdown, html, links, extract, screenshot
onlyMainContentboolRemove nav, headers, footers
includeTagsarrayCSS selectors to include
excludeTagsarrayCSS selectors to exclude
waitForintMilliseconds to wait for JS
timeoutintRequest timeout in ms
headersobjectCustom HTTP headers
actionsarrayBrowser actions before scraping

Crawl Parameters

ParameterTypeDescription
limitintMax pages to crawl
maxDepthintMax link depth from start
includePathsarrayURL patterns to include
excludePathsarrayURL patterns to exclude
allowBackwardLinksboolAllow links to parent pages
allowExternalLinksboolFollow external links
ignoreSitemapboolSkip sitemap discovery

Browser Actions

# Interact with page before scraping
result = app.scrape_url(
    "https://example.com",
    params={
        "formats": ["markdown"],
        "actions": [
            {"type": "click", "selector": "#load-more"},
            {"type": "wait", "milliseconds": 2000},
            {"type": "scroll", "direction": "down", "amount": 3},
            {"type": "screenshot"},
        ]
    }
)

Advanced Usage

Integration with RAG

from firecrawl import FirecrawlApp
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

app = FirecrawlApp(api_key="fc-YOUR_KEY")

# Crawl documentation site
crawl_result = app.crawl_url(
    "https://docs.example.com",
    params={"limit": 200, "maxDepth": 3, "includePaths": ["/docs/*"]}
)

# Chunk and index
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_chunks = []

for page in crawl_result["data"]:
    chunks = splitter.split_text(page["markdown"])
    for chunk in chunks:
        all_chunks.append({
            "text": chunk,
            "source": page["metadata"]["sourceURL"],
            "title": page["metadata"].get("title", "")
        })

# Store in vector DB
texts = [c["text"] for c in all_chunks]
metadatas = [{"source": c["source"], "title": c["title"]} for c in all_chunks]

vectorstore = Chroma.from_texts(
    texts=texts,
    metadatas=metadatas,
    embedding=OpenAIEmbeddings()
)

print(f"Indexed {len(all_chunks)} chunks from {len(crawl_result['data'])} pages")

Monitoring and Change Detection

import hashlib
import json
from datetime import datetime

def monitor_pages(urls, app, state_file="page_state.json"):
    try:
        with open(state_file) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {}

    changes = []
    for url in urls:
        result = app.scrape_url(url, params={"formats": ["markdown"]})
        content = result["markdown"]
        content_hash = hashlib.md5(content.encode()).hexdigest()

        if url in state and state[url]["hash"] != content_hash:
            changes.append({
                "url": url,
                "changed_at": datetime.now().isoformat(),
                "old_hash": state[url]["hash"],
                "new_hash": content_hash
            })

        state[url] = {"hash": content_hash, "checked": datetime.now().isoformat()}

    with open(state_file, "w") as f:
        json.dump(state, f)

    return changes

Self-Hosted Configuration

# .env for self-hosted
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
SUPABASE_URL=http://supabase:54321
USE_DB_AUTHENTICATION=false
SCRAPING_BEE_API_KEY=optional
OPENAI_API_KEY=sk-...  # For extract endpoint

Troubleshooting

IssueSolution
401 UnauthorizedCheck API key is valid and has credits
Page returns emptySite may block scrapers; try waitFor param
JavaScript not renderingIncrease waitFor timeout (3000-5000ms)
Crawl missing pagesCheck includePaths, increase maxDepth
Rate limitedReduce concurrent requests, add delays
Timeout errorsIncrease timeout param, check site availability
Extract returns nullVerify JSON schema matches page content
Self-hosted Redis errorCheck Redis connection in .env
# Test API key
curl https://api.firecrawl.dev/v1/scrape \
  -H "Authorization: Bearer fc-YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'

# Check self-hosted health
curl http://localhost:3002/health

# View crawl logs
docker compose logs -f api