GROBID Cheat Sheet

Overview

GROBID (GeneRation Of BIbliographic Data) is a machine learning library for parsing scholarly and technical documents and extracting structured information from them. It converts raw PDF papers into structured TEI XML, extracting headers (title, authors, affiliations, abstract), bibliographic references, full body text with section hierarchy, figures, tables, and mathematical formulas. GROBID uses a cascade of sequence-labeling models (CRF and, optionally, deep learning) trained on manually annotated scholarly documents.

GROBID is widely used for large-scale scholarly document processing by academic search engines, digital libraries, and research analytics platforms. It exposes a REST API for easy integration and can process thousands of documents per hour on a single server.

Installation

Docker (Recommended)

docker run --rm -d \
  --name grobid \
  -p 8070:8070 \
  lfoppiano/grobid:0.8.1

# With deep learning models (better accuracy)
docker run --rm -d \
  --name grobid \
  -p 8070:8070 \
  grobid/grobid:0.8.1

# Verify
curl http://localhost:8070/api/isalive
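
The container can take a while to load its models, so a script that starts GROBID and immediately sends PDFs may hit a server that is not listening yet. The sketch below polls `/api/isalive` until it answers. `wait_until_alive`, `http_probe`, and their parameters are hypothetical helper names for illustration, not part of GROBID or its client.

```python
import time
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return False


def wait_until_alive(base_url="http://localhost:8070", attempts=30, delay=2.0,
                     probe=http_probe, sleep=time.sleep) -> bool:
    """Poll <base_url>/api/isalive until it answers or attempts run out."""
    url = base_url.rstrip("/") + "/api/isalive"
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_alive()` right after `docker run` blocks for at most roughly `attempts * delay` seconds (plus probe time) and returns `False` if the server never came up.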

From Source

git clone https://github.com/kermitt2/grobid.git
cd grobid
./gradlew clean install

# Start server
./gradlew run
# Server starts at http://localhost:8070

Python Client

pip install grobid_client_python

Core API Endpoints

Process Full Document

# Full document processing (headers + body + references)
curl -v -X POST http://localhost:8070/api/processFulltextDocument \
  -F "input=@paper.pdf" \
  -F "consolidateHeader=1" \
  -F "consolidateCitations=1" \
  -F "includeRawCitations=1" \
  -F "teiCoordinates=persName" \
  -F "teiCoordinates=figure" \
  -F "teiCoordinates=ref" \
  -F "teiCoordinates=formula" \
  -o output.tei.xml

# Header only (fast)
curl -X POST http://localhost:8070/api/processHeaderDocument \
  -F "input=@paper.pdf" \
  -o header.tei.xml

# References only
curl -X POST http://localhost:8070/api/processReferences \
  -F "input=@paper.pdf" \
  -o references.tei.xml
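
The same upload can be done from Python without third-party packages. This is a minimal sketch using only the standard library: it encodes the form fields and the PDF as `multipart/form-data`, which is what `curl -F` does. `build_multipart` and `process_fulltext` are hypothetical helper names, and the 120-second timeout is an arbitrary assumption.

```python
import uuid
import urllib.request
from pathlib import Path


def build_multipart(fields: dict, file_field: str, filename: str, payload: bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        f'Content-Type: application/pdf\r\n\r\n'.encode("utf-8")
    )
    parts.append(payload)
    parts.append(f'\r\n--{boundary}--\r\n'.encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def process_fulltext(pdf_path, server="http://localhost:8070", **fields) -> str:
    """POST one PDF to /api/processFulltextDocument and return the TEI XML text."""
    pdf = Path(pdf_path)
    body, content_type = build_multipart(fields, "input", pdf.name, pdf.read_bytes())
    req = urllib.request.Request(
        server.rstrip("/") + "/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.read().decode("utf-8")
```

For example, `process_fulltext("paper.pdf", consolidateHeader="1")` returns the TEI document as a string when a server is running locally.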

API Endpoints

Endpoint	Description	Output
`/api/processFulltextDocument`	Full document parsing	TEI XML (complete)
`/api/processHeaderDocument`	Header extraction only	TEI XML (header)
`/api/processReferences`	Citation extraction	TEI XML (references)
`/api/processCitation`	Parse raw citation string	TEI XML (biblStruct)
`/api/processDate`	Parse date strings	TEI XML (date)
`/api/processAffiliations`	Parse affiliation strings	TEI XML (affiliation)
`/api/referenceAnnotations`	Reference and citation annotations with PDF coordinates	JSON annotations
`/api/isalive`	Health check	HTTP 200

Parameters

Parameter	Description	Values
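`consolidateHeader`	Consolidate header metadata via CrossRef or biblio-glutton	0 (off), 1 (inject all metadata), 2 (inject DOI only), 3 (look up by extracted DOI only)
`consolidateCitations`	Consolidate citations via CrossRef or biblio-glutton	0 (off), 1 (inject all metadata), 2 (inject DOI only)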
`includeRawCitations`	Include raw citation strings	0, 1
`includeRawAffiliations`	Include raw affiliations	0, 1
`teiCoordinates`	Add bounding box coords	Element names (`persName`, `figure`, `ref`, `biblStruct`, `formula`, `s`), one form field per element
`segmentSentences`	Sentence segmentation	0, 1
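
When requests are built in code, it helps to validate these parameters before sending anything. The sketch below turns keyword options into form fields and rejects unknown names or out-of-range values. `build_form_fields` and `ALLOWED_VALUES` are hypothetical names, the value `3` for `consolidateHeader` follows the GROBID service documentation, and `teiCoordinates` is repeated once per element as in that documentation's curl examples.

```python
# Allowed values per request parameter (strings, as sent in the form).
ALLOWED_VALUES = {
    "consolidateHeader": {"0", "1", "2", "3"},
    "consolidateCitations": {"0", "1", "2"},
    "includeRawCitations": {"0", "1"},
    "includeRawAffiliations": {"0", "1"},
    "segmentSentences": {"0", "1"},
}


def build_form_fields(tei_coordinates=(), **options):
    """Return a list of (name, value) pairs ready to send as form fields."""
    fields = []
    for name, value in options.items():
        if name not in ALLOWED_VALUES:
            raise ValueError(f"unknown parameter: {name}")
        text = str(int(value))  # accepts True/False as well as 0/1/2
        if text not in ALLOWED_VALUES[name]:
            raise ValueError(f"{name}={text} is not one of {sorted(ALLOWED_VALUES[name])}")
        fields.append((name, text))
    # One teiCoordinates field per element name.
    fields.extend(("teiCoordinates", element) for element in tei_coordinates)
    return fields
```

A list of pairs is used rather than a dict so that `teiCoordinates` can appear several times.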

Python Client

from grobid_client.grobid_client import GrobidClient

client = GrobidClient(config_path="./grobid_config.json")

# Process every PDF found under input_path
client.process(
    service="processFulltextDocument",
    input_path="./papers/",
    output="./output/",
    consolidate_header=True,
    consolidate_citations=True,
    include_raw_citations=True,
    force=True,
    n=10  # Concurrent requests
)

Config File

{
  "grobid_server": "http://localhost:8070",
  "batch_size": 1000,
  "sleep_time": 5,
  "timeout": 60,
  "coordinates": ["persName", "figure", "ref", "formula", "s"]
}
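
If the config file is generated by a script (for example, one file per environment), a small writer avoids hand-editing JSON. This is a sketch only: `write_config` and `DEFAULT_CONFIG` are hypothetical names, and the defaults simply repeat the values shown above.

```python
import json
from pathlib import Path

# Same keys and values as the config file shown above.
DEFAULT_CONFIG = {
    "grobid_server": "http://localhost:8070",
    "batch_size": 1000,
    "sleep_time": 5,
    "timeout": 60,
    "coordinates": ["persName", "figure", "ref", "formula", "s"],
}


def write_config(path, **overrides) -> dict:
    """Merge overrides into the defaults and save the result as JSON."""
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise KeyError(f"unexpected config keys: {sorted(unknown)}")
    config = {**DEFAULT_CONFIG, **overrides}
    Path(path).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For instance, `write_config("grobid_config.json", timeout=180)` raises the client timeout for large PDFs while keeping the other values.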

Parsing TEI XML Output

from lxml import etree

# Parse GROBID TEI XML output
ns = {"tei": "http://www.tei-c.org/ns/1.0"}

tree = etree.parse("output.tei.xml")
root = tree.getroot()

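# Extract title
title = root.find(".//tei:titleStmt/tei:title", ns)
if title is not None and title.text:
    print(f"Title: {title.text}")

# Extract abstract (paragraphs may sit inside a nested <div>, so take all descendant text)
abstract = root.find(".//tei:profileDesc/tei:abstract", ns)
if abstract is not None:
    abstract_text = etree.tostring(abstract, method="text", encoding="unicode").strip()
    print(f"Abstract: {abstract_text}")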

# Extract authors
authors = root.findall(".//tei:fileDesc/tei:sourceDesc/tei:biblStruct/tei:analytic/tei:author", ns)
for author in authors:
    forename = author.find(".//tei:forename", ns)
    surname = author.find(".//tei:surname", ns)
    if forename is not None and surname is not None:
        print(f"Author: {forename.text} {surname.text}")

# Extract references
refs = root.findall(".//tei:listBibl/tei:biblStruct", ns)
for ref in refs:
    ref_title = ref.find(".//tei:analytic/tei:title", ns)
    if ref_title is not None:
        print(f"Reference: {ref_title.text}")

# Extract body sections
divs = root.findall(".//tei:body/tei:div", ns)
for div in divs:
    head = div.find("tei:head", ns)
    if head is not None:
        print(f"\nSection: {head.text}")
    for p in div.findall("tei:p", ns):
        text = etree.tostring(p, method="text", encoding="unicode")
        print(f"  {text[:100]}...")
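
For downstream use, references are usually more convenient as plain records than as printed titles. The sketch below collects one dict per `biblStruct`. It uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra packages. `parse_references` is a hypothetical helper name, and `SAMPLE_TEI` is a hand-written fragment shaped like GROBID output (with a placeholder DOI), not the result of a real run.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hand-written sample for illustration only; the DOI is a placeholder.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div type="references"><listBibl>
    <biblStruct xml:id="b0">
      <analytic>
        <title level="a" type="main">Attention Is All You Need</title>
        <author><persName><forename type="first">Ashish</forename><surname>Vaswani</surname></persName></author>
        <idno type="DOI">10.0000/example</idno>
      </analytic>
      <monogr>
        <title level="m">NeurIPS</title>
        <imprint><date type="published" when="2017" /></imprint>
      </monogr>
    </biblStruct>
  </listBibl></div></back></text>
</TEI>"""


def parse_references(root):
    """Return one dict per <biblStruct> found under <listBibl>."""
    records = []
    for bibl in root.findall(".//tei:listBibl/tei:biblStruct", TEI_NS):
        # Article titles live in <analytic>; books only have <monogr>.
        title = bibl.find("tei:analytic/tei:title", TEI_NS)
        if title is None:
            title = bibl.find("tei:monogr/tei:title", TEI_NS)
        date = bibl.find(".//tei:imprint/tei:date", TEI_NS)
        doi = bibl.find(".//tei:idno[@type='DOI']", TEI_NS)
        surnames = []
        for author in bibl.findall(".//tei:author", TEI_NS):
            surname = author.find(".//tei:surname", TEI_NS)
            if surname is not None and surname.text:
                surnames.append(surname.text)
        year = (date.get("when", "")[:4] or None) if date is not None else None
        records.append({
            "id": bibl.get(XML_ID),
            "title": "".join(title.itertext()).strip() if title is not None else None,
            "year": year,
            "doi": doi.text if doi is not None else None,
            "authors": surnames,
        })
    return records
```

With a real file, pass `ET.parse("output.tei.xml").getroot()` instead of the parsed sample.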

Configuration

Server Configuration

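# grobid-home/config/grobid.yaml (excerpt)
grobid:
  pdf:
    pdfalto:
      timeoutSec: 60

  consolidation:
    service: "crossref"    # or "glutton"

  proxy:
    host:
    port:

  models:
    - name: "segmentation"
      engine: "wapiti"     # CRF only: input sequences are too long for the deep learning engine
    - name: "fulltext"
      engine: "wapiti"     # CRF only, same reason
    - name: "header"
      engine: "delft"
      delft:
        architecture: "BidLSTM_CRF"
    - name: "citation"
      engine: "delft"
      delft:
        architecture: "BidLSTM_CRF"

  # Load every model at server startup instead of on first use
  modelPreload: true

server:
  applicationConnectors:
    - type: http
      port: 8070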

Docker Resource Limits

# docker-compose.yml
version: '3.8'
services:
  grobid:
    image: grobid/grobid:0.8.1
    ports:
      - "8070:8070"
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: '4'
    environment:
      - JAVA_OPTS=-Xmx6g -Xms2g

Advanced Usage

Batch Processing

import glob
from grobid_client.grobid_client import GrobidClient

client = GrobidClient(config_path="./grobid_config.json")

# Process all PDFs in directory
client.process(
    service="processFulltextDocument",
    input_path="./papers/",
    output="./tei_output/",
    consolidate_header=True,
    consolidate_citations=True,
    n=20,          # 20 concurrent requests
    force=True     # Reprocess existing
)

# Count processed files
tei_files = glob.glob("./tei_output/*.xml")
print(f"Processed {len(tei_files)} documents")
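
A raw file count does not show which PDFs failed (for example after a timeout). The sketch below lists input PDFs that have no matching TEI file. `pending_pdfs` is a hypothetical helper, and it assumes the output tree mirrors the input tree and that output files are named `<stem>.grobid.tei.xml`; check your output directory and adjust `tei_suffix` if your client version names files differently.

```python
from pathlib import Path


def pending_pdfs(input_dir, output_dir, tei_suffix=".grobid.tei.xml"):
    """List PDFs under input_dir with no matching TEI file in output_dir.

    tei_suffix is an assumption about how the client names its output.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    missing = []
    for pdf in sorted(input_dir.rglob("*.pdf")):
        # The output tree is assumed to mirror the input tree.
        expected = (output_dir / pdf.relative_to(input_dir)).with_name(pdf.stem + tei_suffix)
        if not expected.exists():
            missing.append(pdf)
    return missing
```

Running `pending_pdfs("./papers/", "./tei_output/")` after a batch gives the list of files to inspect or resubmit.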

Parse Raw Citation Strings

# Parse a raw citation string
curl -X POST http://localhost:8070/api/processCitation \
  -d "citations=Vaswani et al. Attention Is All You Need. NeurIPS 2017." \
  -H "Content-Type: application/x-www-form-urlencoded"
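
Citation strings often contain characters such as `&`, `+`, or `=` that must be percent-encoded in a form body. The sketch below builds the same request in Python with the standard library, letting `urlencode` handle the escaping. `citation_request` is a hypothetical helper name; it only constructs the request and does not send it.

```python
import urllib.request
from urllib.parse import urlencode


def citation_request(raw_citation, server="http://localhost:8070", consolidate=0):
    """Build (not send) a POST request for /api/processCitation."""
    body = urlencode({
        "citations": raw_citation,
        "consolidateCitations": str(consolidate),
    }).encode("ascii")
    return urllib.request.Request(
        server.rstrip("/") + "/api/processCitation",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and read the TEI fragment from the response.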

Extract Structured Metadata

from lxml import etree

def extract_metadata(tei_path):
    ns = {"tei": "http://www.tei-c.org/ns/1.0"}
    tree = etree.parse(tei_path)
    root = tree.getroot()

    metadata = {
        "title": "",
        "authors": [],
        "abstract": "",
        "references": [],
        "sections": []
    }

    # Title
    title_el = root.find(".//tei:titleStmt/tei:title", ns)
    if title_el is not None:
        metadata["title"] = title_el.text or ""

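    # Authors (scoped to the document header so authors of cited works are excluded)
    for author in root.findall(".//tei:sourceDesc/tei:biblStruct/tei:analytic/tei:author", ns):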
        name_parts = []
        for fn in author.findall(".//tei:forename", ns):
            if fn.text: name_parts.append(fn.text)
        for sn in author.findall(".//tei:surname", ns):
            if sn.text: name_parts.append(sn.text)
        if name_parts:
            metadata["authors"].append(" ".join(name_parts))

    # Abstract
    abstract_el = root.find(".//tei:abstract", ns)
    if abstract_el is not None:
        metadata["abstract"] = etree.tostring(abstract_el, method="text", encoding="unicode").strip()

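    # References (title of each cited work, when one was extracted)
    for ref in root.findall(".//tei:listBibl/tei:biblStruct", ns):
        ref_title = ref.find(".//tei:title", ns)
        if ref_title is not None and ref_title.text:
            metadata["references"].append(ref_title.text)

    # Sections (heading text of each top-level body division)
    for head in root.findall(".//tei:body/tei:div/tei:head", ns):
        if head.text:
            metadata["sections"].append(head.text)

    return metadata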

Troubleshooting

Issue	Solution
Server won’t start	Check Java 11+ installed, increase heap: `-Xmx4g`
Timeout on large PDFs	Increase `timeout` in the client config; use `processHeaderDocument` when only header metadata is needed
Poor extraction quality	Use the full Docker image (`grobid/grobid:0.8.1`) with deep learning models
OutOfMemoryError	Increase Java heap: `JAVA_OPTS=-Xmx8g`
CrossRef consolidation slow	Disable consolidation or use local cache
Empty TEI output	Check PDF is not corrupted, try different PDF
Port 8070 in use	Change port in config or Docker mapping
High CPU under load	Limit concurrent requests, add resource limits
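
Under load, the GROBID service answers HTTP 503 when all of its processing threads are busy, and the expected client behavior is to wait and try again. The sketch below wraps any request function in an exponential backoff loop. `call_with_retry` and its parameters are hypothetical names, and the delays are arbitrary assumptions to tune for your setup.

```python
import time
import urllib.error


def call_with_retry(send, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call send() and retry with exponential backoff while it raises HTTP 503."""
    for attempt in range(max_attempts):
        try:
            return send()
        except urllib.error.HTTPError as err:
            # Only 503 (server busy) is retried; other errors propagate.
            if err.code != 503 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Here `send` is any zero-argument callable that performs one request with `urllib.request.urlopen`, for example a `lambda` around a prepared `Request` object.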

# Health check
curl http://localhost:8070/api/isalive

# View version
curl http://localhost:8070/api/version

# Check Docker logs
docker logs grobid

# Test with sample
curl -X POST http://localhost:8070/api/processHeaderDocument \
  -F "input=@test.pdf" | head -50