Overview
GROBID (GeneRation Of BIbliographic Data) is a machine learning library for parsing and extracting structured information from scholarly and technical documents. It converts raw PDF papers into structured TEI XML, extracting headers (title, authors, affiliations, abstract), bibliographic references, full body text with section hierarchy, figures, tables, and mathematical formulas. GROBID uses a cascade of sequence labeling models (CRF and deep learning), each trained on manually annotated scholarly documents.
The tool is the de facto standard for large-scale scholarly document processing, used by academic search engines, digital libraries, and research analytics platforms. It exposes a REST API for easy integration and can process thousands of documents per hour on a single server.
Installation
Docker (Recommended)
docker run --rm -d \
  --name grobid \
  -p 8070:8070 \
  lfoppiano/grobid:0.8.1
# With deep learning models (better accuracy)
docker run --rm -d \
  --name grobid \
  -p 8070:8070 \
  lfoppiano/grobid:0.8.1-full
# Verify
curl http://localhost:8070/api/isalive
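The container can take a while to load its models, so a script that starts GROBID and posts documents right away may see connection errors. The sketch below polls the `isalive` endpoint until it answers. The function names, the attempt count, and the delay are illustrative assumptions, not part of GROBID or its client.

```python
import time
import urllib.error
import urllib.request


def is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if GET <base_url>/api/isalive answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_alive(probe=is_alive, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll `probe` until it returns True; return the number of tries used."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"GROBID did not respond after {attempts} attempts")
```

Calling `wait_until_alive()` after `docker run` blocks until the server is ready or raises `TimeoutError`. The `probe` and `sleep` arguments exist only so the loop can be exercised without a running server.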
From Source
git clone https://github.com/kermitt2/grobid.git
cd grobid
./gradlew clean install
# Start server
./gradlew run
# Server starts at http://localhost:8070
Python Client
pip install grobid_client_python
Core API Endpoints
Process Full Document
# Full document processing (headers + body + references)
curl -v -X POST http://localhost:8070/api/processFulltextDocument \
  -F "input=@paper.pdf" \
  -F "consolidateHeader=1" \
  -F "consolidateCitations=1" \
  -F "includeRawCitations=1" \
  -F "teiCoordinates=persName" \
  -F "teiCoordinates=figure" \
  -F "teiCoordinates=ref" \
  -F "teiCoordinates=formula" \
  -o output.tei.xml
# Header only (fast)
curl -X POST http://localhost:8070/api/processHeaderDocument \
  -F "input=@paper.pdf" \
  -o header.tei.xml
# References only
curl -X POST http://localhost:8070/api/processReferences \
  -F "input=@paper.pdf" \
  -o references.tei.xml
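The same full-text request can be sent from Python without third-party packages. This is a minimal sketch using only the standard library; `encode_multipart` and `process_fulltext` are hypothetical helper names, and the sketch sends one `teiCoordinates` field per element name, which is the form shown in GROBID's documentation.

```python
import os
import urllib.request
import uuid


def encode_multipart(fields, file_field, filename, file_bytes):
    """Build a multipart/form-data body; returns (body_bytes, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields:  # list of pairs, so a field name may repeat
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode("utf-8")
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def process_fulltext(pdf_path, base_url="http://localhost:8070", timeout=120):
    """POST one PDF to processFulltextDocument and return the TEI XML as text."""
    fields = [
        ("consolidateHeader", "1"),
        ("consolidateCitations", "1"),
        ("includeRawCitations", "1"),
    ] + [("teiCoordinates", name) for name in ("persName", "figure", "ref", "formula")]
    with open(pdf_path, "rb") as fh:
        body, content_type = encode_multipart(
            fields, "input", os.path.basename(pdf_path), fh.read()
        )
    request = urllib.request.Request(
        f"{base_url}/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With a server running, `process_fulltext("paper.pdf")` returns the TEI document as a string. The 120-second timeout is an arbitrary choice for this sketch.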
API Endpoints
| Endpoint | Description | Output |
|---|---|---|
| /api/processFulltextDocument | Full document parsing | TEI XML (complete) |
| /api/processHeaderDocument | Header extraction only | TEI XML (header) |
| /api/processReferences | Citation extraction | TEI XML (references) |
| /api/processCitation | Parse raw citation string | TEI XML (biblStruct) |
| /api/processDate | Parse date strings | TEI XML (date) |
| /api/processAffiliations | Parse affiliation strings | TEI XML (affiliation) |
| /api/referenceAnnotations | Annotate PDF with reference links | JSON annotations |
| /api/isalive | Health check | HTTP 200 |
Parameters
| Parameter | Description | Values |
|---|---|---|
| consolidateHeader | Consolidate header metadata via CrossRef | 0 (off), 1 (on, inject all matched metadata), 2 (on, inject DOI only) |
| consolidateCitations | Consolidate citations via CrossRef | 0 (off), 1 (on, inject all matched metadata), 2 (on, inject DOI only) |
| includeRawCitations | Include raw citation strings | 0, 1 |
| includeRawAffiliations | Include raw affiliation strings | 0, 1 |
| teiCoordinates | Add bounding box coordinates | Element names, one form field per element (repeat the parameter) |
| segmentSentences | Sentence segmentation | 0, 1 |
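When `teiCoordinates` is set, the selected elements carry a `coords` attribute. The sketch below assumes the layout described in GROBID's coordinates documentation: each box is `page,x,y,width,height`, and multiple boxes for one element are separated by `;`. `Box` and `parse_coords` are illustrative names, not GROBID APIs.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    page: int      # page number as written in the attribute
    x: float       # upper-left corner, horizontal position
    y: float       # upper-left corner, vertical position
    width: float
    height: float


def parse_coords(coords: str) -> List[Box]:
    """Split a `coords` attribute value into one Box per rectangle."""
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        page, x, y, width, height = chunk.split(",")
        boxes.append(Box(int(page), float(x), float(y), float(width), float(height)))
    return boxes
```

An element that spans two lines typically yields two boxes, so callers that need a single region should merge the list themselves.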
Python Client
from grobid_client.grobid_client import GrobidClient
client = GrobidClient(config_path="./grobid_config.json")
# Process every PDF under ./papers/ (input_path is a directory, not a single file)
client.process(
    service="processFulltextDocument",
    input_path="./papers/",
    output="./output/",
    consolidate_header=True,
    consolidate_citations=True,
    include_raw_citations=True,
    force=True,  # Reprocess PDFs that already have output
    n=10,  # Concurrent requests
)
Config File
{
  "grobid_server": "http://localhost:8070",
  "batch_size": 1000,
  "sleep_time": 5,
  "timeout": 60,
  "coordinates": ["persName", "figure", "ref", "formula", "s"]
}
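A typo in this file usually surfaces only once the first request fails, so it can help to check it up front. The following is a small sketch that loads the file, fills in missing keys from the values shown above, and rejects obviously invalid settings. `DEFAULTS` and `load_config` are hypothetical names, and the validation rules are assumptions rather than behavior of the Python client.

```python
import json

# Fallback values, copied from the example config above.
DEFAULTS = {
    "grobid_server": "http://localhost:8070",
    "batch_size": 1000,
    "sleep_time": 5,
    "timeout": 60,
    "coordinates": ["persName", "figure", "ref", "formula", "s"],
}


def load_config(path):
    """Read the client config, fill in missing keys, and sanity-check values."""
    with open(path, encoding="utf-8") as fh:
        config = {**DEFAULTS, **json.load(fh)}
    if not config["grobid_server"].startswith(("http://", "https://")):
        raise ValueError("grobid_server must be an http(s) URL")
    for key in ("batch_size", "sleep_time", "timeout"):
        if not isinstance(config[key], (int, float)) or config[key] <= 0:
            raise ValueError(f"{key} must be a positive number")
    # Drop a trailing slash so later URL joins do not produce "//api/...".
    config["grobid_server"] = config["grobid_server"].rstrip("/")
    return config
```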
Parsing TEI XML Output
from lxml import etree
# Parse GROBID TEI XML output
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
tree = etree.parse("output.tei.xml")
root = tree.getroot()
def text_of(el):
    """Return the full text of an element, or "" when the element is missing."""
    return "".join(el.itertext()).strip() if el is not None else ""
# Extract title
title = root.find(".//tei:titleStmt/tei:title", ns)
print(f"Title: {text_of(title)}")
# Extract abstract (the <p> may be nested inside a <div>, so search all levels)
abstract = root.find(".//tei:profileDesc/tei:abstract//tei:p", ns)
print(f"Abstract: {text_of(abstract)}")
# Extract authors of the paper itself (header only, not cited works)
authors = root.findall(".//tei:fileDesc/tei:sourceDesc/tei:biblStruct/tei:analytic/tei:author", ns)
for author in authors:
    forename = author.find(".//tei:forename", ns)
    surname = author.find(".//tei:surname", ns)
    if forename is not None and surname is not None:
        print(f"Author: {text_of(forename)} {text_of(surname)}")
# Extract references
refs = root.findall(".//tei:listBibl/tei:biblStruct", ns)
for ref in refs:
    ref_title = ref.find(".//tei:analytic/tei:title", ns)
    if ref_title is not None:
        print(f"Reference: {text_of(ref_title)}")
# Extract body sections
divs = root.findall(".//tei:body/tei:div", ns)
for div in divs:
    head = div.find("tei:head", ns)
    if head is not None:
        print(f"\nSection: {text_of(head)}")
    for p in div.findall("tei:p", ns):
        text = etree.tostring(p, method="text", encoding="unicode")
        print(f"  {text[:100]}...")
Configuration
Server Configuration
# grobid-home/config/grobid.yaml
# Simplified layout: check key names and nesting against the grobid.yaml shipped with your release
grobid:
  server:
    port: 8070
  modelPreload: true
  consolidation:
    enabled: true
    service: crossref
  proxy:
    host: null
    port: 0
  pdf:
    pdfalto:
      timeout: 60
  models:
    header:
      engine: delft
      architecture: BidLSTM_CRF
    fulltext:
      engine: delft
      architecture: BidLSTM_CRF
    citation:
      engine: delft
      architecture: BidLSTM_CRF
    segmentation:
      engine: delft
      architecture: BidLSTM_CRF
Docker Resource Limits
# docker-compose.yml
version: '3.8'
services:
  grobid:
    image: lfoppiano/grobid:0.8.1-full
    ports:
      - "8070:8070"
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: '4'
    environment:
      - JAVA_OPTS=-Xmx6g -Xms2g
Advanced Usage
Batch Processing
import glob
from grobid_client.grobid_client import GrobidClient
client = GrobidClient(config_path="./grobid_config.json")
# Process all PDFs in the directory
client.process(
    service="processFulltextDocument",
    input_path="./papers/",
    output="./tei_output/",
    consolidate_header=True,
    consolidate_citations=True,
    n=20,  # 20 concurrent requests
    force=True,  # Reprocess PDFs that already have output
)
# Count TEI files written to the output directory
tei_files = glob.glob("./tei_output/*.xml")
print(f"Processed {len(tei_files)} documents")
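A file count does not show which PDFs failed. The sketch below compares the input and output directories and lists the PDFs that have no TEI file. It assumes the client names each output `<pdf name>.grobid.tei.xml`; if your client version uses a different suffix, change the `suffix` argument. `missing_outputs` is a hypothetical helper, not part of the client.

```python
from pathlib import Path


def missing_outputs(pdf_dir, tei_dir, suffix=".grobid.tei.xml"):
    """List PDFs under pdf_dir that have no matching TEI file under tei_dir."""
    # Strip the suffix from each output name to recover the original PDF stem.
    done = {p.name[: -len(suffix)] for p in Path(tei_dir).rglob(f"*{suffix}")}
    return sorted(p for p in Path(pdf_dir).rglob("*.pdf") if p.stem not in done)
```

Running `missing_outputs("./papers/", "./tei_output/")` after a batch gives the list to retry or inspect by hand. Matching is by file name only, so two PDFs with the same name in different subdirectories are treated as one.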
Parse Raw Citation Strings
# Parse a raw citation string
curl -X POST http://localhost:8070/api/processCitation \
  -d "citations=Vaswani et al. Attention Is All You Need. NeurIPS 2017." \
  -H "Content-Type: application/x-www-form-urlencoded"
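The same call from Python, as a standard-library sketch. `parse_citation` and `citation_title` are hypothetical helper names. The title lookup accepts elements with or without the TEI namespace, on the assumption that the returned `biblStruct` fragment may omit the namespace declaration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def parse_citation(raw, base_url="http://localhost:8070", timeout=30):
    """POST one raw citation string and return the response XML as text."""
    # urlencode handles spaces and punctuation; urllib sets the
    # application/x-www-form-urlencoded Content-Type for us.
    data = urllib.parse.urlencode({"citations": raw}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/processCitation", data=data, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def citation_title(bibl_xml):
    """Return the text of the first <title> in a biblStruct, or "" if none."""
    root = ET.fromstring(bibl_xml)
    for el in root.iter():
        if el.tag in ("title", f"{{{TEI_NS}}}title"):
            return "".join(el.itertext()).strip()
    return ""
```

With a server running, `citation_title(parse_citation("Vaswani et al. Attention Is All You Need. NeurIPS 2017."))` returns whatever title GROBID recognized in the string.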
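# Python helper: collect metadata from one GROBID TEI file into a dict
from lxml import etree
def extract_metadata(tei_path):
    ns = {"tei": "http://www.tei-c.org/ns/1.0"}
    tree = etree.parse(tei_path)
    root = tree.getroot()
    metadata = {
        "title": "",
        "authors": [],
        "abstract": "",
        "references": [],
        "sections": []
    }
    # Title
    title_el = root.find(".//tei:titleStmt/tei:title", ns)
    if title_el is not None:
        metadata["title"] = title_el.text or ""
    # Authors (restricted to the header so authors of cited works are excluded)
    for author in root.findall(".//tei:teiHeader//tei:analytic/tei:author", ns):
        name_parts = []
        for fn in author.findall(".//tei:forename", ns):
            if fn.text:
                name_parts.append(fn.text)
        for sn in author.findall(".//tei:surname", ns):
            if sn.text:
                name_parts.append(sn.text)
        if name_parts:
            metadata["authors"].append(" ".join(name_parts))
    # Abstract
    abstract_el = root.find(".//tei:abstract", ns)
    if abstract_el is not None:
        metadata["abstract"] = etree.tostring(abstract_el, method="text", encoding="unicode").strip()
    # References (titles only)
    for ref_title in root.findall(".//tei:listBibl/tei:biblStruct/tei:analytic/tei:title", ns):
        metadata["references"].append("".join(ref_title.itertext()).strip())
    # Section headings
    for head in root.findall(".//tei:body/tei:div/tei:head", ns):
        metadata["sections"].append("".join(head.itertext()).strip())
    return metadata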
Troubleshooting
| Issue | Solution |
|---|---|
| Server won’t start | Check that Java 11+ is installed; increase the heap, e.g. -Xmx4g |
| Timeout on large PDFs | Increase the timeout in the config, or use processHeaderDocument when only header metadata is needed |
| Poor extraction quality | Use the -full Docker image with deep learning models |
| OutOfMemoryError | Increase the Java heap: JAVA_OPTS=-Xmx8g |
| CrossRef consolidation slow | Disable consolidation or use a local cache |
| Empty TEI output | Check that the PDF is not corrupted; test with a different PDF to rule out a server problem |
| Port 8070 in use | Change the port in the config or in the Docker port mapping |
| High CPU under load | Limit concurrent requests and set container resource limits |
# Health check
curl http://localhost:8070/api/isalive
# View version
curl http://localhost:8070/api/version
# Check Docker logs
docker logs grobid
# Test with sample
curl -X POST http://localhost:8070/api/processHeaderDocument \
  -F "input=@test.pdf" | head -50
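Under load, GROBID answers HTTP 503 when all of its worker threads are busy, and the caller is expected to wait and try again. A minimal sketch of that retry loop with exponential backoff follows. `call_with_retry` is a hypothetical helper, and the attempt count and delays are arbitrary starting points.

```python
import time


def call_with_retry(send, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call send() until it returns a status other than 503.

    `send` is any zero-argument callable returning (status_code, body).
    Returns the last (status_code, body) pair, even if it is still a 503.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            # Double the wait after every busy response: 2s, 4s, 8s, ...
            sleep(base_delay * 2 ** attempt)
    return status, body
```

Wrapping each request this way, together with a lower number of concurrent requests, keeps a busy server from being flooded with immediate retries.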