GROBID Cheat Sheet

Overview

GROBID (GeneRation Of BIbliographic Data) is a machine learning library for parsing scholarly and technical documents and extracting structured information from them. It converts raw PDFs into structured TEI XML, extracting header metadata (title, authors, affiliations, abstract), bibliographic references, and the full body text with its section hierarchy, figures, tables, and mathematical formulas. GROBID uses a cascade of sequence labeling models, either CRF or deep learning, trained on annotated scientific articles.

The tool is the de facto standard for large-scale scholarly document processing, used by academic search engines, digital libraries, and research analytics platforms. It exposes a REST API for easy integration and can process thousands of documents per hour on a single server.

Installation

Docker

# Lightweight image (CRF models only)
docker run --rm -d \
  --name grobid \
  -p 8070:8070 \
  lfoppiano/grobid:0.8.1

# With deep learning models (better accuracy)
docker run --rm -d \
  --name grobid \
  -p 8070:8070 \
  lfoppiano/grobid:0.8.1-full

# Verify
curl http://localhost:8070/api/isalive
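
The container can take a while to load its models, so a script that starts GROBID and immediately sends PDFs may hit connection errors. The sketch below polls the isalive endpoint until it answers with HTTP 200. The function name wait_until_alive and its opener and sleep parameters are illustrative choices (they make the helper testable without a server), not part of GROBID.

```python
import time
import urllib.error
import urllib.request


def wait_until_alive(base_url="http://localhost:8070", timeout=120.0,
                     interval=2.0, opener=urllib.request.urlopen,
                     sleep=time.sleep):
    """Poll <base_url>/api/isalive until it returns HTTP 200 or time runs out."""
    url = base_url.rstrip("/") + "/api/isalive"
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server is not accepting requests yet
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling wait_until_alive() right after docker run returns True once the server is reachable, or False if it is still down after the timeout.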

From Source

git clone https://github.com/kermitt2/grobid.git
cd grobid
./gradlew clean install

# Start server
./gradlew run
# Server starts at http://localhost:8070

Python Client

pip install grobid_client_python

Core API Endpoints

Process Full Document

# Full document processing (headers + body + references)
curl -v -X POST http://localhost:8070/api/processFulltextDocument \
  -F "input=@paper.pdf" \
  -F "consolidateHeader=1" \
  -F "consolidateCitations=1" \
  -F "includeRawCitations=1" \
  -F "teiCoordinates=persName" \
  -F "teiCoordinates=figure" \
  -F "teiCoordinates=ref" \
  -F "teiCoordinates=formula" \
  -o output.tei.xml

# Header only (fast)
curl -X POST http://localhost:8070/api/processHeaderDocument \
  -F "input=@paper.pdf" \
  -o header.tei.xml

# References only
curl -X POST http://localhost:8070/api/processReferences \
  -F "input=@paper.pdf" \
  -o references.tei.xml
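
When teiCoordinates is set, as in the full-text request above, GROBID adds a coords attribute to the selected TEI elements. Each value is a semicolon-separated list of boxes, and each box is page,x,y,width,height. A minimal sketch for turning that string into typed values follows; the Box type and the parse_coords name are illustrative, not part of GROBID.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    page: int      # 1-based page number
    x: float       # horizontal position of the upper-left corner
    y: float       # vertical position of the upper-left corner
    width: float
    height: float


def parse_coords(coords: str) -> List[Box]:
    """Split a TEI coords attribute into one Box per bounding box."""
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # Tolerate empty input and trailing separators
        page, x, y, width, height = chunk.split(",")
        boxes.append(Box(int(page), float(x), float(y),
                         float(width), float(height)))
    return boxes
```

An element that spans a line break or a page break carries several boxes, which is why the function returns a list.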

API Endpoints

Endpoint | Description | Output
/api/processFulltextDocument | Full document parsing | TEI XML (complete)
/api/processHeaderDocument | Header extraction only | TEI XML (header)
/api/processReferences | Citation extraction | TEI XML (references)
/api/processCitation | Parse raw citation string | TEI XML (biblStruct)
/api/processDate | Parse date strings | TEI XML (date)
/api/processAffiliations | Parse affiliation strings | TEI XML (affiliation)
/api/referenceAnnotations | Annotate PDF with reference links | JSON annotations
/api/isalive | Health check | HTTP 200

Parameters

Parameter | Description | Values
consolidateHeader | Validate header via CrossRef | 0 (off), 1 (merge matched metadata), 2 (add DOI only)
consolidateCitations | Validate citations via CrossRef | 0 (off), 1 (merge matched metadata), 2 (add DOI only)
includeRawCitations | Include raw citation strings | 0, 1
includeRawAffiliations | Include raw affiliations | 0, 1
teiCoordinates | Add bounding box coordinates | Element type; repeat the parameter once per type
segmentSentences | Sentence segmentation | 0, 1
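
The parameters above are sent as ordinary form fields next to the PDF. As a rough sketch, the helper below turns Python arguments into a list of (name, value) pairs and repeats the teiCoordinates field once per element type, which is how GROBID's own documentation examples pass it. The function name build_form_fields and its argument names are hypothetical.

```python
def build_form_fields(consolidate_header=0, consolidate_citations=0,
                      include_raw_citations=False,
                      include_raw_affiliations=False,
                      segment_sentences=False, tei_coordinates=()):
    """Return GROBID form fields as a list of (name, value) string pairs."""
    fields = [
        ("consolidateHeader", str(int(consolidate_header))),
        ("consolidateCitations", str(int(consolidate_citations))),
    ]
    flags = (
        ("includeRawCitations", include_raw_citations),
        ("includeRawAffiliations", include_raw_affiliations),
        ("segmentSentences", segment_sentences),
    )
    for name, enabled in flags:
        if enabled:
            fields.append((name, "1"))
    # One form field per element type, not one comma-joined value
    for element in tei_coordinates:
        fields.append(("teiCoordinates", element))
    return fields
```

With the third-party requests library (an assumption, since the cheat sheet itself only uses curl and the GROBID client), the result could be passed as data=fields together with files={"input": open("paper.pdf", "rb")}.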

Python Client

from grobid_client.grobid_client import GrobidClient

client = GrobidClient(config_path="./grobid_config.json")

# Process every PDF found under input_path
client.process(
    service="processFulltextDocument",
    input_path="./papers/",
    output="./output/",
    consolidate_header=True,
    consolidate_citations=True,
    include_raw_citations=True,
    force=True,
    n=10  # Concurrent requests
)

Config File

{
  "grobid_server": "http://localhost:8070",
  "batch_size": 1000,
  "sleep_time": 5,
  "timeout": 60,
  "coordinates": ["persName", "figure", "ref", "formula", "s"]
}
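
If the config file is generated by a script rather than written by hand, something like the following could create it and check it before the client is started. The helper names and the REQUIRED_KEYS set are illustrative; the set simply lists the scalar keys from the sample above and is not a rule enforced by the client.

```python
import json
from pathlib import Path

# Keys taken from the sample config above (an assumption, not a client rule)
REQUIRED_KEYS = {"grobid_server", "batch_size", "sleep_time", "timeout"}


def write_client_config(path, server="http://localhost:8070", batch_size=1000,
                        sleep_time=5, timeout=60,
                        coordinates=("persName", "figure", "ref", "formula", "s")):
    """Write a grobid_config.json file and return the config as a dict."""
    config = {
        "grobid_server": server,
        "batch_size": batch_size,
        "sleep_time": sleep_time,
        "timeout": timeout,
        "coordinates": list(coordinates),
    }
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


def load_client_config(path):
    """Load the config and fail early if an expected key is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"Missing config keys: {sorted(missing)}")
    return config
```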

Parsing TEI XML Output

from lxml import etree

# Parse GROBID TEI XML output
ns = {"tei": "http://www.tei-c.org/ns/1.0"}

tree = etree.parse("output.tei.xml")
root = tree.getroot()

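# Extract title (guard against a missing element)
title = root.find(".//tei:titleStmt/tei:title", ns)
if title is not None:
    title_text = "".join(title.itertext()).strip()
    print(f"Title: {title_text}")

# Extract abstract (GROBID may nest the paragraphs inside <div> elements)
abstract = root.find(".//tei:profileDesc/tei:abstract", ns)
if abstract is not None:
    abstract_text = etree.tostring(abstract, method="text", encoding="unicode").strip()
    print(f"Abstract: {abstract_text}")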

# Extract authors
authors = root.findall(".//tei:fileDesc/tei:sourceDesc/tei:biblStruct/tei:analytic/tei:author", ns)
for author in authors:
    forename = author.find(".//tei:forename", ns)
    surname = author.find(".//tei:surname", ns)
    if forename is not None and surname is not None:
        print(f"Author: {forename.text} {surname.text}")

# Extract references
refs = root.findall(".//tei:listBibl/tei:biblStruct", ns)
for ref in refs:
    ref_title = ref.find(".//tei:analytic/tei:title", ns)
    if ref_title is not None:
        print(f"Reference: {ref_title.text}")

# Extract body sections
divs = root.findall(".//tei:body/tei:div", ns)
for div in divs:
    head = div.find("tei:head", ns)
    if head is not None:
        print(f"\nSection: {head.text}")
    for p in div.findall("tei:p", ns):
        text = etree.tostring(p, method="text", encoding="unicode")
        print(f"  {text[:100]}...")
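
The reference loop above prints only titles. A slightly fuller sketch that collects identifier, title, author surnames, year, and DOI per reference is shown below. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and the SAMPLE document is a hand-written fragment in the shape of GROBID output, with a placeholder DOI, not real GROBID output.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hand-written sample; the DOI is a placeholder, not a real identifier
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div type="references"><listBibl>
    <biblStruct xml:id="b0">
      <analytic>
        <title level="a" type="main">Attention Is All You Need</title>
        <author><persName><surname>Vaswani</surname></persName></author>
        <idno type="DOI">10.0000/example</idno>
      </analytic>
      <monogr>
        <title level="m">NeurIPS</title>
        <imprint><date type="published" when="2017" /></imprint>
      </monogr>
    </biblStruct>
  </listBibl></div></back></text>
</TEI>"""


def extract_references(root):
    """Return one dict per biblStruct found under listBibl."""
    refs = []
    for bibl in root.findall(".//tei:listBibl/tei:biblStruct", TEI_NS):
        # Articles carry their title in <analytic>, books in <monogr>
        title = bibl.find("tei:analytic/tei:title", TEI_NS)
        if title is None:
            title = bibl.find("tei:monogr/tei:title", TEI_NS)
        surnames = [s.text for s in
                    bibl.findall(".//tei:author//tei:surname", TEI_NS) if s.text]
        date = bibl.find(".//tei:imprint/tei:date", TEI_NS)
        doi = bibl.find(".//tei:idno[@type='DOI']", TEI_NS)
        refs.append({
            "id": bibl.get(XML_ID),
            "title": "".join(title.itertext()).strip() if title is not None else "",
            "authors": surnames,
            "year": (date.get("when") or "")[:4] if date is not None else "",
            "doi": doi.text.strip() if doi is not None and doi.text else "",
        })
    return refs


if __name__ == "__main__":
    for ref in extract_references(ET.fromstring(SAMPLE)):
        print(ref)
```

For a file on disk, ET.parse("output.tei.xml").getroot() can be passed in place of the parsed sample.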

Configuration

Server Configuration

# grobid-home/config/grobid.yaml
grobid:
  server:
    port: 8070
    modelPreload: true

  consolidation:
    enabled: true
    service: crossref

  proxy:
    host: null
    port: 0

  pdf:
    pdfalto:
      timeout: 60

  models:
    header:
      engine: delft
      architecture: BidLSTM_CRF
    fulltext:
      engine: delft
      architecture: BidLSTM_CRF
    citation:
      engine: delft
      architecture: BidLSTM_CRF
    segmentation:
      engine: delft
      architecture: BidLSTM_CRF

Docker Resource Limits

# docker-compose.yml
version: '3.8'
services:
  grobid:
    image: lfoppiano/grobid:0.8.1-full
    ports:
      - "8070:8070"
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: '4'
    environment:
      - JAVA_OPTS=-Xmx6g -Xms2g

Advanced Usage

Batch Processing

import glob
from grobid_client.grobid_client import GrobidClient

client = GrobidClient(config_path="./grobid_config.json")

# Process all PDFs in directory
client.process(
    service="processFulltextDocument",
    input_path="./papers/",
    output="./tei_output/",
    consolidate_header=True,
    consolidate_citations=True,
    n=20,          # 20 concurrent requests
    force=True     # Reprocess existing
)

# Count processed files
tei_files = glob.glob("./tei_output/*.xml")
print(f"Processed {len(tei_files)} documents")
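
Counting output files does not show which PDFs failed. A small sketch for listing inputs that have no matching TEI file follows. It assumes the client names each output after the input file's stem plus .grobid.tei.xml; that suffix is an assumption about the client version in use, so it is exposed as a parameter. The function name find_unprocessed is illustrative.

```python
from pathlib import Path


def find_unprocessed(pdf_dir, tei_dir, suffix=".grobid.tei.xml"):
    """Return the PDFs in pdf_dir that have no matching TEI file in tei_dir."""
    tei_dir = Path(tei_dir)
    missing = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Expected output name: <input stem> + suffix (assumed convention)
        if not (tei_dir / (pdf.stem + suffix)).exists():
            missing.append(pdf)
    return missing
```

Running find_unprocessed("./papers/", "./tei_output/") after a batch gives the list of files to inspect or resubmit.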

Parse Raw Citation Strings

# Parse a raw citation string
curl -X POST http://localhost:8070/api/processCitation \
  -d "citations=Vaswani et al. Attention Is All You Need. NeurIPS 2017." \
  -H "Content-Type: application/x-www-form-urlencoded"
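
The same request can be prepared from Python with only the standard library. The sketch below builds the form-encoded POST for /api/processCitation without sending it, so it can be inspected or tested offline. The function name build_citation_request is illustrative.

```python
import urllib.parse
import urllib.request


def build_citation_request(citation, base_url="http://localhost:8070",
                           consolidate=0):
    """Build (but do not send) a POST request for /api/processCitation."""
    body = urllib.parse.urlencode({
        "citations": citation,
        "consolidateCitations": consolidate,
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/processCitation",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

With a server running, urllib.request.urlopen(build_citation_request("...")).read().decode("utf-8") would return the parsed citation as XML.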

Extract Structured Metadata

from lxml import etree

def extract_metadata(tei_path):
    ns = {"tei": "http://www.tei-c.org/ns/1.0"}
    tree = etree.parse(tei_path)
    root = tree.getroot()

    metadata = {
        "title": "",
        "authors": [],
        "abstract": "",
        "references": [],
        "sections": []
    }

    # Title
    title_el = root.find(".//tei:titleStmt/tei:title", ns)
    if title_el is not None:
        metadata["title"] = title_el.text or ""

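    # Authors (header only; reference entries also contain analytic/author)
    for author in root.findall(".//tei:sourceDesc/tei:biblStruct/tei:analytic/tei:author", ns):
        name_parts = []
        for fn in author.findall(".//tei:forename", ns):
            if fn.text:
                name_parts.append(fn.text)
        for sn in author.findall(".//tei:surname", ns):
            if sn.text:
                name_parts.append(sn.text)
        if name_parts:
            metadata["authors"].append(" ".join(name_parts))

    # Abstract
    abstract_el = root.find(".//tei:profileDesc/tei:abstract", ns)
    if abstract_el is not None:
        metadata["abstract"] = etree.tostring(abstract_el, method="text", encoding="unicode").strip()

    # Reference titles
    for ref in root.findall(".//tei:listBibl/tei:biblStruct", ns):
        ref_title = ref.find(".//tei:title", ns)
        if ref_title is not None and ref_title.text:
            metadata["references"].append(ref_title.text)

    # Section headings
    for head in root.findall(".//tei:body/tei:div/tei:head", ns):
        if head.text:
            metadata["sections"].append(head.text)

    return metadata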

Troubleshooting

Issue | Solution
Server won’t start | Check that Java 11+ is installed; increase the heap with -Xmx4g
Timeout on large PDFs | Increase the timeout in the config, or use processHeaderDocument
Poor extraction quality | Use the -full Docker image with deep learning models
OutOfMemoryError | Increase the Java heap: JAVA_OPTS=-Xmx8g
CrossRef consolidation slow | Disable consolidation or use a local cache
Empty TEI output | Check that the PDF is not corrupted; try a different PDF
Port 8070 in use | Change the port in the config or in the Docker mapping
High CPU under load | Limit concurrent requests and add resource limits

# Health check
curl http://localhost:8070/api/isalive

# View version
curl http://localhost:8070/api/version

# Check Docker logs
docker logs grobid

# Test with sample
curl -X POST http://localhost:8070/api/processHeaderDocument \
  -F "input=@test.pdf" | head -50
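
Under load, GROBID answers HTTP 503 when all of its worker threads are busy, and the client is expected to wait and try again. A minimal retry sketch with exponential backoff follows. The send argument is a hypothetical zero-argument callable, supplied by the caller, that performs one request and returns a (status, body) pair; the function name and defaults are illustrative.

```python
import time


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 503.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so on
    between attempts, and returns the last (status, body) pair.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    return status, body
```

Lowering the number of concurrent requests remains the first fix; the retry only smooths over short bursts.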