Unstructured Cheat Sheet

Overview

Unstructured is an open-source toolkit for preprocessing and extracting content from documents for machine learning and LLM pipelines. It handles 25+ file formats including PDFs, Word documents, HTML, images, emails, and spreadsheets, extracting clean text, tables, images, and metadata. The library provides partition functions that break documents into typed elements (titles, narrative text, tables, list items) for downstream RAG and NLP applications.

The project offers both a Python library for local processing and a hosted API for production workloads. It supports several extraction strategies, from fast rule-based text extraction to slower, higher-quality extraction that combines Tesseract OCR with layout detection models. Unstructured is commonly used as the ingestion layer for RAG systems, feeding clean, structured content to embedding models and vector databases.

Installation

# Basic install
pip install unstructured

# With PDF support
pip install "unstructured[pdf]"

# With all document types
pip install "unstructured[all-docs]"

# System dependencies for PDF/OCR
# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr poppler-utils libmagic-dev

# macOS
brew install tesseract poppler libmagic

Core Usage

Partition Documents

from unstructured.partition.auto import partition

# Auto-detect file type and partition
elements = partition(filename="document.pdf")

for element in elements:
    print(f"Type: {type(element).__name__}")
    print(f"Text: {element.text[:100]}")
    print(f"Metadata: {element.metadata.to_dict()}")
    print("---")

File-Specific Partitioners

from unstructured.partition.pdf import partition_pdf
from unstructured.partition.html import partition_html
from unstructured.partition.docx import partition_docx
from unstructured.partition.pptx import partition_pptx
from unstructured.partition.email import partition_email
from unstructured.partition.csv import partition_csv
from unstructured.partition.md import partition_md

# PDF with high-quality extraction
elements = partition_pdf(
    filename="paper.pdf",
    strategy="hi_res",            # fast, ocr_only, hi_res, auto
    infer_table_structure=True,   # Extract table structure
    extract_image_block_types=["Image"],  # Extract embedded images (replaces the deprecated extract_images_in_pdf)
    languages=["eng"]             # OCR language
)

# HTML
elements = partition_html(url="https://example.com/article")

# Word document
elements = partition_docx(filename="report.docx")

# PowerPoint
elements = partition_pptx(filename="presentation.pptx")

# Email (.eml)
elements = partition_email(filename="message.eml")

# Markdown
elements = partition_md(filename="README.md")

Element Types

Element Type	Description
`Title`	Section headers and titles
`NarrativeText`	Body paragraphs
`ListItem`	Bullet or numbered list items
`Table`	Tabular data (HTML structure in `metadata.text_as_html`)
`Image`	Extracted or referenced images
`FigureCaption`	Image/figure captions
`Header`	Page headers
`Footer`	Page footers
`Address`	Mailing/physical addresses
`EmailAddress`	Email addresses
`PageBreak`	Page break markers
`Formula`	Mathematical formulas
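
To make the typed-element model concrete, the sketch below works on plain dictionaries shaped like the output of `el.to_dict()` (keys `type`, `element_id`, `text`, `metadata`). The sample records are invented for illustration, and `count_by_type` and `texts_of_type` are hypothetical helper names, not library functions.

```python
from collections import Counter

# Hypothetical records shaped like el.to_dict() output; all values are invented.
sample_elements = [
    {"type": "Title", "element_id": "a1", "text": "Quarterly Report",
     "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "element_id": "b2", "text": "Revenue grew this quarter.",
     "metadata": {"page_number": 1}},
    {"type": "ListItem", "element_id": "c3", "text": "Hire two engineers",
     "metadata": {"page_number": 2}},
    {"type": "NarrativeText", "element_id": "d4", "text": "Costs stayed flat.",
     "metadata": {"page_number": 2}},
]


def count_by_type(records):
    """Count how many elements of each type a document produced."""
    return Counter(record["type"] for record in records)


def texts_of_type(records, element_type):
    """Return the text of every element with the given type, in document order."""
    return [record["text"] for record in records if record["type"] == element_type]


counts = count_by_type(sample_elements)
body = texts_of_type(sample_elements, "NarrativeText")
print(counts)
print(body)
```

Filtering on `type` like this is a common first step before embedding, for example keeping `NarrativeText` and `ListItem` while dropping `Header` and `Footer`.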

PDF Strategies

Strategy	Speed	Quality	Requirements
`fast`	Fastest	Lower	pdfminer only
`ocr_only`	Slow	Good for scans	Tesseract
`hi_res`	Slowest	Best	Tesseract + detectron2/YOLOX
`auto`	Varies	Adaptive	All dependencies
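
The table can be read as a simple decision rule. The helper below is a hypothetical heuristic written for this cheat sheet; it is not the logic behind the library's `auto` strategy, and `has_text_layer` and `needs_tables` are assumed inputs you would determine yourself.

```python
def choose_pdf_strategy(has_text_layer: bool, needs_tables: bool = False) -> str:
    """Pick a strategy name from the table above (illustrative heuristic only)."""
    if needs_tables:
        # Table structure inference relies on the layout model used by hi_res.
        return "hi_res"
    if not has_text_layer:
        # Scanned pages have no embedded text, so OCR is required.
        return "ocr_only"
    # Digital PDFs with embedded text can use the fastest path.
    return "fast"


print(choose_pdf_strategy(has_text_layer=True))
print(choose_pdf_strategy(has_text_layer=False))
print(choose_pdf_strategy(has_text_layer=True, needs_tables=True))
```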

Chunking

from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# Chunk by section titles
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1000,
    combine_text_under_n_chars=200,
    multipage_sections=True
)

# Basic chunking
chunks = chunk_elements(
    elements,
    max_characters=1000,
    overlap=200
)

for chunk in chunks:
    print(f"Chunk ({len(chunk.text)} chars): {chunk.text[:80]}...")
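
To build intuition for how `max_characters` and `overlap` interact, here is a simplified character-window sketch. It is not the library's algorithm: the real chunkers work on elements and prefer natural boundaries, while `sliding_chunks` (a hypothetical name) only shows the size and overlap arithmetic on a raw string.

```python
def sliding_chunks(text: str, max_characters: int, overlap: int) -> list:
    """Split text into windows of at most max_characters that share `overlap` characters."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")

    step = max_characters - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


demo = sliding_chunks("abcdefghij", max_characters=4, overlap=1)
print(demo)
```

Each window starts `max_characters - overlap` characters after the previous one, so a larger overlap means more chunks for the same text.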

Staging and Output

from unstructured.staging.base import elements_to_json, elements_from_json

# Export to JSON
elements_to_json(elements, filename="output.json")

# Load from JSON
loaded = elements_from_json(filename="output.json")

# Convert to dictionaries
dicts = [el.to_dict() for el in elements]

# Convert to DataFrame
import pandas as pd
df = pd.DataFrame([el.to_dict() for el in elements])
print(df[["type", "text"]].head())
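
Once elements are serialized, downstream code only needs the standard library to consume them. The sketch below assumes the JSON layout produced by `to_dict()` (a list of objects with `type`, `text`, and `metadata`); the sample payload is invented, and `text_by_page` is a hypothetical helper.

```python
import json
from collections import defaultdict

# Hypothetical JSON payload in the serialized element layout; values are invented.
raw = json.dumps([
    {"type": "Title", "text": "Intro", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "First paragraph.", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Second page text.", "metadata": {"page_number": 2}},
])


def text_by_page(records):
    """Join element text per page; elements without a page number land under None."""
    pages = defaultdict(list)
    for record in records:
        page = record.get("metadata", {}).get("page_number")
        pages[page].append(record["text"])
    return {page: "\n".join(texts) for page, texts in pages.items()}


pages = text_by_page(json.loads(raw))
for page, text in pages.items():
    print(f"Page {page}: {text!r}")
```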

Connectors (Ingest)

from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import ProcessorConfig, ReadConfig
from unstructured.ingest.runner import LocalRunner

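# Process directory of documents
# (legacy runner API; newer releases ship ingest as the separate
# unstructured-ingest package with a different, pipeline-based interface)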
runner = LocalRunner(
    processor_config=ProcessorConfig(
        output_dir="./output",
        num_processes=4,
    ),
    read_config=ReadConfig(),
    connector_config=SimpleLocalConfig(
        input_path="./documents/",
        recursive=True,
    ),
)
runner.run()

Cloud Source Connectors

# Process from S3
unstructured-ingest \
  s3 \
  --remote-url s3://bucket/documents/ \
  --output-dir ./output \
  --strategy hi_res \
  --num-processes 4

# Process from Google Drive
unstructured-ingest \
  google-drive \
  --drive-id YOUR_DRIVE_ID \
  --output-dir ./output \
  --service-account-key service_account.json

# Process from Confluence
unstructured-ingest \
  confluence \
  --url https://your-org.atlassian.net \
  --user-email user@example.com \
  --api-token YOUR_TOKEN \
  --output-dir ./output

Destination Connectors

# Ingest to Pinecone
unstructured-ingest \
  local \
  --input-path ./documents/ \
  --output-dir ./output \
  --strategy hi_res \
  --embedding-provider openai \
  --embedding-model text-embedding-3-small \
  pinecone \
  --api-key YOUR_PINECONE_KEY \
  --index-name documents

# Ingest to Weaviate
unstructured-ingest \
  local \
  --input-path ./documents/ \
  --output-dir ./output \
  weaviate \
  --host-url http://localhost:8080 \
  --class-name Documents
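
The commands above share one shape: the source subcommand and its flags come first, then the destination subcommand and its flags. The sketch below assembles such an argument list in Python so it can be built from configuration. `build_ingest_command` is a hypothetical helper, and it only reuses flags that already appear in this cheat sheet; it does not validate them against the CLI.

```python
import shlex


def build_ingest_command(source, source_flags, destination=None, destination_flags=None):
    """Build an argv list: source subcommand and flags, then optional destination."""
    argv = ["unstructured-ingest", source]
    for flag, value in source_flags.items():
        argv += [f"--{flag}", str(value)]
    if destination is not None:
        argv.append(destination)
        for flag, value in (destination_flags or {}).items():
            argv += [f"--{flag}", str(value)]
    return argv


argv = build_ingest_command(
    "local",
    {"input-path": "./documents/", "output-dir": "./output"},
    destination="weaviate",
    destination_flags={"host-url": "http://localhost:8080", "class-name": "Documents"},
)
print(shlex.join(argv))  # shell-safe rendering for logging or copy-paste
```

With the CLI installed, the list can be passed to `subprocess.run(argv, check=True)`; building a list rather than a string avoids shell-quoting problems in paths.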

Configuration

Environment Variables

export UNSTRUCTURED_API_KEY=your-api-key
export UNSTRUCTURED_API_URL=https://api.unstructured.io
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata  # path varies by Tesseract version and distro
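
A small loader keeps these settings in one place and fails early when the key is missing. This is a sketch: `load_api_settings` is a hypothetical helper, and the default URL is simply the value shown above, not a verified endpoint for your account.

```python
import os


def load_api_settings(env=None):
    """Read API settings from the environment (or from a mapping passed in for testing)."""
    env = os.environ if env is None else env
    api_key = env.get("UNSTRUCTURED_API_KEY")
    if not api_key:
        raise RuntimeError("UNSTRUCTURED_API_KEY is not set")
    return {
        "api_key": api_key,
        # Falls back to the URL used in the export example above.
        "api_url": env.get("UNSTRUCTURED_API_URL", "https://api.unstructured.io"),
    }


settings = load_api_settings({"UNSTRUCTURED_API_KEY": "your-api-key"})
print(settings["api_url"])
```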

API Client

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared

client = UnstructuredClient(api_key_auth="YOUR_API_KEY")

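# Newer unstructured-client releases expect these parameters wrapped in
# operations.PartitionRequest(partition_parameters=...); check your installed version.
with open("document.pdf", "rb") as f: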
    response = client.general.partition(
        request=shared.PartitionParameters(
            files=shared.Files(content=f.read(), file_name="document.pdf"),
            strategy=shared.Strategy.HI_RES,
            languages=["eng"],
            chunking_strategy="by_title",
            max_characters=1500,
        )
    )

elements = response.elements

Advanced Usage

Table Extraction

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="tables.pdf", strategy="hi_res", infer_table_structure=True)

tables = [el for el in elements if el.category == "Table"]
for table in tables:
    print(table.metadata.text_as_html)  # HTML table structure
    print(table.text)                    # Plain text
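
The HTML in `text_as_html` can be turned into rows with the standard library alone. The sketch below assumes simple, non-nested tables; `TableRows` and `html_table_to_rows` are hypothetical names, and the sample HTML is invented rather than real extractor output.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect cell text from <tr>/<td>/<th> tags into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample_html = (
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Bolt</td><td>12</td></tr></table>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```

If pandas and an HTML parser backend are installed, `pandas.read_html` is an alternative that returns DataFrames directly.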

Metadata Access

for element in elements:
    meta = element.metadata
    print(f"Filename: {meta.filename}")
    print(f"Page: {meta.page_number}")
    print(f"Coordinates: {meta.coordinates}")
    print(f"Languages: {meta.languages}")
    print(f"File type: {meta.filetype}")
    print(f"Parent ID: {meta.parent_id}")
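
`parent_id` can be used to rebuild the section hierarchy. The sketch below assumes that `parent_id` holds the `element_id` of the enclosing `Title`, and it works on invented dictionaries shaped like `to_dict()` output; `group_by_parent` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical records; ids and text are invented for illustration.
records = [
    {"type": "Title", "element_id": "t1", "text": "Introduction", "metadata": {}},
    {"type": "NarrativeText", "element_id": "n1", "text": "Why this matters.",
     "metadata": {"parent_id": "t1"}},
    {"type": "Title", "element_id": "t2", "text": "Methods", "metadata": {}},
    {"type": "ListItem", "element_id": "l1", "text": "Collect data",
     "metadata": {"parent_id": "t2"}},
    {"type": "ListItem", "element_id": "l2", "text": "Fit model",
     "metadata": {"parent_id": "t2"}},
]


def group_by_parent(records):
    """Map each title's text to the text of the elements that point at it."""
    titles = {r["element_id"]: r["text"] for r in records if r["type"] == "Title"}
    sections = defaultdict(list)
    for record in records:
        parent_id = record.get("metadata", {}).get("parent_id")
        if parent_id in titles:  # skips elements with no parent or an unknown one
            sections[titles[parent_id]].append(record["text"])
    return dict(sections)


sections = group_by_parent(records)
print(sections)
```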

Cleaning Functions

from unstructured.cleaners.core import (
    clean,
    clean_extra_whitespace,
    clean_non_ascii_chars,
    replace_unicode_quotes,
    group_broken_paragraphs,
)

text = "  Some   messy   text  with   extra  spaces  "
cleaned = clean_extra_whitespace(text)
# "Some messy text with extra spaces"

text_with_unicode = "\x93A lovely quote!\x94"  # Windows-1252 quote bytes decoded as Latin-1
cleaned = replace_unicode_quotes(text_with_unicode)
# "“A lovely quote!”"
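
Cleaners are plain `str -> str` functions, so they compose naturally. The sketch below chains two stand-in functions written with the standard library; `collapse_whitespace` and `drop_non_ascii` are rough approximations made up for this example, not the library's implementations, and `apply_cleaners` is a hypothetical helper.

```python
import re
from functools import reduce


def collapse_whitespace(text):
    """Stand-in cleaner: squeeze whitespace runs to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def drop_non_ascii(text):
    """Stand-in cleaner: discard characters outside the ASCII range."""
    return text.encode("ascii", "ignore").decode("ascii")


def apply_cleaners(text, cleaners):
    """Run each cleaner in order, feeding the output of one into the next."""
    return reduce(lambda acc, cleaner: cleaner(acc), cleaners, text)


result = apply_cleaners(
    "  Caf\u00e9   menu \u2022  open  ",
    [drop_non_ascii, collapse_whitespace],
)
print(result)
```

Order matters here: dropping characters first can leave extra spaces behind, so the whitespace cleaner runs last. The imported library cleaners can be passed in the same list.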

Troubleshooting

Issue	Solution
`libmagic` not found	Install: `apt install libmagic-dev` or `brew install libmagic`
Tesseract not found	Install: `apt install tesseract-ocr`
Poor PDF extraction	Switch to `strategy="hi_res"` and install the layout-model dependencies (`pip install "unstructured[pdf]"`)
Table structure missing	Set `infer_table_structure=True`
Slow processing	Use `strategy="fast"` or increase `num_processes`
Out of memory on large PDFs	Process pages in batches, reduce image extraction
OCR language errors	Install language pack: `apt install tesseract-ocr-deu`
Empty elements returned	Check that the file is not corrupted; try a different strategy

# Verify dependencies
python -c "from unstructured.partition.pdf import partition_pdf; print('PDF support OK')"
tesseract --version
pdftotext -v 2>&1 | head -1
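
The same checks can be scripted in Python, which is handy in CI or at application startup. This sketch only tests whether the package and the two executables can be found; unlike the shell one-liner above, it does not import `partition_pdf`, so it will not catch a missing PDF extra. `check_dependencies` is a hypothetical helper.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report whether the Python package and the system tools can be located."""
    return {
        "unstructured": importlib.util.find_spec("unstructured") is not None,
        "tesseract": shutil.which("tesseract") is not None,
        "pdftotext": shutil.which("pdftotext") is not None,  # ships with poppler-utils
    }


status = check_dependencies()
for name, found in status.items():
    print(f"{name}: {'OK' if found else 'missing'}")
```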