Unstructured Cheat Sheet

Overview

Unstructured is an open-source toolkit for preprocessing and extracting content from documents for machine learning and LLM pipelines. It handles 25+ file formats including PDFs, Word documents, HTML, images, emails, and spreadsheets, extracting clean text, tables, images, and metadata. The library provides partition functions that break documents into typed elements (titles, narrative text, tables, list items) for downstream RAG and NLP applications.

The project offers both a Python library for local processing and a hosted API for production workloads. Extraction strategies range from fast, rule-based text extraction to slower, higher-quality OCR with Tesseract and layout detection models. Unstructured is commonly used as the ingestion layer for RAG systems, feeding clean, structured content to embedding models and vector databases.

Installation

# Basic install
pip install unstructured

# With PDF support
pip install "unstructured[pdf]"

# With all document types
pip install "unstructured[all-docs]"

# System dependencies for PDF/OCR
# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr poppler-utils libmagic-dev

# macOS
brew install tesseract poppler libmagic

Core Usage

Partition Documents

from unstructured.partition.auto import partition

# Auto-detect file type and partition
elements = partition(filename="document.pdf")

for element in elements:
    print(f"Type: {type(element).__name__}")
    print(f"Text: {element.text[:100]}")
    print(f"Metadata: {element.metadata}")
    print("---")
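A quick sanity check after partitioning is to tally elements by type. The sketch below works on plain dictionaries shaped like the output of `element.to_dict()` (a `type` and a `text` key). The sample records and the `summarize_types` helper are made up for illustration and are not part of the library.

```python
from collections import Counter

# Hypothetical records shaped like element.to_dict() output.
sample_elements = [
    {"type": "Title", "text": "Quarterly Report"},
    {"type": "NarrativeText", "text": "Revenue grew in the third quarter."},
    {"type": "NarrativeText", "text": "Costs were flat."},
    {"type": "Table", "text": "Region Revenue North 10 South 12"},
]


def summarize_types(elements):
    """Count how many elements of each type a document produced."""
    return Counter(el["type"] for el in elements)


summary = summarize_types(sample_elements)
for element_type, count in summary.most_common():
    print(f"{element_type}: {count}")
```

A document that comes back as mostly `NarrativeText` with no `Title` elements is often a hint that a different strategy is worth trying.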

File-Specific Partitioners

from unstructured.partition.pdf import partition_pdf
from unstructured.partition.html import partition_html
from unstructured.partition.docx import partition_docx
from unstructured.partition.pptx import partition_pptx
from unstructured.partition.email import partition_email
from unstructured.partition.csv import partition_csv
from unstructured.partition.md import partition_md

# PDF with high-quality extraction
elements = partition_pdf(
    filename="paper.pdf",
    strategy="hi_res",            # fast, ocr_only, hi_res, auto
    infer_table_structure=True,   # Extract table structure
    extract_images_in_pdf=True,   # Extract embedded images (deprecated in newer releases in favor of extract_image_block_types)
    languages=["eng"]             # OCR language
)

# HTML
elements = partition_html(url="https://example.com/article")

# Word document
elements = partition_docx(filename="report.docx")

# PowerPoint
elements = partition_pptx(filename="presentation.pptx")

# Email (.eml)
elements = partition_email(filename="message.eml")

# Markdown
elements = partition_md(filename="README.md")

Element Types

| Element Type | Description |
|---|---|
| Title | Section headers and titles |
| NarrativeText | Body paragraphs |
| ListItem | Bullet or numbered list items |
| Table | Tabular data (with HTML structure) |
| Image | Extracted or referenced images |
| FigureCaption | Image/figure captions |
| Header | Page headers |
| Footer | Page footers |
| Address | Mailing/physical addresses |
| EmailAddress | Email addresses |
| PageBreak | Page break markers |
| Formula | Mathematical formulas |

PDF Strategies

| Strategy | Speed | Quality | Requirements |
|---|---|---|---|
| fast | Fastest | Lower | pdfminer only |
| ocr_only | Slow | Good for scans | Tesseract |
| hi_res | Slowest | Best | Tesseract + detectron2/YOLOX |
| auto | Varies | Adaptive | All dependencies |
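The trade-offs in the table can be turned into a small decision helper. This is an illustrative heuristic only, not the logic behind the library's `auto` strategy; the function name and its two boolean inputs are assumptions.

```python
def choose_strategy(has_text_layer: bool, needs_layout: bool) -> str:
    """Pick a partition_pdf strategy name from two coarse document traits.

    Illustrative heuristic; this is not how the "auto" strategy decides.
    """
    if not has_text_layer:
        # Scanned pages carry no embedded text, so OCR is unavoidable.
        return "hi_res" if needs_layout else "ocr_only"
    # Text is extractable; pay for layout detection only when it matters
    # (for example, when table structure is needed).
    return "hi_res" if needs_layout else "fast"


print(choose_strategy(has_text_layer=True, needs_layout=False))
print(choose_strategy(has_text_layer=False, needs_layout=False))
print(choose_strategy(has_text_layer=True, needs_layout=True))
```

The returned string can be passed as the `strategy` argument of `partition_pdf`.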

Chunking

from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# Chunk by section titles
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1000,
    combine_text_under_n_chars=200,
    multipage_sections=True
)

# Basic chunking
chunks = chunk_elements(
    elements,
    max_characters=1000,
    overlap=200
)

for chunk in chunks:
    print(f"Chunk ({len(chunk.text)} chars): {chunk.text[:80]}...")
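To build intuition for how `max_characters` and `overlap` interact, here is a plain-Python sliding window over a string. It is a simplified stand-in, not the library's algorithm: Unstructured's chunkers operate on elements and prefer to split at element boundaries. The `window_chunks` name is hypothetical.

```python
def window_chunks(text: str, max_characters: int, overlap: int) -> list[str]:
    """Split text into windows of max_characters that share `overlap` chars."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be non-negative and smaller than max_characters")

    step = max_characters - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


text = "abcdefghij" * 3  # 30 characters
chunks = window_chunks(text, max_characters=12, overlap=4)
for chunk in chunks:
    print(len(chunk), chunk)
```

Each window starts with the last `overlap` characters of the previous one, which keeps context that straddles a boundary available to both chunks.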

Staging and Output

from unstructured.staging.base import elements_to_json, elements_from_json

# Export to JSON
elements_to_json(elements, filename="output.json")

# Load from JSON
loaded = elements_from_json(filename="output.json")

# Convert to dictionaries
dicts = [el.to_dict() for el in elements]

# Convert to DataFrame
import pandas as pd
df = pd.DataFrame([el.to_dict() for el in elements])
print(df[["type", "text"]].head())
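Vector-database loaders often expect one JSON record per line. The sketch below writes element dictionaries as JSON Lines with the standard library, dropping page furniture on the way. The sample records, the `SKIP_TYPES` set, and the `write_jsonl` helper are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records shaped like element.to_dict() output.
records = [
    {"type": "Title", "text": "Intro", "metadata": {"page_number": 1}},
    {"type": "Header", "text": "ACME Corp", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Body text.", "metadata": {"page_number": 1}},
]

# Element types that rarely help retrieval (an assumption; adjust as needed).
SKIP_TYPES = {"Header", "Footer", "PageBreak"}


def write_jsonl(records, path):
    """Write one JSON object per line, skipping unwanted element types."""
    kept = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            if record["type"] in SKIP_TYPES:
                continue
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept


out_path = Path(tempfile.mkdtemp()) / "elements.jsonl"
written = write_jsonl(records, out_path)
print(f"Wrote {written} records to {out_path}")
```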

Connectors (Ingest)

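# Legacy ingest API bundled with older unstructured releases; newer releases
# ship ingest as the separate unstructured-ingest package with a different API.
from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig
from unstructured.ingest.runner import LocalRunner

# Guard the entry point: num_processes > 1 uses multiprocessing
if __name__ == "__main__":
    # Process directory of documents
    runner = LocalRunner(
        processor_config=ProcessorConfig(
            output_dir="./output",
            num_processes=4,
        ),
        read_config=ReadConfig(),
        partition_config=PartitionConfig(),
        connector_config=SimpleLocalConfig(
            input_path="./documents/",
            recursive=True,
        ),
    )
    runner.run()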

Cloud Source Connectors

# Process from S3
unstructured-ingest \
  s3 \
  --remote-url s3://bucket/documents/ \
  --output-dir ./output \
  --strategy hi_res \
  --num-processes 4

# Process from Google Drive
unstructured-ingest \
  google-drive \
  --drive-id YOUR_DRIVE_ID \
  --output-dir ./output \
  --service-account-key service_account.json

# Process from Confluence
unstructured-ingest \
  confluence \
  --url https://your-org.atlassian.net \
  --user-email user@example.com \
  --api-token YOUR_TOKEN \
  --output-dir ./output

Destination Connectors

# Ingest to Pinecone
unstructured-ingest \
  local \
  --input-path ./documents/ \
  --output-dir ./output \
  --strategy hi_res \
  --embedding-provider openai \
  --embedding-model text-embedding-3-small \
  pinecone \
  --api-key YOUR_PINECONE_KEY \
  --index-name documents

# Ingest to Weaviate
unstructured-ingest \
  local \
  --input-path ./documents/ \
  --output-dir ./output \
  weaviate \
  --host-url http://localhost:8080 \
  --class-name Documents

Configuration

Environment Variables

export UNSTRUCTURED_API_KEY=your-api-key
export UNSTRUCTURED_API_URL=https://api.unstructured.io
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
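On the Python side, these variables can be read once at startup so that a missing key fails early. A minimal sketch: the `load_api_settings` helper is hypothetical, and the fallback URL simply mirrors the value exported above.

```python
import os


def load_api_settings(env=os.environ):
    """Read API settings from a mapping of environment variables."""
    api_key = env.get("UNSTRUCTURED_API_KEY")
    if not api_key:
        raise RuntimeError("UNSTRUCTURED_API_KEY is not set")
    return {
        "api_key": api_key,
        # Fall back to the URL used in this cheat sheet when none is exported.
        "api_url": env.get("UNSTRUCTURED_API_URL", "https://api.unstructured.io"),
    }


# Passing a plain dict keeps the example independent of the real environment.
settings = load_api_settings({"UNSTRUCTURED_API_KEY": "demo-key"})
print(settings["api_url"])
```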

API Client

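from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

client = UnstructuredClient(api_key_auth="YOUR_API_KEY")

# Read the file into memory before building the request
with open("document.pdf", "rb") as f:
    file_content = f.read()

# Recent client releases wrap the parameters in operations.PartitionRequest;
# older releases accepted shared.PartitionParameters directly.
request = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(content=file_content, file_name="document.pdf"),
        strategy=shared.Strategy.HI_RES,
        languages=["eng"],
        chunking_strategy="by_title",
        max_characters=1500,
    )
)
response = client.general.partition(request=request)

elements = response.elements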

Advanced Usage

Table Extraction

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="tables.pdf", strategy="hi_res", infer_table_structure=True)

tables = [el for el in elements if el.category == "Table"]
for table in tables:
    print(table.metadata.text_as_html)  # HTML table structure
    print(table.text)                    # Plain text
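The HTML in `text_as_html` can be turned into rows without extra dependencies. The sketch below uses the standard library's `html.parser`; the sample HTML string and the `TableRows` / `html_table_to_rows` names are assumptions, and real output may contain nested markup this simple parser does not handle.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any unclosed cell


def html_table_to_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical stand-in for table.metadata.text_as_html
sample_html = (
    "<table><thead><tr><th>Region</th><th>Revenue</th></tr></thead>"
    "<tbody><tr><td>North</td><td>10</td></tr>"
    "<tr><td>South</td><td>12</td></tr></tbody></table>"
)
rows = html_table_to_rows(sample_html)
for row in rows:
    print(row)
```

If pandas and an HTML parser backend are available, `pandas.read_html` is a common alternative for the same job.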

Metadata Access

for element in elements:
    meta = element.metadata
    print(f"Filename: {meta.filename}")
    print(f"Page: {meta.page_number}")
    print(f"Coordinates: {meta.coordinates}")
    print(f"Languages: {meta.languages}")
    print(f"File type: {meta.filetype}")
    print(f"Parent ID: {meta.parent_id}")
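Page numbers make it easy to reassemble text per page, for example to cite a page in a RAG answer. This sketch works on dictionaries shaped like `element.to_dict()` output, where metadata sits under a `metadata` key. The sample records and the `text_by_page` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical records shaped like element.to_dict() output.
records = [
    {"type": "Title", "text": "Intro", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "First page body.", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Second page body.", "metadata": {"page_number": 2}},
    {"type": "NarrativeText", "text": "No page info.", "metadata": {}},
]


def text_by_page(records):
    """Join element text per page; elements without a page land under None."""
    pages = defaultdict(list)
    for record in records:
        page = record.get("metadata", {}).get("page_number")
        pages[page].append(record["text"])
    return {page: "\n".join(texts) for page, texts in pages.items()}


pages = text_by_page(records)
for page, text in pages.items():
    print(f"Page {page}: {text!r}")
```

Formats without a page concept can yield no page number at all, which is why the helper keeps a `None` bucket instead of assuming an integer.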

Cleaning Functions

from unstructured.cleaners.core import (
    clean,
    clean_extra_whitespace,
    clean_non_ascii_chars,
    replace_unicode_quotes,
    group_broken_paragraphs,
)

text = "  Some   messy   text  with   extra  spaces  "
cleaned = clean_extra_whitespace(text)
# "Some messy text with extra spaces"

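# replace_unicode_quotes repairs mis-encoded quote bytes; it does not convert to ASCII quotes
text_with_bad_quotes = "\x93A lovely quote!\x94"
cleaned = replace_unicode_quotes(text_with_bad_quotes)
# “A lovely quote!”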
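When the target is strictly ASCII quotes (for instance, before exact-match deduplication), a small translation table is a dependency-free option. The `to_ascii_quotes` helper below is hypothetical and not part of Unstructured.

```python
# Map the four common curly quote characters to their ASCII counterparts.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def to_ascii_quotes(text: str) -> str:
    """Replace curly quotes with straight ASCII quotes."""
    return text.translate(QUOTE_MAP)


print(to_ascii_quotes("Here\u2019s a \u201cquote\u201d"))
```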

Troubleshooting

| Issue | Solution |
|---|---|
| libmagic not found | Install: apt install libmagic-dev or brew install libmagic |
| Tesseract not found | Install: apt install tesseract-ocr |
| Poor PDF extraction | Switch to strategy="hi_res", install detectron2 |
| Table structure missing | Set infer_table_structure=True |
| Slow processing | Use strategy="fast" or increase num_processes |
| Out of memory on large PDFs | Process pages in batches, reduce image extraction |
| OCR language errors | Install language pack: apt install tesseract-ocr-deu |
| Empty elements returned | Check file is not corrupted, try different strategy |

# Verify dependencies
python -c "from unstructured.partition.pdf import partition_pdf; print('PDF support OK')"
tesseract --version
pdftotext -v 2>&1 | head -1
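The same check can run from Python, which is handy at the top of a pipeline script. This sketch only looks for the executables on `PATH` with `shutil.which`. The tool list mirrors the commands above, and the `find_tools` helper is an assumption.

```python
import shutil

# Command-line tools that the PDF/OCR path relies on.
REQUIRED_TOOLS = ("tesseract", "pdftotext")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None when it is missing."""
    return {name: shutil.which(name) for name in names}


status = find_tools()
for name, path in status.items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

A `None` entry points to a missing system package rather than a Python dependency problem.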