Overview
Unstructured is an open-source toolkit for preprocessing and extracting content from documents for machine learning and LLM pipelines. It handles 25+ file formats including PDFs, Word documents, HTML, images, emails, and spreadsheets, extracting clean text, tables, images, and metadata. The library provides partition functions that break documents into typed elements (titles, narrative text, tables, list items) for downstream RAG and NLP applications.
The project offers both a Python library for local processing and a hosted API for production workloads. It supports several extraction strategies, from fast extraction of a document's embedded text layer to slower, higher-quality extraction that combines Tesseract OCR with layout detection models. Unstructured is commonly used as the ingestion layer for RAG systems, feeding clean, structured content to embedding models and vector databases.
Installation
# Basic install
pip install unstructured
# With PDF support
pip install "unstructured[pdf]"
# With all document types
pip install "unstructured[all-docs]"
# System dependencies for PDF/OCR
# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr poppler-utils libmagic-dev
# macOS
brew install tesseract poppler libmagic
Core Usage
Partition Documents
from unstructured.partition.auto import partition
# Auto-detect file type and partition
elements = partition(filename="document.pdf")
for element in elements:
    print(f"Type: {type(element).__name__}")
    print(f"Text: {element.text[:100]}")
    print(f"Metadata: {element.metadata}")
    print("---")
File-Specific Partitioners
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.html import partition_html
from unstructured.partition.docx import partition_docx
from unstructured.partition.pptx import partition_pptx
from unstructured.partition.email import partition_email
from unstructured.partition.csv import partition_csv
from unstructured.partition.md import partition_md
# PDF with high-quality extraction
elements = partition_pdf(
    filename="paper.pdf",
    strategy="hi_res",            # one of: fast, ocr_only, hi_res, auto
    infer_table_structure=True,   # Extract table structure
    extract_images_in_pdf=True,   # Extract embedded images (newer releases prefer extract_image_block_types)
    languages=["eng"],            # OCR language
)
# HTML
elements = partition_html(url="https://example.com/article")
# Word document
elements = partition_docx(filename="report.docx")
# PowerPoint
elements = partition_pptx(filename="presentation.pptx")
# Email (.eml)
elements = partition_email(filename="message.eml")
# Markdown
elements = partition_md(filename="README.md")
Element Types
| Element Type | Description |
|---|---|
| Title | Section headers and titles |
| NarrativeText | Body paragraphs |
| ListItem | Bullet or numbered list items |
| Table | Tabular data (with HTML structure) |
| Image | Extracted or referenced images |
| FigureCaption | Image/figure captions |
| Header | Page headers |
| Footer | Page footers |
| Address | Mailing/physical addresses |
| EmailAddress | Email addresses |
| PageBreak | Page break markers |
| Formula | Mathematical formulas |
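A quick way to see how a document breaks down is to tally its elements by category and keep only the types you need. The sketch below is a minimal illustration: it uses stand-in objects with a `category` attribute so that it runs without a document on disk, and the helper names `count_by_category` and `select` are hypothetical, not library functions.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in elements (hypothetical data); real ones come from partition().
elements = [
    SimpleNamespace(category="Title", text="Quarterly Report"),
    SimpleNamespace(category="NarrativeText", text="Revenue grew in Q3."),
    SimpleNamespace(category="ListItem", text="Expand to new markets"),
    SimpleNamespace(category="NarrativeText", text="Costs were flat."),
]


def count_by_category(elements):
    """Return a Counter mapping element category to occurrence count."""
    return Counter(el.category for el in elements)


def select(elements, *categories):
    """Keep only elements whose category is one of the given names."""
    wanted = set(categories)
    return [el for el in elements if el.category in wanted]


counts = count_by_category(elements)
body = select(elements, "NarrativeText", "ListItem")
print(counts.most_common())
print([el.text for el in body])
```

Filtering out `Header`, `Footer`, and `PageBreak` elements in the same way is a common first step before chunking.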
PDF Strategies
| Strategy | Speed | Quality | Requirements |
|---|---|---|---|
| fast | Fastest | Lower | pdfminer only |
| ocr_only | Slow | Good for scans | Tesseract |
| hi_res | Slowest | Best | Tesseract + detectron2/YOLOX |
| auto | Varies | Adaptive | All dependencies |
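If you prefer to pick a strategy explicitly rather than rely on `auto`, the trade-offs in the table can be written down as a small decision helper. This is only an illustrative heuristic under stated assumptions: `choose_pdf_strategy` is a hypothetical function, its two inputs are things you would have to determine yourself, and it is not how the library's `auto` mode decides.

```python
def choose_pdf_strategy(has_text_layer: bool, needs_tables: bool = False) -> str:
    """Pick a strategy name from the table above (illustrative heuristic only).

    has_text_layer: True if the PDF contains embedded, selectable text.
    needs_tables:   True if table structure must be recovered.
    """
    if needs_tables:
        # Table structure relies on layout detection, so use the slowest tier.
        return "hi_res"
    if not has_text_layer:
        # Scanned pages have no embedded text, so OCR is required.
        return "ocr_only"
    # Embedded text can be read directly, which is the cheapest option.
    return "fast"


print(choose_pdf_strategy(has_text_layer=True))
print(choose_pdf_strategy(has_text_layer=False))
print(choose_pdf_strategy(has_text_layer=True, needs_tables=True))
```

The returned string can be passed as the `strategy` argument of `partition_pdf`.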
Chunking
from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements
# Chunk by section titles
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1000,
    combine_text_under_n_chars=200,
    multipage_sections=True,
)
# Basic chunking
chunks = chunk_elements(
    elements,
    max_characters=1000,
    overlap=200,
)
for chunk in chunks:
    print(f"Chunk ({len(chunk.text)} chars): {chunk.text[:80]}...")
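To build intuition for how `max_characters` and `overlap` interact, here is a plain-Python sketch of fixed-size windows with overlap. It is a simplification and not the library's algorithm: the real chunkers work on elements and respect their boundaries, whereas the hypothetical `chunk_text` below slices a raw string by character position.

```python
def chunk_text(text: str, max_characters: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_characters characters.

    Each window starts (max_characters - overlap) characters after the
    previous one, so consecutive windows share `overlap` characters.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be in [0, max_characters)")

    step = max_characters - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij" * 3, max_characters=12, overlap=4)
for c in chunks:
    print(len(c), c)
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed and store.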
Staging and Output
from unstructured.staging.base import elements_to_json, elements_from_json
# Export to JSON
elements_to_json(elements, filename="output.json")
# Load from JSON
loaded = elements_from_json(filename="output.json")
# Convert to dictionaries
dicts = [el.to_dict() for el in elements]
# Convert to DataFrame
import pandas as pd
df = pd.DataFrame([el.to_dict() for el in elements])
print(df[["type", "text"]].head())
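For embedding pipelines it is often convenient to flatten each element dictionary into a single-level record and write one JSON object per line. The sketch below assumes records shaped like the output of `el.to_dict()` (`type`, `element_id`, `text`, and a nested `metadata` dict); the sample data and the helpers `to_records` and `to_jsonl` are hypothetical.

```python
import json

# Hypothetical records shaped like the output of el.to_dict().
dicts = [
    {"type": "Title", "element_id": "a1", "text": "Intro",
     "metadata": {"filename": "report.pdf", "page_number": 1}},
    {"type": "NarrativeText", "element_id": "b2", "text": "Body text.",
     "metadata": {"filename": "report.pdf", "page_number": 2}},
]


def to_records(dicts):
    """Flatten each element dict into a single-level record."""
    records = []
    for d in dicts:
        meta = d.get("metadata", {})
        records.append({
            "id": d.get("element_id"),
            "type": d.get("type"),
            "text": d.get("text", ""),
            "filename": meta.get("filename"),
            "page_number": meta.get("page_number"),
        })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


records = to_records(dicts)
jsonl = to_jsonl(records)
print(jsonl)
```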
Connectors (Ingest)
from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import ProcessorConfig, ReadConfig
from unstructured.ingest.runner import LocalRunner
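# Process a directory of documents
# Note: these imports match older unstructured releases; newer releases ship
# ingest as the separate unstructured-ingest package.
runner = LocalRunner(
    processor_config=ProcessorConfig(
        output_dir="./output",
        num_processes=4,
    ),
    read_config=ReadConfig(),
    connector_config=SimpleLocalConfig(
        input_path="./documents/",
        recursive=True,
    ),
)
runner.run()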
Cloud Source Connectors
# Process from S3 (use the s3 source connector, not local)
unstructured-ingest \
s3 \
--remote-url s3://bucket/documents/ \
--output-dir ./output \
--strategy hi_res \
--num-processes 4
# Process from Google Drive
unstructured-ingest \
google-drive \
--drive-id YOUR_DRIVE_ID \
--output-dir ./output \
--service-account-key service_account.json
# Process from Confluence
unstructured-ingest \
confluence \
--url https://your-org.atlassian.net \
--user-email user@example.com \
--api-token YOUR_TOKEN \
--output-dir ./output
Destination Connectors
# Ingest to Pinecone
unstructured-ingest \
local \
--input-path ./documents/ \
--output-dir ./output \
--strategy hi_res \
--embedding-provider openai \
--embedding-model text-embedding-3-small \
pinecone \
--api-key YOUR_PINECONE_KEY \
--index-name documents
# Ingest to Weaviate
unstructured-ingest \
local \
--input-path ./documents/ \
--output-dir ./output \
weaviate \
--host-url http://localhost:8080 \
--class-name Documents
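When ingest runs are launched from a script, building the command as an argument list avoids shell-quoting mistakes. The sketch below assembles the local-source command using only the flags shown above; `build_local_ingest_command` is a hypothetical helper, and the snippet prints the command instead of executing it.

```python
import shlex


def build_local_ingest_command(input_path, output_dir, strategy=None, num_processes=None):
    """Assemble the argument list for a local-source ingest run."""
    args = [
        "unstructured-ingest", "local",
        "--input-path", input_path,
        "--output-dir", output_dir,
    ]
    if strategy is not None:
        args += ["--strategy", strategy]
    if num_processes is not None:
        args += ["--num-processes", str(num_processes)]
    return args


cmd = build_local_ingest_command(
    "./documents/", "./output", strategy="hi_res", num_processes=4
)
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```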
Configuration
Environment Variables
export UNSTRUCTURED_API_KEY=your-api-key
export UNSTRUCTURED_API_URL=https://api.unstructured.io
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
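In application code, these variables can be read once at startup and passed around as a small settings object. The sketch below is one way to do that under stated assumptions: `ApiSettings` and `load_api_settings` are hypothetical names, and the default URL is simply the value exported above.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ApiSettings:
    api_key: Optional[str]
    api_url: str


def load_api_settings(env=None) -> ApiSettings:
    """Read API settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return ApiSettings(
        api_key=env.get("UNSTRUCTURED_API_KEY"),
        api_url=env.get("UNSTRUCTURED_API_URL", "https://api.unstructured.io"),
    )


# Passing a plain dict makes the function easy to exercise without
# touching the real environment.
settings = load_api_settings({"UNSTRUCTURED_API_KEY": "your-api-key"})
print(settings.api_url)
print(settings.api_key is not None)
```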
API Client
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
client = UnstructuredClient(api_key_auth="YOUR_API_KEY")
with open("document.pdf", "rb") as f:
    response = client.general.partition(
        request=shared.PartitionParameters(
            files=shared.Files(content=f.read(), file_name="document.pdf"),
            strategy=shared.Strategy.HI_RES,
            languages=["eng"],
            chunking_strategy="by_title",
            max_characters=1500,
        )
    )
elements = response.elements
Advanced Usage
from unstructured.partition.pdf import partition_pdf
# Extract tables together with their HTML structure
elements = partition_pdf(filename="tables.pdf", strategy="hi_res", infer_table_structure=True)
tables = [el for el in elements if el.category == "Table"]
for table in tables:
    print(table.metadata.text_as_html)  # HTML table structure
    print(table.text)                   # Plain text
# Inspect the metadata attached to each element
for element in elements:
    meta = element.metadata
    print(f"Filename: {meta.filename}")
    print(f"Page: {meta.page_number}")
    print(f"Coordinates: {meta.coordinates}")
    print(f"Languages: {meta.languages}")
    print(f"File type: {meta.filetype}")
    print(f"Parent ID: {meta.parent_id}")
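The HTML in `text_as_html` is easier to use downstream once it is converted to rows of cell text. The sketch below does this with the standard library's `html.parser`; the `TableRows` class, the `html_table_to_rows` helper, and the sample markup are all hypothetical, and real table HTML may contain extra tags or attributes that this parser simply ignores.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect cell text from <td>/<th> tags, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html: str):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Bolt</td><td>12</td></tr></table>"
)
rows = html_table_to_rows(sample)
print(rows)
```

The resulting rows can be passed to `csv.writer` or to `pandas.DataFrame(rows[1:], columns=rows[0])` when the first row is a header.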
Cleaning Functions
from unstructured.cleaners.core import (
    clean,
    clean_extra_whitespace,
    clean_non_ascii_chars,
    replace_unicode_quotes,
    group_broken_paragraphs,
)
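text = "  Some   messy text  with extra   spaces  "
cleaned = clean_extra_whitespace(text)
# "Some messy text with extra spaces"
# replace_unicode_quotes repairs mis-encoded quote characters (e.g. \x93, \x94)
# into proper Unicode quotes; it does not convert curly quotes to straight ones
text_with_unicode = "\x93A lovely quote!\x94"
cleaned = replace_unicode_quotes(text_with_unicode)
# "“A lovely quote!”"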
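Because cleaners are plain functions from string to string, several of them can be chained into one pipeline. The sketch below shows the composition pattern with two stand-in cleaners so that it runs without the library installed; `collapse_whitespace`, `straighten_quotes`, and `apply_cleaners` are hypothetical helpers, and the library's own cleaners could be passed in the same way.

```python
import re
from functools import reduce


def collapse_whitespace(text: str) -> str:
    """Stand-in cleaner: collapse whitespace runs and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def straighten_quotes(text: str) -> str:
    """Stand-in cleaner: map curly quotes to their ASCII equivalents."""
    table = str.maketrans({
        "\u2018": "'", "\u2019": "'",
        "\u201c": '"', "\u201d": '"',
    })
    return text.translate(table)


def apply_cleaners(text: str, *cleaners) -> str:
    """Run text through each cleaner in order, left to right."""
    return reduce(lambda acc, fn: fn(acc), cleaners, text)


result = apply_cleaners(
    "  \u201cHello\u201d   world  ",
    collapse_whitespace,
    straighten_quotes,
)
print(result)
```

Order matters when cleaners interact, so it helps to keep the pipeline as an explicit, ordered argument list.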
Troubleshooting
| Issue | Solution |
|---|---|
| libmagic not found | Install: apt install libmagic-dev or brew install libmagic |
| Tesseract not found | Install: apt install tesseract-ocr |
| Poor PDF extraction | Switch to strategy="hi_res", install detectron2 |
| Table structure missing | Set infer_table_structure=True |
| Slow processing | Use strategy="fast" or increase num_processes |
| Out of memory on large PDFs | Process pages in batches, reduce image extraction |
| OCR language errors | Install language pack: apt install tesseract-ocr-deu |
| Empty elements returned | Check file is not corrupted, try different strategy |
# Verify dependencies
python -c "from unstructured.partition.pdf import partition_pdf; print('PDF support OK')"
tesseract --version
pdftotext -v 2>&1 | head -1
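The same checks can be run from Python, which is handy inside a container health check or a test suite. The sketch below only tests whether the package and the two command-line tools are discoverable; it does not verify versions or language packs, and `check_dependencies` is a hypothetical helper.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which optional tools are discoverable on this machine."""
    return {
        # Importable Python package
        "unstructured": importlib.util.find_spec("unstructured") is not None,
        # Executables looked up on PATH
        "tesseract": shutil.which("tesseract") is not None,
        "pdftotext": shutil.which("pdftotext") is not None,
    }


status = check_dependencies()
for name, found in status.items():
    print(f"{name}: {'OK' if found else 'missing'}")
```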