MarkItDown Cheat Sheet
Overview
MarkItDown is a lightweight Python utility by Microsoft that converts various file formats to Markdown text. It supports Office documents (Word, Excel, PowerPoint), PDFs, images (with optional LLM-based descriptions), audio (with speech-to-text), HTML, and other formats. The tool is designed for preprocessing documents for LLM consumption, RAG pipelines, and content indexing.
Unlike heavy document parsing libraries, MarkItDown focuses on simplicity and speed, producing clean Markdown output with minimal dependencies. It handles the most common business document formats and can be used as a Python library or command-line tool.
Installation
pip install markitdown
# With optional dependencies
pip install "markitdown[all]"
# With image description support (needs an LLM client such as the openai package)
pip install markitdown openai
# With audio transcription
pip install "markitdown[audio-transcription]"
Core Usage
Command Line
# Convert file to Markdown
markitdown document.docx > output.md
# Convert PDF
markitdown report.pdf > report.md
# Convert PowerPoint
markitdown presentation.pptx > slides.md
# Convert Excel
markitdown data.xlsx > data.md
# Convert HTML
markitdown page.html > page.md
# Convert from URL
markitdown https://example.com/article > article.md
# Convert image (basic)
markitdown screenshot.png > description.md
# Pipe input
cat document.docx | markitdown > output.md
Python API
from markitdown import MarkItDown
md = MarkItDown()
# Convert file
result = md.convert("report.pdf")
print(result.text_content)
# Convert Word document
result = md.convert("proposal.docx")
print(result.text_content)
# Convert Excel
result = md.convert("data.xlsx")
print(result.text_content)
# Convert PowerPoint
result = md.convert("deck.pptx")
print(result.text_content)
# Convert HTML
result = md.convert("page.html")
print(result.text_content)
# Convert from URL
result = md.convert("https://example.com/article")
print(result.text_content)
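Every call above follows the same pattern: convert, then read `text_content`. A minimal sketch of a helper that also saves the result as UTF-8 could look like this. The name `save_markdown` is hypothetical, and the converter is passed in as an argument, assuming only that it behaves like the `MarkItDown` instance shown above.

```python
from pathlib import Path


def save_markdown(converter, source, dest):
    """Convert `source` and write the Markdown to `dest` as UTF-8.

    `converter` is assumed to behave like a MarkItDown instance:
    converter.convert(path) returns an object with `.text_content`.
    """
    result = converter.convert(str(source))
    dest = Path(dest)
    # Create the output directory on demand
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(result.text_content, encoding="utf-8")
    return dest
```

With the library installed, a call such as `save_markdown(MarkItDown(), "report.pdf", "out/report.md")` would be the intended use.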
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Text extraction (no OCR) |
| Word | .docx | Headings, lists, and tables preserved |
| Excel | .xlsx, .xls | Tables to Markdown tables |
| PowerPoint | .pptx | Slides with speaker notes |
| HTML | .html, .htm | Clean text extraction |
| CSV | .csv | Converted to Markdown table |
| JSON | .json | Formatted output |
| XML | .xml | Text extraction |
| Images | .jpg, .png, .gif | Optional LLM description |
| Audio | .mp3, .wav | Speech-to-text transcription |
| ZIP | .zip | Processes contained files |
| Plain text | .txt, .md, .rst | Pass-through or conversion |
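When filtering files before conversion, the table above can be turned into a simple extension check. This is a sketch only: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and the set is copied from the table rather than queried from the library, so it may not match what a given installation can actually handle.

```python
from pathlib import Path

# Extensions copied from the table above; adjust to match your installed extras.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".xls", ".pptx", ".html", ".htm",
    ".csv", ".json", ".xml", ".jpg", ".png", ".gif",
    ".mp3", ".wav", ".zip", ".txt", ".md", ".rst",
}


def is_supported(path):
    """Return True if the file's extension appears in the table above."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

The comparison is case-insensitive, so `Report.PDF` and `report.pdf` are treated the same.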
LLM-Powered Features
Image Description
from markitdown import MarkItDown
from openai import OpenAI
# With OpenAI for image descriptions
client = OpenAI(api_key="sk-...")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
# Convert image with AI description
result = md.convert("diagram.png")
print(result.text_content)
# Output: "A flowchart showing the data pipeline architecture..."
Audio Transcription
from markitdown import MarkItDown
md = MarkItDown()
# Transcribe audio file
result = md.convert("meeting.mp3")
print(result.text_content)
# Output: Transcribed text from the audio
Output Examples
Word Document Output
# Project Proposal
## Executive Summary
This document outlines the proposed system architecture...
## Requirements
1. High availability (99.9% uptime)
2. Sub-100ms response latency
3. Support for 10,000 concurrent users
### Technical Requirements
| Component | Specification | Priority |
|-----------|--------------|----------|
| Database | PostgreSQL 16 | High |
| Cache | Redis 7.x | High |
| Queue | RabbitMQ | Medium |
Excel Output
## Sheet1
| Name | Department | Salary | Start Date |
|------|-----------|--------|------------|
| Alice | Engineering | 120000 | 2023-01-15 |
| Bob | Marketing | 95000 | 2022-06-01 |
| Carol | Engineering | 115000 | 2023-03-20 |
## Sheet2
| Quarter | Revenue | Expenses | Profit |
|---------|---------|----------|--------|
| Q1 | 500000 | 350000 | 150000 |
| Q2 | 620000 | 380000 | 240000 |
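If the per-sheet tables are needed as data again, output shaped like the sample above can be split on its `## ` headings and parsed row by row. The following is a rough sketch with hypothetical helper names (`split_sheets`, `parse_table`); it assumes the layout shown in this section and does not handle cells that contain escaped pipe characters.

```python
import re


def split_sheets(markdown):
    """Map each '## <sheet name>' heading to the text beneath it."""
    sheets = {}
    current = None
    for line in markdown.splitlines():
        match = re.match(r"^## (.+)$", line)
        if match:
            current = match.group(1).strip()
            sheets[current] = []
        elif current is not None:
            sheets[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sheets.items()}


def parse_table(table_text):
    """Turn a Markdown pipe table into a list of row dicts keyed by header."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table_text.splitlines()
        if line.strip().startswith("|")
    ]
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator
    return [dict(zip(header, row)) for row in body]
```

All values come back as strings; converting salaries or dates to typed values is left to the caller.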
PowerPoint Output
# Slide 1: Company Overview
Founded in 2020, we serve 50+ enterprise clients.
*Speaker Notes: Emphasize growth trajectory.*
---
# Slide 2: Revenue Growth
- Q1: $2.1M
- Q2: $3.4M (+62%)
- Q3: $4.8M (+41%)
*Speaker Notes: Highlight Q2 as inflection point.*
Configuration
Custom Converter Options
from markitdown import MarkItDown
# With custom settings
md = MarkItDown(
    llm_client=None,  # Optional LLM client
    llm_model=None,   # LLM model name
)
# Convert with options
result = md.convert("document.pdf")
# Access metadata
print(f"Title: {result.title}")
print(f"Content length: {len(result.text_content)}")
Environment Variables
# For OpenAI image descriptions
export OPENAI_API_KEY=sk-...
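# Azure Speech credentials, only if your own transcription code uses them
# (MarkItDown's bundled audio converter relies on the SpeechRecognition package instead)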
export SPEECH_KEY=your-azure-key
export SPEECH_REGION=eastus
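A missing variable is easier to diagnose when it fails early with a clear message. Below is a small sketch of such a check; `require_env` is a hypothetical helper, not part of MarkItDown or the OpenAI SDK.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It could be used as `OpenAI(api_key=require_env("OPENAI_API_KEY"))` before passing the client to `MarkItDown`.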
Advanced Usage
Batch Processing
import os
import glob
from markitdown import MarkItDown
md = MarkItDown()
input_dir = "./documents/"
output_dir = "./markdown/"
os.makedirs(output_dir, exist_ok=True)
supported = ["*.pdf", "*.docx", "*.pptx", "*.xlsx", "*.html"]
files = []
for pattern in supported:
    files.extend(glob.glob(os.path.join(input_dir, pattern)))
for file_path in files:
    name = os.path.splitext(os.path.basename(file_path))[0]
    try:
        result = md.convert(file_path)
        # Write as UTF-8 so non-ASCII text survives on every platform
        with open(os.path.join(output_dir, f"{name}.md"), "w", encoding="utf-8") as f:
            f.write(result.text_content)
        print(f"Converted: {file_path}")
    except Exception as e:
        print(f"Failed: {file_path} - {e}")
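One caveat with the loop above: `report.pdf` and `report.docx` both map to `report.md`, so the later file overwrites the earlier one. The sketch below plans output paths up front and keeps the original extension in the name to avoid that. `plan_outputs` is a hypothetical helper, and the recursive search and the `name.ext.md` naming scheme are design choices made for this example, not library behavior.

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir, extensions):
    """Pair each matching input file with a unique .md output path.

    Files sharing a stem (report.pdf, report.docx) would otherwise both
    map to report.md, so the original extension is kept in the name.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    wanted = {ext.lower() for ext in extensions}
    plan = []
    for src in sorted(input_dir.rglob("*")):
        if src.is_file() and src.suffix.lower() in wanted:
            relative = src.relative_to(input_dir)
            # report.pdf -> report.pdf.md, preserving subdirectories
            dest = output_dir / relative.parent / (relative.name + ".md")
            plan.append((src, dest))
    return plan
```

Each `(src, dest)` pair can then be fed to the conversion step, creating `dest.parent` before writing.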
Integration with RAG Pipeline
from markitdown import MarkItDown
# Newer LangChain releases ship this class in the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter
md = MarkItDown()
# Convert document
result = md.convert("manual.docx")
# Chunk for RAG
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "]
)
chunks = splitter.split_text(result.text_content)
# Ready for embedding and vector store indexing
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")
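For setups that cannot take a LangChain dependency, the same heading-first idea can be approximated with the standard library. This is a simplified stand-in, not a reimplementation of `RecursiveCharacterTextSplitter`: the hypothetical `chunk_by_heading` splits at `##` and `###` headings, then packs paragraphs up to `max_chars`, and it does not produce overlapping chunks. A single paragraph longer than `max_chars` is kept whole.

```python
import re


def chunk_by_heading(markdown, max_chars=1000):
    """Split Markdown at '##'/'###' headings, then cap each piece at max_chars."""
    # Zero-width split: each section keeps its own heading line.
    sections = re.split(r"(?m)^(?=#{2,3} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Oversized sections are cut on paragraph boundaries where possible.
        current = ""
        for paragraph in section.split("\n\n"):
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if len(candidate) <= max_chars or not current:
                current = candidate
            else:
                chunks.append(current)
                current = paragraph
        chunks.append(current)
    return chunks
```

Keeping the heading inside each chunk gives the embedding some context about which section the text came from.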
ZIP File Processing
from markitdown import MarkItDown
md = MarkItDown()
# Process ZIP containing mixed documents
result = md.convert("project_docs.zip")
print(result.text_content)
# Output: concatenated Markdown from all supported files in the ZIP
Custom File Handler
from markitdown import MarkItDown
md = MarkItDown()
# Process from stream
with open("document.docx", "rb") as f:
    result = md.convert_stream(f, file_extension=".docx")
print(result.text_content)
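The same stream call works for data that never touches disk, such as an upload or an HTTP response body. A minimal sketch, assuming the converter accepts `convert_stream(stream, file_extension=...)` as in the example above (newer releases may prefer a different way to pass the format hint); `convert_bytes` is a hypothetical wrapper.

```python
import io


def convert_bytes(converter, data, extension):
    """Convert raw bytes by wrapping them in a seekable binary stream.

    `converter` is assumed to expose convert_stream(stream, file_extension=...)
    as in the example above.
    """
    # Accept both "docx" and ".docx"
    if not extension.startswith("."):
        extension = "." + extension
    stream = io.BytesIO(data)
    return converter.convert_stream(stream, file_extension=extension)
```

The extension matters because a bare byte stream carries no filename for format detection.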
Troubleshooting
| Issue | Solution |
|---|---|
| PDF returns empty text | PDF may be image-based; use an OCR tool such as Marker |
| Excel formatting lost | Complex formatting is simplified to Markdown tables |
| Image returns no description | Set up LLM client for image descriptions |
| Audio transcription fails | Install audio dependencies: pip install "markitdown[audio-transcription]" |
| Unicode errors | Check file encoding, try opening with explicit encoding |
| Large file slow | Process pages/sheets selectively if possible |
| URL fetch fails | Check network connectivity, URL accessibility |
| ZIP nested archives | Only top-level files in ZIP are processed |
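The first row of the table can be checked automatically: if a PDF converts to almost no visible text, it is probably scanned and needs OCR. The helper below is a heuristic sketch; `looks_image_based` and its `min_chars` threshold of 20 are assumptions to tune for your own documents.

```python
def looks_image_based(text, min_chars=20):
    """Heuristic: True when converted text has too few visible characters."""
    # Drop all whitespace so blank lines and padding do not count as content
    visible = "".join(text.split())
    return len(visible) < min_chars
```

A batch job could route files flagged this way to an OCR tool instead of writing an empty Markdown file.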
# Verify installation
python -c "from markitdown import MarkItDown; print('MarkItDown installed')"
# Test conversion
echo "Testing MarkItDown"
markitdown test.pdf | head -20
# Confirm a converter instance can be created
python -c "from markitdown import MarkItDown; md = MarkItDown(); print('Ready')"