
MarkItDown Cheat Sheet

Overview

MarkItDown is a lightweight Python utility by Microsoft that converts various file formats to Markdown text. It supports Office documents (Word, Excel, PowerPoint), PDFs, images (with optional LLM-based descriptions), audio (with speech-to-text), HTML, and other formats. The tool is designed for preprocessing documents for LLM consumption, RAG pipelines, and content indexing.

Unlike heavy document parsing libraries, MarkItDown focuses on simplicity and speed, producing clean Markdown output with minimal dependencies. It handles the most common business document formats and can be used as a Python library or command-line tool.

Installation

pip install markitdown

# With all optional dependencies
pip install "markitdown[all]"

# With image description support (uses an OpenAI-compatible client)
pip install markitdown openai

# With audio transcription
pip install "markitdown[audio-transcription]"

Core Usage

Command Line

# Convert file to Markdown
markitdown document.docx > output.md

# Convert PDF
markitdown report.pdf > report.md

# Convert PowerPoint
markitdown presentation.pptx > slides.md

# Convert Excel
markitdown data.xlsx > data.md

# Convert HTML
markitdown page.html > page.md

# Convert from URL
markitdown https://example.com/article > article.md

# Convert image (basic)
markitdown screenshot.png > description.md

# Pipe input
cat document.docx | markitdown > output.md

Python API

from markitdown import MarkItDown

md = MarkItDown()

# Convert file
result = md.convert("report.pdf")
print(result.text_content)

# Convert Word document
result = md.convert("proposal.docx")
print(result.text_content)

# Convert Excel
result = md.convert("data.xlsx")
print(result.text_content)

# Convert PowerPoint
result = md.convert("deck.pptx")
print(result.text_content)

# Convert HTML
result = md.convert("page.html")
print(result.text_content)

# Convert from URL
result = md.convert("https://example.com/article")
print(result.text_content)

Supported Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| PDF | .pdf | Text extraction (no OCR) |
| Word | .docx | Structure preserved (headings, lists, tables) |
| Excel | .xlsx, .xls | Tables to Markdown tables |
| PowerPoint | .pptx | Slides with speaker notes |
| HTML | .html, .htm | Clean text extraction |
| CSV | .csv | Converted to Markdown table |
| JSON | .json | Formatted output |
| XML | .xml | Text extraction |
| Images | .jpg, .png, .gif | Optional LLM description |
| Audio | .mp3, .wav | Speech-to-text transcription |
| ZIP | .zip | Processes contained files |
| Plain text | .txt, .md, .rst | Pass-through or conversion |
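Unsupported inputs are easiest to catch before conversion starts. The sketch below is a hypothetical pre-filter, not part of MarkItDown: the helper names are made up for illustration, and the extension set is simply copied from the table above, so it may not match what a given installed version actually accepts.

```python
from pathlib import Path

# Extensions copied from the table above. This is an assumption about what
# the installed MarkItDown version handles, not a list queried from the library.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".xls", ".pptx", ".html", ".htm",
    ".csv", ".json", ".xml", ".jpg", ".png", ".gif",
    ".mp3", ".wav", ".zip", ".txt", ".md", ".rst",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension is in the set above (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def partition_by_support(paths):
    """Split paths into (supported, unsupported) lists, preserving order."""
    supported, unsupported = [], []
    for p in paths:
        (supported if is_supported(p) else unsupported).append(p)
    return supported, unsupported
```

Only the last suffix is checked, so a name like archive.tar.gz is judged by .gz alone.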

LLM-Powered Features

Image Description

from markitdown import MarkItDown
from openai import OpenAI

# With OpenAI for image descriptions
client = OpenAI(api_key="sk-...")
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Convert image with AI description
result = md.convert("diagram.png")
print(result.text_content)
# Output: "A flowchart showing the data pipeline architecture..."

Audio Transcription

from markitdown import MarkItDown

md = MarkItDown()

# Transcribe audio file
result = md.convert("meeting.mp3")
print(result.text_content)
# Output: Transcribed text from the audio

Output Examples

Word Document Output

# Project Proposal

## Executive Summary

This document outlines the proposed system architecture...

## Requirements

1. High availability (99.9% uptime)
2. Sub-100ms response latency
3. Support for 10,000 concurrent users

### Technical Requirements

| Component | Specification | Priority |
|-----------|--------------|----------|
| Database | PostgreSQL 16 | High |
| Cache | Redis 7.x | High |
| Queue | RabbitMQ | Medium |

Excel Output

## Sheet1

| Name | Department | Salary | Start Date |
|------|-----------|--------|------------|
| Alice | Engineering | 120000 | 2023-01-15 |
| Bob | Marketing | 95000 | 2022-06-01 |
| Carol | Engineering | 115000 | 2023-03-20 |

## Sheet2

| Quarter | Revenue | Expenses | Profit |
|---------|---------|----------|--------|
| Q1 | 500000 | 350000 | 150000 |
| Q2 | 620000 | 380000 | 240000 |
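If the workbook output follows the layout shown above (one "## SheetName" heading per sheet, followed by a pipe table), it can be split back into per-sheet rows with plain string handling. This is a minimal sketch under that assumption; split_sheets and table_rows are hypothetical helper names, and cells containing escaped pipe characters are not handled.

```python
import re


def split_sheets(markdown: str) -> dict[str, str]:
    """Map each '## <sheet name>' heading to the Markdown that follows it."""
    sheets: dict[str, str] = {}
    current = None
    lines: list[str] = []
    for line in markdown.splitlines():
        match = re.match(r"^## (.+)$", line)
        if match:
            if current is not None:
                sheets[current] = "\n".join(lines).strip()
            current = match.group(1).strip()
            lines = []
        elif current is not None:
            lines.append(line)
    if current is not None:
        sheets[current] = "\n".join(lines).strip()
    return sheets


def table_rows(table_md: str) -> list[dict[str, str]]:
    """Parse a simple pipe table into a list of dicts keyed by header cell."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table_md.splitlines()
        if line.strip().startswith("|")
    ]
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator line
    return [dict(zip(header, row)) for row in body]
```

All values come back as strings; converting salaries or dates to typed values is left to the caller.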

PowerPoint Output

# Slide 1: Company Overview

Founded in 2020, we serve 50+ enterprise clients.

*Speaker Notes: Emphasize growth trajectory.*

---

# Slide 2: Revenue Growth

- Q1: $2.1M
- Q2: $3.4M (+62%)
- Q3: $4.8M (+41%)

*Speaker Notes: Highlight Q2 as inflection point.*

Configuration

Custom Converter Options

from markitdown import MarkItDown

# With custom settings
md = MarkItDown(
    llm_client=None,          # Optional LLM client
    llm_model=None,           # LLM model name
)

# Convert a document
result = md.convert("document.pdf")

# Access metadata (title may be None when the source has no title)
print(f"Title: {result.title}")
print(f"Content length: {len(result.text_content)}")

Environment Variables

# For OpenAI image descriptions
export OPENAI_API_KEY=sk-...

# For audio transcription
export SPEECH_KEY=your-azure-key
export SPEECH_REGION=eastus
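A missing key usually surfaces as an opaque error deep inside a client library, so it can help to check the environment up front. The helper below is a small sketch; require_env is a hypothetical name, not a MarkItDown or OpenAI function.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

A typical use would be passing require_env("OPENAI_API_KEY") as the api_key argument when constructing the OpenAI client.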

Advanced Usage

Batch Processing

from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()

input_dir = Path("./documents")
output_dir = Path("./markdown")
output_dir.mkdir(parents=True, exist_ok=True)

# File patterns to convert (top level of input_dir only)
supported = ["*.pdf", "*.docx", "*.pptx", "*.xlsx", "*.html"]
files = []
for pattern in supported:
    files.extend(sorted(input_dir.glob(pattern)))

for file_path in files:
    # Keep the source extension in the output name so that
    # report.pdf and report.docx do not overwrite each other
    out_path = output_dir / f"{file_path.name}.md"
    try:
        result = md.convert(str(file_path))
        # Write UTF-8 explicitly; the platform default may not cover all characters
        out_path.write_text(result.text_content, encoding="utf-8")
        print(f"Converted: {file_path}")
    except Exception as e:
        print(f"Failed: {file_path} - {e}")

Integration with RAG Pipeline

from markitdown import MarkItDown
from langchain_text_splitters import RecursiveCharacterTextSplitter  # pip install langchain-text-splitters

md = MarkItDown()

# Convert document
result = md.convert("manual.docx")

# Chunk for RAG
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "]
)
chunks = splitter.split_text(result.text_content)

# Ready for embedding and vector store indexing
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")
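Character-based chunks lose track of which section they came from. As a dependency-free alternative sketch (not the LangChain approach used above), the converted Markdown can first be split on its headings so that each piece keeps its title as metadata. split_by_headings is a hypothetical helper; it does not detect fenced code blocks, so a line starting with "# " inside code would be treated as a heading.

```python
import re


def split_by_headings(markdown: str, max_level: int = 3) -> list[dict[str, str]]:
    """Split Markdown into sections, one per heading of level <= max_level."""
    heading = re.compile(rf"^(#{{1,{max_level}}}) +(.*)$")
    sections: list[dict[str, str]] = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = heading.match(line)
        if match:
            # Flush the previous section before starting a new one
            if title or any(l.strip() for l in lines):
                sections.append({"title": title, "text": "\n".join(lines).strip()})
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if title or any(l.strip() for l in lines):
        sections.append({"title": title, "text": "\n".join(lines).strip()})
    return sections
```

Each section's text can then be passed to a size-based splitter, with the title stored alongside the resulting chunks.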

ZIP File Processing

from markitdown import MarkItDown

md = MarkItDown()

# Process ZIP containing mixed documents
result = md.convert("project_docs.zip")
print(result.text_content)
# Output: concatenated Markdown from all supported files in the ZIP

Converting from a Stream

from markitdown import MarkItDown

md = MarkItDown()

# Process from stream
with open("document.docx", "rb") as f:
    result = md.convert_stream(f, file_extension=".docx")
    print(result.text_content)
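The same call can be used when the document is already in memory, for example after an upload or a download. The sketch below wraps raw bytes in io.BytesIO; convert_bytes is a hypothetical helper, and it assumes the converter object accepts the file_extension keyword exactly as in the stream example above.

```python
import io


def convert_bytes(converter, data: bytes, extension: str) -> str:
    """Wrap raw bytes in a binary stream and return the converted Markdown text.

    `converter` is expected to be a MarkItDown instance (an assumption;
    any object with a compatible convert_stream method will work).
    """
    stream = io.BytesIO(data)
    result = converter.convert_stream(stream, file_extension=extension)
    return result.text_content
```

With a real instance this would be called as convert_bytes(MarkItDown(), payload, ".docx"), where payload holds the file's bytes.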

Troubleshooting

| Issue | Solution |
|-------|----------|
| PDF returns empty text | The PDF may be image-based; use an OCR tool such as Marker |
| Excel formatting lost | Complex formatting is simplified to Markdown tables |
| Image returns no description | Set up an LLM client for image descriptions |
| Audio transcription fails | Install audio dependencies: pip install "markitdown[audio-transcription]" |
| Unicode errors | Check the file encoding; try opening with an explicit encoding |
| Large file is slow | Process pages/sheets selectively if possible |
| URL fetch fails | Check network connectivity and that the URL is accessible |
| Nested archives in ZIP | Only top-level files in the ZIP are processed |

# Verify installation
python -c "from markitdown import MarkItDown; print('MarkItDown installed')"

# Test conversion
echo "Testing MarkItDown"
markitdown test.pdf | head -20

# Confirm that a converter instance can be created
python -c "from markitdown import MarkItDown; md = MarkItDown(); print('Ready')"
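The "PDF returns empty text" case from the table above can also be detected programmatically, so that scanned documents are routed to an OCR tool instead of being silently indexed as blank. This is a heuristic sketch only: looks_empty is a hypothetical helper, and the 20-character threshold is an arbitrary assumption to tune for your documents.

```python
def looks_empty(text_content: str, min_chars: int = 20) -> bool:
    """Heuristic: True if the text has fewer than min_chars non-whitespace characters."""
    visible = "".join(text_content.split())
    return len(visible) < min_chars
```

It would be applied to result.text_content right after conversion.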