pdf-parser

Overview

pdf-parser is a command-line Python tool by Didier Stevens for analyzing PDF structure at the object level. It parses the file without rendering it and displays objects, dictionaries, streams, filters, and references between objects. Security researchers, forensic analysts, and penetration testers use it to map a document's structure, locate embedded objects, extract suspicious code, and examine malformed or malicious PDFs.

Capabilities:

Extract individual PDF objects
Parse stream contents and encodings
Display object relationships
Analyze document structure
Identify embedded files and code
Decode various PDF encodings
Search for specific patterns
Report document statistics

Installation

Linux/Debian

# Install from the distribution repository (packaged on Kali Linux; other distributions may not carry it)
sudo apt-get install pdf-parser

# Or standalone
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite
# Copy pdf-parser.py to PATH

macOS

# Using Homebrew, if a pdf-parser formula is available to you (otherwise install from source)
brew install pdf-parser

# Or from source
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite
chmod +x pdf-parser.py
sudo cp pdf-parser.py /usr/local/bin/

Windows

# Download from DidierStevens suite
https://github.com/DidierStevens/DidierStevensSuite

# Requires Python 3
python pdf-parser.py document.pdf

Python Installation

# Direct Python execution
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite

# Run directly
python pdf-parser.py document.pdf

Basic Usage

Quick Start

# Basic PDF structure dump
pdf-parser.py document.pdf

# Show first 50 lines
pdf-parser.py document.pdf | head -50

# Show document statistics (object counts by type)
pdf-parser.py -a document.pdf | head -100

# Verbose output (also displays malformed PDF elements)
pdf-parser.py -v document.pdf

Essential Commands

Command	Purpose
`pdf-parser.py file.pdf`	Display all objects
`pdf-parser.py -o 5 file.pdf`	Extract object 5
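`pdf-parser.py -a file.pdf`	Show document statistics
`pdf-parser.py -r 5 file.pdf`	List objects that reference object 5
`pdf-parser.py -g file.pdf`	Generate a Python program that recreates the parsed PDF
`pdf-parser.py -s keyword file.pdf`	Search objects (not stream contents) for a keyword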

Object Extraction

Extract Specific Objects

# Extract object by ID
pdf-parser.py -o 5 document.pdf

# Extract multiple objects
pdf-parser.py -o 5,7,10 document.pdf

# Also parse objects stored inside /ObjStm object streams
pdf-parser.py -O document.pdf | grep -A 50 "^obj 5 "

# Extract with the stream decoded through its filters
pdf-parser.py -o 5 -f document.pdf

Object Structure Analysis

# Display object relationships
pdf-parser.py document.pdf | grep -E "obj|stream|endobj"

# Show all object IDs
pdf-parser.py document.pdf | grep "^obj" | awk '{print $2}'

# Count total objects
pdf-parser.py document.pdf | grep -c "^obj"

# Find objects by type
pdf-parser.py -t /Font document.pdf
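
The ID-listing one-liners above can be wrapped in a reusable filter. This is a minimal sketch: the function name `list_object_ids` is hypothetical, and it assumes object header lines of the form `obj <id> <generation>`, as shown in this document's output examples.

```shell
# list_object_ids: read pdf-parser.py text output on stdin and print
# "<id> <generation>" for each indirect object header line.
# Assumption: header lines look like "obj 5 0".
list_object_ids() {
    awk '/^obj [0-9]+ [0-9]+/ { print $2, $3 }'
}

# Usage (not run here):
#   pdf-parser.py document.pdf | list_object_ids | sort -n | uniq
```

Printing the generation number next to the ID helps spot files that define the same object more than once, which can happen after incremental updates.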

Stream Analysis

Extract and Decode Streams

# Show the object; stream contents are not printed by default
pdf-parser.py -o 10 document.pdf

# Show the stream decoded through its filters
pdf-parser.py -o 10 -f document.pdf

# Show stream filters/encoding
pdf-parser.py -o 10 document.pdf | grep -i "filter\|decode"

# Dump the raw (still encoded) stream data to a file
pdf-parser.py -o 10 -d stream10.bin document.pdf

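Common Stream Filters and Object Types

Filter or Type	Description
FlateDecode	Zlib/deflate compression
ASCII85Decode	ASCII base-85 encoding
ASCIIHexDecode	Hexadecimal encoding
LZWDecode	LZW compression
(no /Filter)	Stream stored without encoding
XObject	Object type for images and reusable graphics (not a filter)
Font	Object type for fonts (not a filter)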

Decode Streams

# Save the raw (still compressed) stream to a file
pdf-parser.py -o 15 -d object15.bin document.pdf

# Decompress a FlateDecode stream and save the result
pdf-parser.py -o 15 -f -d object15.decoded document.pdf

# Decode ASCII85 (-f applies whichever supported filters the stream declares)
pdf-parser.py -o 20 -f document.pdf
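
When a raw FlateDecode stream has already been dumped to disk (for example with pdf-parser.py's `-d` option), it can also be inflated outside the tool. The sketch below assumes `python3` is on the PATH and that the input holds only the raw stream bytes, not pdf-parser's text output; the function name `inflate_stream` is hypothetical.

```shell
# inflate_stream: zlib-decompress raw FlateDecode bytes from stdin to stdout.
# Assumptions: python3 is available; stdin is the raw stream body only.
inflate_stream() {
    python3 -c 'import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))'
}

# Usage (not run here):
#   pdf-parser.py -o 15 -d object15.bin document.pdf
#   inflate_stream < object15.bin > object15.decoded
```

Reading and writing through the binary buffers avoids the text-decoding errors that occur when compressed bytes are treated as a string.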

Search and Pattern Matching

Search for Keywords

# Search for suspicious keywords
pdf-parser.py -s /JavaScript document.pdf

# Search for launch objects
pdf-parser.py -s /Launch document.pdf

# Multiple keyword search
pdf-parser.py -s /AA document.pdf
pdf-parser.py -s /OpenAction document.pdf

# -s matches case-insensitively and does not look inside stream contents
pdf-parser.py -s javascript document.pdf

Pattern Detection

# Find all embedded files
pdf-parser.py document.pdf | grep -i "embeddedfile\|filespec"

# Identify JavaScript code
pdf-parser.py -s /JavaScript document.pdf

# Locate form objects
pdf-parser.py -s /AcroForm document.pdf

# Find encryption references
pdf-parser.py -s /Encrypt document.pdf

Grep-Based Analysis

# Extract all streams containing "http"
pdf-parser.py document.pdf | grep -B 5 "http"

# Find suspicious commands (with -E, alternation is a plain |)
pdf-parser.py document.pdf | grep -iE "exec|system|shell|cmd"

# Identify ActiveX objects
pdf-parser.py document.pdf | grep -i "activex\|ocx"

# Locate plugins
pdf-parser.py document.pdf | grep -i "plugin\|extension"

Document Structure Analysis

Header and Trailer

# View PDF header
pdf-parser.py document.pdf | head -5

# Show trailer dictionary
pdf-parser.py document.pdf | grep -A 10 "trailer"

# Display cross-reference table
pdf-parser.py document.pdf | grep -E "xref|trailer"

# Identify PDF version
pdf-parser.py document.pdf | grep "%PDF"
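
The version can also be read straight from the file header without running the parser. This is a sketch under the assumption that the `%PDF-x.y` marker sits within the first 1024 bytes of the file; the function name `pdf_version` is hypothetical.

```shell
# pdf_version: print the version from the "%PDF-x.y" header of a file.
# Reads the file directly rather than pdf-parser.py output; only the first
# 1024 bytes are checked, which is an assumption about where the header sits.
pdf_version() {
    head -c 1024 "$1" | grep -a -o '%PDF-[0-9]\.[0-9]' | head -n 1 | sed 's/^%PDF-//'
}

# Usage (not run here):
#   pdf_version document.pdf
```

An empty result means no header was found in the inspected range, which is itself worth noting when triaging a suspicious file.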

Pages and Content

# Find page objects
pdf-parser.py -s /Pages document.pdf

# Extract page tree
pdf-parser.py document.pdf | grep -E "Pages|Kids|Parent"

# Get page count
pdf-parser.py -s /Count document.pdf | grep -i "count"

# Analyze page resources
pdf-parser.py -s /Resources document.pdf

Font Analysis

# Find embedded fonts
pdf-parser.py -s /Font document.pdf

# Extract font names
pdf-parser.py document.pdf | grep -E "/BaseFont|/FontName"

# Identify suspicious fonts
pdf-parser.py document.pdf | grep -i "font" | grep -v "standard"

Advanced Filtering

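Reference Analysis

# List objects that reference object 5 (-r takes an object ID)
pdf-parser.py -r 5 document.pdf

# Show only the headers and outgoing references of those objects
pdf-parser.py -r 5 document.pdf | grep -E "^obj|Referencing"

# Document statistics, including object counts by type
pdf-parser.py -a document.pdf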

Object References

# Find all references to object 5 (a plain grep for "5 0 R" would also match "15 0 R")
pdf-parser.py -r 5 document.pdf

# Map object dependencies
pdf-parser.py document.pdf | grep -E "obj|\/Type" | sort

# Create object graph
pdf-parser.py document.pdf | grep -E "[0-9]+ 0 R" | sort -u
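
To turn the reference listing into an explicit edge list, the object header and its `Referencing:` line can be paired up. The sketch below assumes output in which each `obj <id> <gen>` header is followed by a ` Referencing:` line with comma-separated `N G R` entries; the function name `reference_edges` is hypothetical.

```shell
# reference_edges: read pdf-parser.py text output on stdin and print one
# "source -> target" line per reference listed on a " Referencing:" line.
# Assumption: references are comma-separated entries such as "6 0 R, 7 0 R".
reference_edges() {
    awk '
        /^obj [0-9]+ [0-9]+/ { src = $2; next }
        /^ Referencing:/ {
            line = $0
            sub(/^ Referencing:[ ]*/, "", line)
            n = split(line, parts, /,[ ]*/)
            for (i = 1; i <= n; i++) {
                split(parts[i], f, " ")
                # Keep only entries that start with a numeric object ID
                if (f[1] ~ /^[0-9]+$/) print src " -> " f[1]
            }
        }
    '
}

# Usage (not run here):
#   pdf-parser.py document.pdf | reference_edges | sort -u
```

The resulting lines are easy to feed into a graph tool or to grep for every path that leads to a suspicious object.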

Batch Analysis

Process Multiple Files

#!/bin/bash
# Analyze all PDFs

for pdf in *.pdf; do
    echo "=== $pdf ==="
    pdf-parser.py "$pdf" | \
      grep -iE "javascript|launch|embeddedfile|openaction"
done

Comparative Analysis

#!/bin/bash
# Compare objects across PDFs

PDF1="document1.pdf"
PDF2="document2.pdf"

echo "Objects in $PDF1:"
pdf-parser.py "$PDF1" | grep "^obj" | wc -l

echo "Objects in $PDF2:"
pdf-parser.py "$PDF2" | grep "^obj" | wc -l

echo "Object differences between $PDF1 and $PDF2:"
pdf-parser.py "$PDF1" | grep "^obj" | sort > pdf1_objs.txt
pdf-parser.py "$PDF2" | grep "^obj" | sort > pdf2_objs.txt
diff pdf1_objs.txt pdf2_objs.txt
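
A set difference is often easier to read than a raw diff. The sketch below works on saved pdf-parser.py output files rather than on the PDFs themselves and assumes `obj <id> <gen>` header lines; the function name `ids_only_in_first` is hypothetical.

```shell
# ids_only_in_first: print object IDs present in the first saved
# pdf-parser.py output file but absent from the second.
# Both arguments are text files, e.g. from: pdf-parser.py a.pdf > a.txt
ids_only_in_first() {
    first_ids="$(mktemp)"
    second_ids="$(mktemp)"
    awk '/^obj [0-9]+ [0-9]+/ { print $2 }' "$1" | sort -u > "$first_ids"
    awk '/^obj [0-9]+ [0-9]+/ { print $2 }' "$2" | sort -u > "$second_ids"
    # comm -23 keeps lines unique to the first (lexically sorted) input
    comm -23 "$first_ids" "$second_ids"
    rm -f "$first_ids" "$second_ids"
}

# Usage (not run here):
#   pdf-parser.py document1.pdf > doc1.txt
#   pdf-parser.py document2.pdf > doc2.txt
#   ids_only_in_first doc1.txt doc2.txt
```

Note that this compares object IDs only; two files can share an ID while the object contents differ.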

Automated Threat Detection

#!/bin/bash
# Auto-detect suspicious content

PDF="$1"
THREAT_LEVEL=0

# Check for JavaScript
if pdf-parser.py "$PDF" | grep -qi "/JavaScript"; then
    echo "[CRITICAL] JavaScript found"
    THREAT_LEVEL=$((THREAT_LEVEL + 50))
fi

# Check for Launch
if pdf-parser.py "$PDF" | grep -qi "/Launch"; then
    echo "[CRITICAL] Launch action found"
    THREAT_LEVEL=$((THREAT_LEVEL + 50))
fi

# Check for embedded files
if pdf-parser.py "$PDF" | grep -qi "/EmbeddedFile"; then
    echo "[HIGH] Embedded file detected"
    THREAT_LEVEL=$((THREAT_LEVEL + 25))
fi

echo "Threat Level: $THREAT_LEVEL"
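
The script above runs the parser three times. A variant that scores already-captured output in a single pass might look like the sketch below; it reuses the same weights (/JavaScript 50, /Launch 50, /EmbeddedFile 25), and the function name `score_parser_output` is hypothetical.

```shell
# score_parser_output: read pdf-parser.py text output on stdin once and
# print a score using the weights /JavaScript 50, /Launch 50, /EmbeddedFile 25.
# Matching is case-insensitive; each keyword counts at most once.
score_parser_output() {
    awk '
        { line = tolower($0) }
        index(line, "/javascript")   { js = 1 }
        index(line, "/launch")       { launch = 1 }
        index(line, "/embeddedfile") { embedded = 1 }
        END { print js * 50 + launch * 50 + embedded * 25 }
    '
}

# Usage (not run here):
#   pdf-parser.py "$PDF" | score_parser_output
```

The score is only a triage hint: these keywords also appear in benign documents, so a nonzero value means "look closer", not "malicious".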

Encoding and Decoding

Handle Various Encodings

# Identify encoding in object
pdf-parser.py -o 10 document.pdf | grep -i "filter\|encoding"

# Show the stream before decoding (raw output) and after decoding (filtered)
pdf-parser.py -o 10 -w document.pdf
pdf-parser.py -o 10 -f document.pdf

# Extract hex-encoded strings
pdf-parser.py document.pdf | grep -oiE "<[0-9a-f]+>" | head -20

# Decode hex strings to ASCII
pdf-parser.py document.pdf | \
  grep -oiE "<[0-9a-f]+>" | \
  tr -d '<>' | \
  xxd -r -p

Compression Analysis

# Find FlateDecode objects
pdf-parser.py -s FlateDecode document.pdf

# Count compression types
pdf-parser.py document.pdf | \
  grep -i "filter" | \
  sort | \
  uniq -c
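
Counting whole `/Filter` lines groups filter chains together. To count individual filter names instead, something like the sketch below could be used; the name list covers the standard PDF filter names, and the function name `filter_histogram` is hypothetical.

```shell
# filter_histogram: read pdf-parser.py text output on stdin and print
# "<filter name> <count>" for each standard PDF filter name that appears.
filter_histogram() {
    grep -oE '/(FlateDecode|ASCII85Decode|ASCIIHexDecode|LZWDecode|RunLengthDecode|DCTDecode|CCITTFaxDecode|JBIG2Decode|JPXDecode)' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (not run here):
#   pdf-parser.py document.pdf | filter_histogram
```

Stacked filters such as `[/ASCII85Decode /FlateDecode]` contribute one count to each name, which makes unusual encoding chains easier to notice.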

# Extract and decompress
pdf-parser.py -o 15 -f -d decompressed.bin document.pdf

Integration with Other Tools

Combine with pdfid

#!/bin/bash
# Two-stage analysis

PDF="$1"

echo "=== pdfid Quick Scan ==="
pdfid.py "$PDF"

echo ""
echo "=== pdf-parser Detailed Analysis ==="
pdf-parser.py "$PDF" | head -100

Workflow with Strings

# Extract all strings from PDF
pdf-parser.py document.pdf | strings | grep -E "http|ftp|cmd|exec"

# Find URLs (inside a bracket expression \s is not whitespace, so use [:space:])
pdf-parser.py document.pdf | strings | grep -oE "https?://[^[:space:]]+"

# Extract email addresses
pdf-parser.py document.pdf | strings | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

Analysis Pipeline

#!/bin/bash
# Complete PDF forensic pipeline

PDF="$1"
REPORT="${PDF%.pdf}.analysis"

{
    echo "PDF Forensic Report: $PDF"
    echo "Generated: $(date)"
    echo ""
    
    echo "=== Object Count ==="
    pdf-parser.py "$PDF" | grep -c "^obj"
    
    echo ""
    echo "=== Suspicious Objects ==="
    pdf-parser.py -s /JavaScript "$PDF" | head -20
    
    echo ""
    echo "=== Streams ==="
    pdf-parser.py "$PDF" | grep -c "^ Contains stream"
    
    echo ""
    echo "=== References ==="
    pdf-parser.py "$PDF" | grep -E "[0-9]+ 0 R" | sort -u | head -10
    
} > "$REPORT"

cat "$REPORT"

Output Examples

Standard Object Display

obj 5 0
 Type: /Catalog
 Referencing: 6 0 R, 7 0 R

  <<
  /Type /Catalog
  /Pages 6 0 R
  /OpenAction 7 0 R
  >>

obj 7 0
 Type: /Action
 Referencing: 8 0 R

  <<
  /Type /Action
  /S /JavaScript
  /JS 8 0 R
  >>

Stream Object Display

obj 15 0
 Type: /XObject
 Referencing: 
 Contains stream

  <<
  /Type /XObject
  /Subtype /Image
  /Filter /FlateDecode
  /Width 100
  /Height 100
  >>
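
Output in this shape is straightforward to summarize. The sketch below counts objects per ` Type:` value, assuming each object prints a line that starts with a single space followed by `Type:`; the function name `type_histogram` is hypothetical, and objects with an empty type are reported as `(none)`.

```shell
# type_histogram: read pdf-parser.py text output on stdin and print
# "<count> <type>" per object type, most frequent first.
# Assumption: each object has a line starting with " Type:".
type_histogram() {
    awk '
        /^ Type:/ { t = ($2 == "" ? "(none)" : $2); count[t]++ }
        END { for (t in count) print count[t], t }
    ' | sort -rn
}

# Usage (not run here):
#   pdf-parser.py document.pdf | type_histogram
```

The tool's own `-a` option reports similar statistics; a filter like this is mainly useful when working from saved output.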

Performance Tips

Large PDF Handling

# Process incrementally
pdf-parser.py document.pdf | head -1000 > first_part.txt

# Use grep efficiently
pdf-parser.py document.pdf | grep "/JavaScript" | head -5

# Parallel analysis of multiple files (prints the names of PDFs that mention JavaScript)
find . -name "*.pdf" -type f -print0 | xargs -0 -P 4 -I {} \
  sh -c 'pdf-parser.py "$1" | grep -qi "/javascript" && echo "$1"' _ {}

Memory Optimization

# For huge PDFs, limit output
pdf-parser.py document.pdf | head -5000 > summary.txt

# Extract specific objects only
pdf-parser.py -o 5,6,7 document.pdf

# Use streaming grep
pdf-parser.py document.pdf | \
  grep "/JavaScript\|/Launch\|/EmbeddedFile"

Comparison with Alternatives

Feature	pdf-parser	pdfid	peepdf
Object extraction	Yes	No	Yes
Stream parsing	Yes	No	Yes
Malware detection	Limited	Yes	Yes
Interactive mode	No	No	Yes
Speed	Fast	Faster	Slow

Troubleshooting

Common Issues

Encoding Errors:

# Display malformed elements that may be causing the errors
pdf-parser.py -v document.pdf

# Try different locales
LC_ALL=en_US.UTF-8 pdf-parser.py document.pdf

Corrupted PDFs:

# pdf-parser is tolerant of corruption
pdf-parser.py corrupted.pdf

# Try rewriting the file first (qpdf attempts recovery of damaged files automatically)
qpdf corrupted.pdf fixed.pdf
pdf-parser.py fixed.pdf

Large File Performance:

# Extract specific objects to reduce output
pdf-parser.py -o 1 document.pdf

# Search inside stream contents separately (-s does not look inside streams)
pdf-parser.py --searchstream http document.pdf | head -100

Resources

Official Site: https://blog.didierstevens.com/programs/pdf-tools/
GitHub: https://github.com/DidierStevens/DidierStevensSuite
PDF Reference: https://www.adobe.io/open/standards/PDFRM.html
Examples: https://blog.didierstevens.com/2019/03/10/pdf-stream-lazy-eval/

Legal Considerations

Authorized Use

Analyze only PDFs you are authorized to examine
Use in isolated environments for suspicious files
Document analysis methodology
Comply with applicable laws and regulations

Safety Practices

Never render suspicious PDFs in standard viewers
Use virtual machines for hostile document analysis
Implement proper logging and audit trails
Follow organizational security policies