pdf-parser
Overview
pdf-parser is a sophisticated tool for analyzing PDF document structure at the object level. It displays PDF internals including streams, dictionaries, encodings, and object relationships without rendering the document. Security researchers, forensic analysts, and penetration testers use pdf-parser to understand PDF architecture, identify embedded objects, extract suspicious code, and debug PDF vulnerabilities.
Capabilities:
- Extract individual PDF objects
- Parse stream contents and encodings
- Display object relationships
- Analyze document structure
- Identify embedded files and code
- Decode various PDF encodings
- Search for specific patterns
- Report document statistics
Installation
Linux/Debian
# Install from the distribution repository (on Kali Linux the package is named pdf-parser; other distributions may not ship it)
sudo apt-get install pdf-parser
# Or standalone
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite
# Copy pdf-parser.py to PATH
macOS
# Using Homebrew (assumes a pdf-parser formula is available to your Homebrew installation)
brew install pdf-parser
# Or from source
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite
chmod +x pdf-parser.py
sudo cp pdf-parser.py /usr/local/bin/
Windows
# Download from the DidierStevensSuite repository
https://github.com/DidierStevens/DidierStevensSuite
# Requires Python 3
python pdf-parser.py document.pdf
Python Installation
# Direct Python execution
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite
# Run directly
python pdf-parser.py document.pdf
Basic Usage
Quick Start
# Basic PDF structure dump
pdf-parser.py document.pdf
# Show first 50 lines
pdf-parser.py document.pdf | head -50
# Show document statistics (object counts by type)
pdf-parser.py -a document.pdf | head -100
# Also display malformed PDF elements
pdf-parser.py -v document.pdf
Essential Commands
| Command | Purpose |
|---|---|
| pdf-parser.py file.pdf | Display all objects |
| pdf-parser.py -o 5 file.pdf | Select object 5 |
| pdf-parser.py -a file.pdf | Display document statistics |
| pdf-parser.py -r 5 file.pdf | List objects that reference object 5 |
| pdf-parser.py -f file.pdf | Pass streams through their filters (decode) |
| pdf-parser.py -g file.pdf | Generate a Python program that recreates the parsed PDF |
| pdf-parser.py -s keyword file.pdf | Search for a keyword in objects (stream data excluded) |
Object Extraction
Extract Specific Objects
# Extract object by ID
pdf-parser.py -o 5 document.pdf
# Extract multiple objects
pdf-parser.py -o 5,7,10 document.pdf
# Also parse objects stored inside object streams (/ObjStm), then show object 5
pdf-parser.py -O document.pdf | grep "^obj 5 " -A 50
# Extract with decoded streams (-f passes the stream through its filters)
pdf-parser.py -o 5 -f document.pdf
Object Structure Analysis
# Display object relationships
pdf-parser.py document.pdf | grep -E "obj|stream|endobj"
# Show all object IDs
pdf-parser.py document.pdf | grep "^obj" | awk '{print $2}'
# Count total objects
pdf-parser.py document.pdf | grep -c "^obj"
# Find objects by type (-t selects indirect objects by their /Type value)
pdf-parser.py -t /Font document.pdf
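When the object list needs further processing, the grep pipelines above can be mirrored in Python. This is a minimal sketch, assuming the text output uses `obj <id> <generation>` header lines followed by an indented `Type:` line (the layout shown under Output Examples). The function name `parse_objects` and the sample text are hypothetical, not part of pdf-parser.

```python
import re

# Hypothetical sample of pdf-parser.py text output (assumed layout).
SAMPLE = """\
obj 5 0
 Type: /Catalog
 Referencing: 6 0 R

obj 6 0
 Type: /Pages
 Referencing: 9 0 R
"""

OBJ_RE = re.compile(r"^obj (\d+) (\d+)\s*$")
TYPE_RE = re.compile(r"^\s*Type:\s*(\S*)")


def parse_objects(text):
    """Return a list of (object_id, generation, type_name) tuples."""
    objects = []
    current = None
    for line in text.splitlines():
        m = OBJ_RE.match(line)
        if m:
            # A new object header starts a new record.
            current = [int(m.group(1)), int(m.group(2)), ""]
            objects.append(current)
            continue
        m = TYPE_RE.match(line)
        if m and current is not None:
            current[2] = m.group(1)
    return [tuple(o) for o in objects]


if __name__ == "__main__":
    for obj_id, gen, type_name in parse_objects(SAMPLE):
        print(obj_id, gen, type_name or "(no /Type)")
```

In practice the text would come from saved pdf-parser.py output rather than the embedded sample.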
Stream Analysis
Extract and Decode Streams
# Get stream from object
pdf-parser.py -o 10 document.pdf | grep -A 20 "stream"
# Extract decoded stream (-f passes the stream through its filters)
pdf-parser.py -o 10 -f document.pdf
# Show stream filters/encoding
pdf-parser.py -o 10 document.pdf | grep -i "filter\|decode"
# Dump raw (still encoded) stream data to a file
pdf-parser.py -o 10 -d stream10.bin document.pdf
Common Stream Types
| Filter or Type | Description |
|---|---|
| FlateDecode | Zlib/deflate compression |
| ASCII85Decode | Base85 (ASCII85) encoded |
| ASCIIHexDecode | Hex encoded |
| LZWDecode | LZW compression |
| (no /Filter) | Stored without encoding |
| XObject | Images/graphics (object type, not a filter) |
| Font | Embedded fonts (object type, not a filter) |
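Three of the filters in the table can be reproduced with the Python standard library, which helps when checking a decoded stream independently of pdf-parser. The sketch below is illustrative only: the function names and the `DECODERS` mapping are hypothetical, LZWDecode is omitted because the standard library has no LZW codec, and filters are applied in the order they are listed in `/Filter`.

```python
import base64
import zlib


def decode_flate(data: bytes) -> bytes:
    """FlateDecode: zlib/deflate decompression."""
    return zlib.decompress(data)


def decode_ascii85(data: bytes) -> bytes:
    """ASCII85Decode: PDF data ends with '~>' and usually has no '<~' prefix."""
    body = data.strip()
    if not body.startswith(b"<~"):
        body = b"<~" + body  # add the prefix so the framing is explicit
    return base64.a85decode(body, adobe=True)


def decode_asciihex(data: bytes) -> bytes:
    """ASCIIHexDecode: hex digits, whitespace ignored, '>' marks end of data."""
    digits = b"".join(data.split(b">")[0].split())
    if len(digits) % 2:
        digits += b"0"  # a missing final digit is treated as 0
    return bytes.fromhex(digits.decode("ascii"))


# Hypothetical lookup table keyed by PDF filter name.
DECODERS = {
    "/FlateDecode": decode_flate,
    "/ASCII85Decode": decode_ascii85,
    "/ASCIIHexDecode": decode_asciihex,
}


def apply_filters(data: bytes, filters):
    """Apply each named filter in the order given."""
    for name in filters:
        data = DECODERS[name](data)
    return data
```

For a stream declared as `/Filter [/ASCII85Decode /FlateDecode]`, the call would be `apply_filters(raw, ["/ASCII85Decode", "/FlateDecode"])`.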
Decode Streams
# Dump the still-compressed stream to a file
pdf-parser.py -o 15 -d object15.bin document.pdf
# Decompress a FlateDecode stream and dump the result
pdf-parser.py -o 15 -f -d object15_decoded.bin document.pdf
# Decode ASCII85
pdf-parser.py -o 20 -f document.pdf
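Malicious or damaged PDFs often carry truncated FlateDecode streams that make a strict decompressor fail outright. As a sketch of one way to salvage them, the hypothetical helper below feeds the raw stream bytes to `zlib.decompressobj` in small chunks and returns whatever was decoded before an error. It assumes the raw stream has already been saved to a file, for example with the `-d` option.

```python
import zlib


def inflate_tolerant(raw: bytes) -> bytes:
    """Decompress zlib data, keeping whatever decoded before an error."""
    decomp = zlib.decompressobj()
    out = bytearray()
    try:
        # Small chunks mean output produced before a corrupt region is kept.
        for i in range(0, len(raw), 64):
            out += decomp.decompress(raw[i:i + 64])
        out += decomp.flush()
    except zlib.error:
        pass  # return the partial result instead of raising
    return bytes(out)
```

Usage would look like `inflate_tolerant(open("stream.bin", "rb").read())`, where `stream.bin` is a placeholder file name.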
Search and Pattern Matching
Search for Keywords
# Search for suspicious keywords
pdf-parser.py -s /JavaScript document.pdf
# Search for launch objects
pdf-parser.py -s /Launch document.pdf
# Multiple keyword search
pdf-parser.py -s /AA document.pdf
pdf-parser.py -s /OpenAction document.pdf
# Case-insensitive search
pdf-parser.py -s javascript document.pdf
Pattern Detection
# Find all embedded files (the PDF names are /EmbeddedFile and /Filespec)
pdf-parser.py document.pdf | grep -i "embeddedfile\|filespec"
# Identify JavaScript code
pdf-parser.py -s /JavaScript document.pdf
# Locate form objects
pdf-parser.py -s /AcroForm document.pdf
# Find encryption references
pdf-parser.py -s /Encrypt document.pdf
Grep-Based Analysis
# Show lines containing "http" with five lines of preceding context
pdf-parser.py document.pdf | grep -B 5 "http"
# Find suspicious commands
pdf-parser.py document.pdf | grep -iE "exec|system|shell|cmd"
# Identify ActiveX objects
pdf-parser.py document.pdf | grep -i "activex\|ocx"
# Locate plugins
pdf-parser.py document.pdf | grep -i "plugin\|extension"
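Plain grep on a raw PDF can miss names that are obfuscated with `#xx` hex escapes, such as `/J#61vaScript`. The following sketch normalizes name tokens before matching them against a keyword list. The keyword tuple and function names are illustrative assumptions, and the scan only sees names outside compressed streams.

```python
import re
from collections import Counter

# Illustrative keyword list; adjust to the investigation at hand.
SUSPICIOUS = ("/JavaScript", "/JS", "/Launch", "/OpenAction", "/AA", "/EmbeddedFile")

NAME_RE = re.compile(rb"/[^\s/<>\[\]()%{}]+")  # one PDF name token
HEX_ESCAPE_RE = re.compile(rb"#([0-9A-Fa-f]{2})")


def normalize_name(name: bytes) -> str:
    """Resolve #xx escapes, e.g. /J#61vaScript becomes /JavaScript."""
    decoded = HEX_ESCAPE_RE.sub(lambda m: bytes([int(m.group(1), 16)]), name)
    return decoded.decode("latin-1")


def count_suspicious(data: bytes) -> Counter:
    """Count suspicious names in raw PDF bytes after normalization."""
    counts = Counter()
    for token in NAME_RE.findall(data):
        name = normalize_name(token)
        if name in SUSPICIOUS:
            counts[name] += 1
    return counts
```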
Document Structure Analysis
Header and Trailer
# View PDF header
pdf-parser.py document.pdf | head -5
# Show trailer dictionary
pdf-parser.py document.pdf | grep -A 10 "trailer"
# Display cross-reference table
pdf-parser.py document.pdf | grep -E "xref|trailer"
# Identify PDF version
pdf-parser.py document.pdf | grep "%PDF"
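The header version and the final `startxref` offset can also be read straight from the file bytes, which is useful when a file is too damaged to parse. A minimal sketch, assuming the header appears within the first 1024 bytes (a leniency many readers apply); `pdf_header_info` is a hypothetical helper name.

```python
import re


def pdf_header_info(data: bytes):
    """Return (version, startxref_offset); either may be None."""
    # Look for the %PDF-x.y marker near the start of the file.
    m = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    version = m.group(1).decode("ascii") if m else None
    # The last startxref keyword points at the most recent xref section.
    offsets = re.findall(rb"startxref\s+(\d+)", data)
    startxref = int(offsets[-1]) if offsets else None
    return version, startxref
```

More than one `startxref` in a file indicates incremental updates, which can be worth a closer look during forensic analysis.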
Pages and Content
# Find page objects
pdf-parser.py -s /Pages document.pdf
# Extract page tree
pdf-parser.py document.pdf | grep -E "Pages|Kids|Parent"
# Get page count
pdf-parser.py -s /Count document.pdf | grep -i "count"
# Analyze page resources
pdf-parser.py -s /Resources document.pdf
Font Analysis
# Find embedded fonts
pdf-parser.py -s /Font document.pdf
# Extract font names
pdf-parser.py document.pdf | grep -E "/BaseFont|/FontName"
# Identify suspicious fonts
pdf-parser.py document.pdf | grep -i "font" | grep -v "standard"
Advanced Filtering
Reference Analysis
# Show objects that reference object 5 (-r takes an object ID)
pdf-parser.py -r 5 document.pdf
# Show only the headers of those referencing objects
pdf-parser.py -r 5 document.pdf | grep -E "^obj|Referencing"
# Walk one level up: find what references one of those objects (here, object 3)
pdf-parser.py -r 3 document.pdf
Object References
# Find all references to object 5 (the pattern avoids matching 15 0 R, 25 0 R, and so on)
pdf-parser.py document.pdf | grep -E "(^|[^0-9])5 0 R"
# Map object dependencies
pdf-parser.py document.pdf | grep -E "obj|\/Type" | sort
# Create object graph
pdf-parser.py document.pdf | grep -E "[0-9]+ 0 R" | sort -u
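The object graph hinted at above can be built explicitly from saved output. This sketch assumes each object block has an `obj <id> <generation>` header and a `Referencing:` line listing `N G R` references, as in the Output Examples section; both function names are hypothetical.

```python
import re

OBJ_RE = re.compile(r"^obj (\d+) \d+")
REF_RE = re.compile(r"(\d+) \d+ R")


def build_reference_graph(text):
    """Map each object id to the sorted ids on its 'Referencing:' line."""
    graph = {}
    current = None
    for line in text.splitlines():
        m = OBJ_RE.match(line)
        if m:
            current = int(m.group(1))
            graph[current] = []
        elif current is not None and line.strip().startswith("Referencing:"):
            graph[current] = sorted({int(n) for n in REF_RE.findall(line)})
    return graph


def referenced_by(graph, target):
    """Return ids of objects that reference `target`."""
    return sorted(src for src, refs in graph.items() if target in refs)
```

With the graph in hand, `referenced_by(graph, 8)` answers which objects point at object 8 without re-running the parser.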
Batch Analysis
Process Multiple Files
#!/bin/bash
# Analyze all PDFs
for pdf in *.pdf; do
echo "=== $pdf ==="
pdf-parser.py "$pdf" | \
grep -iE "javascript|launch|embeddedfile|openaction"
done
Comparative Analysis
#!/bin/bash
# Compare objects across PDFs
PDF1="document1.pdf"
PDF2="document2.pdf"
echo "Objects in $PDF1:"
pdf-parser.py "$PDF1" | grep "^obj" | wc -l
echo "Objects in $PDF2:"
pdf-parser.py "$PDF2" | grep "^obj" | wc -l
echo "Object headers that differ between $PDF1 and $PDF2:"
pdf-parser.py "$PDF1" | grep "^obj" | sort > pdf1_objs.txt
pdf-parser.py "$PDF2" | grep "^obj" | sort > pdf2_objs.txt
diff pdf1_objs.txt pdf2_objs.txt
Automated Threat Detection
#!/bin/bash
# Auto-detect suspicious content
PDF="$1"
THREAT_LEVEL=0
# Check for JavaScript
if pdf-parser.py "$PDF" | grep -qi "/JavaScript"; then
echo "[CRITICAL] JavaScript found"
THREAT_LEVEL=$((THREAT_LEVEL + 50))
fi
# Check for Launch
if pdf-parser.py "$PDF" | grep -qi "/Launch"; then
echo "[CRITICAL] Launch action found"
THREAT_LEVEL=$((THREAT_LEVEL + 50))
fi
# Check for embedded files
if pdf-parser.py "$PDF" | grep -qi "/EmbeddedFile"; then
echo "[HIGH] Embedded file detected"
THREAT_LEVEL=$((THREAT_LEVEL + 25))
fi
echo "Threat Level: $THREAT_LEVEL"
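The shell script above runs the parser once per check. The same scoring can be applied to a single captured output in Python, as sketched here. The weights mirror the script and are illustrative rather than calibrated, and `score_output` is a hypothetical name.

```python
# Weights mirror the shell script; they are illustrative, not calibrated.
WEIGHTS = {
    "/JavaScript": ("CRITICAL", 50),
    "/Launch": ("CRITICAL", 50),
    "/EmbeddedFile": ("HIGH", 25),
}


def score_output(text):
    """Return (threat_level, findings) for pdf-parser text output."""
    lowered = text.lower()  # case-insensitive, like grep -i
    level = 0
    findings = []
    for keyword, (severity, weight) in WEIGHTS.items():
        if keyword.lower() in lowered:
            findings.append(f"[{severity}] {keyword} found")
            level += weight
    return level, findings
```

Keeping the weights in one table makes it easy to add further keywords such as `/OpenAction` without duplicating the check logic.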
Encoding and Decoding
Handle Various Encodings
# Identify encoding in object
pdf-parser.py -o 10 document.pdf | grep -i "filter\|encoding"
# Show the object before and after filter decoding
pdf-parser.py -o 10 document.pdf
pdf-parser.py -o 10 -f document.pdf
# Extract hex-encoded strings
pdf-parser.py document.pdf | grep -oE "<[A-Fa-f0-9]+>" | head -20
# Decode hex to ASCII
pdf-parser.py document.pdf | \
grep -oE "<[A-Fa-f0-9]+>" | \
tr -d '<>' | \
xxd -r -p
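Hex strings in PDFs may contain whitespace, may have an odd number of digits, and are often UTF-16BE text marked with an `FEFF` byte order mark, none of which a plain `xxd` pipeline handles. The sketch below covers those cases; `decode_hex_strings` is a hypothetical helper and the Latin-1 fallback is an assumption made for readability.

```python
import re

HEX_STRING_RE = re.compile(r"<([0-9A-Fa-f\s]+)>")


def decode_hex_strings(text):
    """Decode every <hex> string found in the text to str."""
    results = []
    for match in HEX_STRING_RE.finditer(text):
        digits = "".join(match.group(1).split())  # drop embedded whitespace
        if not digits:
            continue  # e.g. the inner "< >" of an empty dictionary "<< >>"
        if len(digits) % 2:
            digits += "0"  # a missing final digit is assumed to be 0
        raw = bytes.fromhex(digits)
        if raw.startswith(b"\xfe\xff"):
            # Byte order mark: the rest is UTF-16BE text.
            results.append(raw[2:].decode("utf-16-be", errors="replace"))
        else:
            results.append(raw.decode("latin-1"))
    return results
```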
Compression Analysis
# Find FlateDecode objects
pdf-parser.py -s FlateDecode document.pdf
# Count compression types
pdf-parser.py document.pdf | \
grep -i "filter" | \
sort | \
uniq -c
# Extract and decompress (-f applies the filters, -d dumps the result to a file)
pdf-parser.py -o 15 -f -d decompressed.bin document.pdf
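The `sort | uniq -c` count treats a filter array such as `[/ASCII85Decode /FlateDecode]` as one distinct line. To count each filter name separately, something like the following sketch could be used on saved output; `count_filters` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches "/Filter /Name" as well as "/Filter [/Name1 /Name2]".
FILTER_RE = re.compile(r"/Filter\s*(\[[^\]]*\]|/\w+)")


def count_filters(text):
    """Count every filter name, including those inside filter arrays."""
    counts = Counter()
    for value in FILTER_RE.findall(text):
        counts.update(re.findall(r"/\w+", value))
    return counts
```

Streams with long filter chains stand out in the result, which is a common sign of deliberate obfuscation.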
Integration with Other Tools
Combine with pdfid
#!/bin/bash
# Two-stage analysis
PDF="$1"
echo "=== pdfid Quick Scan ==="
pdfid.py "$PDF"
echo ""
echo "=== pdf-parser Detailed Analysis ==="
pdf-parser.py "$PDF" | head -100
Workflow with Strings
# Extract all strings from PDF
pdf-parser.py document.pdf | strings | grep -E "http|ftp|cmd|exec"
# Find URLs
pdf-parser.py document.pdf | strings | grep -oE "https?://[^[:space:]]+"
# Extract email addresses
pdf-parser.py document.pdf | strings | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
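For reporting, it is often handy to collect URLs and email addresses once, de-duplicated and in first-seen order. A small sketch along the lines of the grep patterns above, with `extract_indicators` as a hypothetical name; the URL pattern stops at whitespace, quotes, angle brackets, and parentheses so that PDF string delimiters are not captured.

```python
import re

URL_RE = re.compile(r"https?://[^\s<>()\"']+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_indicators(text):
    """Return de-duplicated URLs and email addresses in first-seen order."""
    # dict.fromkeys keeps insertion order while dropping repeats.
    urls = list(dict.fromkeys(URL_RE.findall(text)))
    emails = list(dict.fromkeys(EMAIL_RE.findall(text)))
    return urls, emails
```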
Analysis Pipeline
#!/bin/bash
# Complete PDF forensic pipeline
PDF="$1"
REPORT="${PDF%.pdf}.analysis"
{
echo "PDF Forensic Report: $PDF"
echo "Generated: $(date)"
echo ""
echo "=== Object Count ==="
pdf-parser.py "$PDF" | grep -c "^obj"
echo ""
echo "=== Suspicious Objects ==="
pdf-parser.py -s /JavaScript "$PDF" | head -20
echo ""
echo "=== Streams ==="
pdf-parser.py "$PDF" | grep -c "Contains stream"
echo ""
echo "=== References ==="
pdf-parser.py "$PDF" | grep -E "[0-9]+ 0 R" | sort -u | head -10
} > "$REPORT"
cat "$REPORT"
Output Examples
Standard Object Display
obj 5 0
Type: /Catalog
Referencing: 6 0 R, 7 0 R
<<
/Type /Catalog
/Pages 6 0 R
/OpenAction 7 0 R
>>
obj 7 0
Type: /Action
Referencing: 8 0 R
<<
/Type /Action
/S /JavaScript
/JS 8 0 R
>>
Stream Object Display
obj 15 0
Type: /XObject
Referencing:
Contains stream
<<
/Type /XObject
/Subtype /Image
/Filter /FlateDecode
/Width 100
/Height 100
/Length 1024
>>
Performance Tips
Large PDF Handling
# Process incrementally
pdf-parser.py document.pdf | head -1000 > first_part.txt
# Use grep efficiently
pdf-parser.py document.pdf | grep "/JavaScript" | head -5
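# Parallel analysis of multiple files (prints the names of PDFs whose output mentions JavaScript)
find . -name "*.pdf" -type f -print0 | xargs -0 -P 4 -I {} \
sh -c 'pdf-parser.py "$1" | grep -qi "javascript" && echo "$1"' _ {}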
Memory Optimization
# For huge PDFs, limit output
pdf-parser.py document.pdf | head -5000 > summary.txt
# Extract specific objects only
pdf-parser.py -o 5,6,7 document.pdf
# Use streaming grep
pdf-parser.py document.pdf | \
grep "/JavaScript\|/Launch\|/EmbeddedFile"
Comparison with Alternatives
| Feature | pdf-parser | pdfid | peepdf |
|---|---|---|---|
| Object extraction | Yes | No | Yes |
| Stream parsing | Yes | No | Yes |
| Malware detection | Limited | Yes | Yes |
| Interactive mode | No | No | Yes |
| Speed | Fast | Faster | Slow |
Troubleshooting
Common Issues
Encoding Errors:
# Confirm the file parses by printing document statistics
pdf-parser.py -a document.pdf
# Try different locales
LC_ALL=en_US.UTF-8 pdf-parser.py document.pdf
Corrupted PDFs:
# pdf-parser is tolerant of corruption
pdf-parser.py corrupted.pdf
# Try repair first
qpdf --repair corrupted.pdf fixed.pdf
pdf-parser.py fixed.pdf
Large File Performance:
# Extract specific objects to reduce output
pdf-parser.py -o 1 document.pdf
# Search stream contents separately, only when needed (-s does not look inside streams)
pdf-parser.py --searchstream=http document.pdf | head -100
Resources
- Official Site: https://blog.didierstevens.com/programs/pdf-tools/
- GitHub: https://github.com/DidierStevens/DidierStevensSuite
- PDF Reference: https://www.adobe.io/open/standards/PDFRM.html
- Examples: https://blog.didierstevens.com/2019/03/10/pdf-stream-lazy-eval/
Legal Considerations
Authorized Use
- Analyze PDFs from trusted sources
- Use in isolated environments for suspicious files
- Document analysis methodology
- Comply with applicable laws and regulations
Safety Practices
- Never render suspicious PDFs in standard viewers
- Use virtual machines for hostile document analysis
- Implement proper logging and audit trails
- Follow organizational security policies