pdfid
Overview
pdfid is a triage tool that scans PDF documents for suspicious elements, embedded code, and potential malware indicators. Rather than parsing or rendering the PDF (which could trigger exploits), pdfid searches the file for a fixed list of PDF keywords, such as /JS, /Launch, and /OpenAction, and reports how often each one occurs. It is a useful first step for security researchers, incident responders, and malware analysts.
Key Features:
- Detect JavaScript and active content
- Identify suspicious keywords and objects
- Find embedded files (Flash, executables)
- Detect encryption (/Encrypt)
- Score documents through plugins such as plugin_triage
- Flag keyword combinations commonly seen in malicious PDFs
- Cross-platform compatibility
Installation
Linux/Debian
# Install from repository
sudo apt-get install pdfid
# Or from source
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite
# Copy pdfid.py to a directory on your PATH
chmod +x pdfid.py
sudo cp pdfid.py /usr/local/bin/
# Verify installation
pdfid.py --version
macOS
# Using Homebrew
brew install pdfid
# Or from source
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite
chmod +x pdfid.py
sudo cp pdfid.py /usr/local/bin/
Windows
# Download from official source
https://github.com/DidierStevens/DidierStevensSuite
# Requires Python 3
python pdfid.py document.pdf
# Add to PATH for convenience
setx PATH "%PATH%;C:\path\to\DidierStevensSuite"
Python Direct Installation
# Using pip
pip install pdfid
# Or clone and run the script directly (the suite is a collection of standalone scripts)
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite
python3 pdfid.py --version
Basic Usage
Quick PDF Scan
# Basic scan of PDF file
pdfid.py document.pdf
# Scan with verbose output
pdfid.py -v document.pdf
# Scan multiple files
pdfid.py *.pdf
# Output to file
pdfid.py document.pdf -o results.txt
Essential Commands
| Command | Purpose |
|---|---|
| pdfid.py file.pdf | Basic PDF analysis (counts of the default keywords) |
| pdfid.py -a file.pdf | Display all names found, not only the default keywords |
| pdfid.py -e file.pdf | Extra data such as dates and entropy |
| pdfid.py -p plugin_triage file.pdf | Load a plugin (here plugin_triage) |
| pdfid.py -f file.pdf | Force the scan even without a valid %PDF header |
| pdfid.py -v file.pdf | Verbose output |
Output Interpretation
Standard Output Analysis
# Illustrative excerpt. The structural counters (obj, endobj, stream, endstream,
# xref, trailer, startxref, /Page) and a few keywords such as /JavaScript are omitted.
# pdfid prints counts only; it does not print a verdict line.
$ pdfid.py suspicious.pdf
PDFiD 0.2.8 suspicious.pdf
 PDF Header: %PDF-1.5
 /Encrypt               0
 /ObjStm                2
 /JS                    3
 /AA                    1
 /OpenAction            1
 /AcroForm              1
 /JBIG2Decode           0
 /RichMedia             1
 /Launch                2
 /EmbeddedFile          1
 /XFA                   0
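For scripting, it helps to pull a single counter out of a report like the one above. The sketch below is a hypothetical helper (`pdfid_count` is not part of pdfid) that reads a report on stdin and prints the count for one keyword. It assumes one keyword per line, accepts the count on either side of the name, and treats a count with an obfuscation suffix such as `1(1)` as `1`.

```shell
#!/bin/sh
# pdfid_count KEYWORD: read a pdfid report on stdin, print that keyword's count.
# Hypothetical helper, not part of pdfid. Prints 0 when the keyword is absent.
pdfid_count() {
    awk -v key="$1" '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == key) {
                    # The count is the neighbouring field: after the name,
                    # or before it when the name is the last field.
                    value = (i < NF) ? $(i + 1) : $(i - 1)
                    count = value + 0   # "+ 0" turns "1(1)" into 1
                }
            }
        }
        END { print count + 0 }
    '
}
```

Typical use would be `pdfid.py suspicious.pdf | pdfid_count /JS`, which keeps the parsing in one place instead of repeating `grep | awk` pipelines.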
Element Dictionary
| Element | Risk Level | Description |
|---|---|---|
| /JS | Critical | JavaScript code |
| /Launch | Critical | Launch external programs |
| /SubmitForm | High | Form submission (data exfil) |
| /EmbeddedFile | High | Embedded files (possibly executables) |
| /OpenAction | High | Auto-execute on open |
| /AA | High | Auto-action events |
| /JBIG2Decode | Critical | JBIG2 codec (exploits) |
| /RichMedia | High | Rich media content |
| /Flash | Critical | Embedded Flash |
| /XFA | Medium | XFA forms |
| /AcroForm | Medium | Interactive forms |
| /ObjStm | Medium | Object streams |
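The table can be encoded as a lookup so that scripts label each element consistently. This is a minimal sketch: `risk_level` is a hypothetical helper name, and the levels are copied from the table above, not taken from pdfid itself.

```shell
#!/bin/sh
# risk_level ELEMENT: print the risk level assigned in the element table.
# Hypothetical helper; anything not in the table is reported as "Unrated".
risk_level() {
    case "$1" in
        /JS|/Launch|/JBIG2Decode|/Flash)
            echo "Critical" ;;
        /SubmitForm|/EmbeddedFile|/OpenAction|/AA|/RichMedia)
            echo "High" ;;
        /XFA|/AcroForm|/Acroform|/ObjStm)   # both spellings of AcroForm accepted
            echo "Medium" ;;
        *)
            echo "Unrated" ;;
    esac
}
```

For example, `risk_level /OpenAction` prints `High`, and an element outside the table prints `Unrated` so that gaps are visible instead of silently treated as safe.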
Detailed Analysis
Comprehensive Scanning
# List every name found in the document, not only the default keywords
pdfid.py -a document.pdf
# Include extra data (dates, entropy)
pdfid.py -e document.pdf
# Run the triage plugin for a scored result
pdfid.py -p plugin_triage document.pdf
# Force the scan when the %PDF header is missing or damaged
pdfid.py -f document.pdf
JavaScript Detection
# Count JavaScript keywords
pdfid.py document.pdf | grep -E "/JS|/JavaScript"
# pdfid only counts keywords; locate the objects that reference JavaScript with pdf-parser
pdf-parser.py --search javascript document.pdf
# Show one object with its stream filters applied (object 5 is an example)
pdf-parser.py -o 5 -f document.pdf
Suspicious Object Analysis
# Check for launch/execution objects
pdfid.py document.pdf | grep -E "/Launch|/OpenAction|/AA"
# Detect embedded files and object streams
pdfid.py document.pdf | grep -E "/EmbeddedFile|/ObjStm"
# Check encryption status (a nonzero /Encrypt count means the document is encrypted)
pdfid.py document.pdf | grep "/Encrypt"
Advanced Scanning Options
Entropy Analysis
# Show extra data, including entropy values
pdfid.py -e document.pdf
# Entropy measures randomness, in bits per byte (0 to 8)
# High entropy (close to 8): compression, encryption, or obfuscation
# Low entropy: likely plain text or repetitive data
# Show only the entropy lines
pdfid.py -e document.pdf | grep -i "entropy"
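To make the entropy figure concrete, the sketch below computes the Shannon entropy of a whole file, H = -Σ p·log2(p) over the byte values, which ranges from 0 (a single repeated byte) to 8 bits per byte (uniformly random bytes). `file_entropy` is a hypothetical helper that uses only `od` and `awk`; it is an independent cross-check and does not call pdfid.

```shell
#!/bin/sh
# file_entropy FILE: print the Shannon entropy of FILE in bits per byte.
# Hypothetical helper for cross-checking; it does not call pdfid.
file_entropy() {
    # od prints every byte as an unsigned decimal number, without offsets.
    od -An -v -tu1 "$1" | awk '
        { for (i = 1; i <= NF; i++) { freq[$i]++; total++ } }
        END {
            h = 0
            for (b in freq) {
                p = freq[b] / total
                h -= p * log(p) / log(2)   # H = -sum(p * log2(p))
            }
            printf "%.4f\n", h
        }
    '
}
```

A file made of one repeated byte yields `0.0000`, while a file that uses four byte values equally often yields `2.0000`. Compressed or encrypted streams push the value toward 8, so a high figure alone is not evidence of malice.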
Object Stream Analysis
# Analyze object streams (they can hide other objects from keyword scans)
pdfid.py document.pdf | grep "/ObjStm"
# Extract the number of object streams
pdfid.py document.pdf | awk '$1 == "/ObjStm" {print $2 + 0}'
# Inspect the objects stored inside object streams with pdf-parser
pdf-parser.py -O document.pdf
Obfuscation Detection
# Check for containers that can hide content
pdfid.py document.pdf | grep -E "/ObjStm|/Encrypt"
# Detect encoding and compression filters (these names are only listed with -a)
pdfid.py -a document.pdf | grep -E "FlateDecode|ASCIIHexDecode|ASCII85Decode"
# Find obfuscated names: pdfid shows their count in parentheses, e.g. 1(1)
pdfid.py document.pdf | grep "("
Batch Analysis
Process Multiple Files
#!/bin/bash
# Scan all PDFs in directory
for pdf in *.pdf; do
    echo "=== $pdf ==="
    pdfid.py "$pdf" | grep -E "/Encrypt|/JS|/JavaScript|/Launch"
done
Generate Risk Report
#!/bin/bash
# Create risk assessment report
OUTPUT="risk_report.txt"
echo "PDF Risk Assessment Report" > "$OUTPUT"
echo "Generated: $(date)" >> "$OUTPUT"
echo "" >> "$OUTPUT"
for pdf in *.pdf; do
    result=$(pdfid.py "$pdf" 2>&1)
    # pdfid prints no verdict, so flag files with a nonzero count for a risky keyword
    hits=$(echo "$result" | awk '$1 ~ /^\/(JS|JavaScript|Launch|AA|OpenAction)$/ && $2 + 0 > 0')
    if [ -n "$hits" ]; then
        echo "HIGH RISK: $pdf" >> "$OUTPUT"
        echo "$hits" >> "$OUTPUT"
        echo "" >> "$OUTPUT"
    fi
done
cat "$OUTPUT"
Automated Threat Classification
#!/bin/bash
# Classify PDFs by threat level
SAFE_DIR="safe"
SUSPICIOUS_DIR="suspicious"
MALICIOUS_DIR="malicious"
mkdir -p "$SAFE_DIR" "$SUSPICIOUS_DIR" "$MALICIOUS_DIR"
# Print the count for one keyword (0 if the keyword is absent)
count() { echo "$1" | awk -v k="$2" '$1 == k {n = $2 + 0} END {print n + 0}'; }
for pdf in *.pdf; do
    result=$(pdfid.py "$pdf")
    js=$(count "$result" /JS)
    launch=$(count "$result" /Launch)
    objstm=$(count "$result" /ObjStm)
    encrypt=$(count "$result" /Encrypt)
    if [ "$js" -gt 0 ] || [ "$launch" -gt 0 ]; then
        mv "$pdf" "$MALICIOUS_DIR/"
    elif [ "$objstm" -gt 0 ] || [ "$encrypt" -gt 0 ]; then
        mv "$pdf" "$SUSPICIOUS_DIR/"
    else
        mv "$pdf" "$SAFE_DIR/"
    fi
done
Integration with Other Tools
Combine with pdf-parser
# Use pdfid to identify issues
pdfid.py suspicious.pdf
# Then use pdf-parser to print statistics about the objects in the document
pdf-parser.py -a suspicious.pdf
# Inspect a specific object (object 5 is an example)
pdf-parser.py -o 5 suspicious.pdf
Workflow with pdf-triage
#!/bin/bash
# Multi-stage PDF security analysis
PDF="$1"
echo "1. Initial scan with pdfid"
pdfid.py "$PDF"
echo "2. Detailed structure analysis"
pdf-parser.py "$PDF" | head -50
echo "3. JavaScript objects (if present)"
pdf-parser.py --search javascript "$PDF"
VirusTotal Integration
#!/bin/bash
# Check PDF against VirusTotal
PDF="$1"
# First, scan locally with pdfid
echo "Local Analysis:"
pdfid.py "$PDF"
# Look up the file's SHA-256 on VirusTotal (requires API key; the file itself is not uploaded)
hash=$(sha256sum "$PDF" | awk '{print $1}')
curl "https://www.virustotal.com/api/v3/files/$hash" \
    -H "x-apikey: YOUR_API_KEY"
Risk Assessment Framework
Scoring System
#!/bin/bash
# Risk assessment based on elements found
PDF="$1"
SCORE=0
# Run pdfid once and reuse its output
RESULT=$(pdfid.py "$PDF")
# Print the count for one keyword (0 if the keyword is absent)
count() { echo "$RESULT" | awk -v k="$1" '$1 == k {n = $2 + 0} END {print n + 0}'; }
# Critical elements (50 points each)
JS=$(count /JS)
LAUNCH=$(count /Launch)
JBIG=$(count /JBIG2Decode)
SCORE=$((SCORE + JS*50 + LAUNCH*50 + JBIG*50))
# High risk elements (25 points each)
OPENACTION=$(count /OpenAction)
AA=$(count /AA)
SCORE=$((SCORE + OPENACTION*25 + AA*25))
echo "Risk Score: $SCORE"
if [ "$SCORE" -gt 100 ]; then
    echo "Status: MALICIOUS"
elif [ "$SCORE" -gt 50 ]; then
    echo "Status: SUSPICIOUS"
else
    echo "Status: SAFE"
fi
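The scoring rules above can also be packaged as a filter that reads a pdfid report from stdin, which makes the weights and thresholds testable without a PDF at hand. This is a sketch under stated assumptions: `score_report` is a hypothetical name, the weights (50 and 25) and thresholds (100 and 50) are the ones used in the script above, and the report is assumed to hold one keyword and its count per line.

```shell
#!/bin/sh
# score_report: read a pdfid report on stdin, print "<score> <status>".
# Hypothetical helper; weights and thresholds mirror the scoring script.
score_report() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                # Take the count from the field next to the keyword.
                n = ((i < NF) ? $(i + 1) : $(i - 1)) + 0
                if ($i == "/JS" || $i == "/Launch" || $i == "/JBIG2Decode")
                    score += 50 * n
                if ($i == "/OpenAction" || $i == "/AA")
                    score += 25 * n
            }
        }
        END {
            verdict = (score > 100) ? "MALICIOUS" : ((score > 50) ? "SUSPICIOUS" : "SAFE")
            print score + 0, verdict
        }
    '
}
```

Used as `pdfid.py document.pdf | score_report`, it prints a single line such as `75 SUSPICIOUS`. The labels are heuristic: a score of 0 means none of the weighted keywords were counted, not that the document is proven harmless.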
Common Threat Patterns
Malware Indicators
# Check for typical malware patterns
echo "=== Malware Detection Signatures ==="
result=$(pdfid.py document.pdf)
# Succeeds only when the keyword has a nonzero count
has() { echo "$result" | awk -v k="$1" '$1 == k && $2 + 0 > 0 {f = 1} END {exit !f}'; }
# Pattern 1: JavaScript + OpenAction (auto-execute)
has /JS && has /OpenAction && \
    echo "THREAT: Auto-executing JavaScript detected"
# Pattern 2: Embedded file + launch action
has /EmbeddedFile && has /Launch && \
    echo "THREAT: Executable payload detected"
# Pattern 3: JBIG2 exploit
has /JBIG2Decode && \
    echo "THREAT: JBIG2 filter present (see CVE-2009-0658)"
# Pattern 4: Obfuscated objects
echo "$result" | \
    awk '$1 == "/ObjStm" && $2 + 0 > 5 {print "THREAT: Excessive object streams (obfuscation)"}'
Ransomware Delivery
# Ransomware often uses these patterns
# Suspicious form submission (/SubmitForm is only listed with -a)
pdfid.py -a document.pdf | grep "/SubmitForm"
# External URL/callback
pdfid.py -a document.pdf | grep -E "/URI|/URL"
# Shellcode indicators
strings document.pdf | grep -iE "shellcode|payload|exploit"
Troubleshooting
Common Issues
Permission Denied:
# Check file permissions
ls -la document.pdf
# Fix permissions
chmod 644 document.pdf
Corrupted PDF Detection:
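# pdfid scans for strings, so it tolerates most corruption; use -f if the %PDF header is missing
pdfid.py -f corrupted.pdf
# Repair if possible (qpdf attempts recovery automatically when it rewrites a file)
qpdf corrupted.pdf fixed.pdf
pdfid.py fixed.pdf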
False Positives:
# Some legitimate PDFs trigger alerts
# Verify with manual inspection
pdfid.py document.pdf -v
# Extract and review suspicious objects
pdf-parser.py document.pdf | less
Performance Optimization
Batch Processing
# Process files in parallel (null-delimited so unusual file names survive)
find . -name "*.pdf" -type f -print0 | xargs -0 -P 4 -I {} pdfid.py {}
# With output to individual files
find . -name "*.pdf" -type f | while IFS= read -r pdf; do
    pdfid.py "$pdf" > "${pdf%.pdf}.analysis"
done
Large-Scale Scanning
#!/bin/bash
# Efficient large-scale PDF scanning
# Keep the per-file header and key counters, then show only the nonzero ones
time pdfid.py *.pdf 2>/dev/null | \
    grep -E "^PDFiD|/Encrypt|/JS|/JavaScript|/Launch" | \
    tee scan_results.txt | \
    grep -vE " 0$"
echo "Scan complete: $(date)"
Output Formats
Text Output
pdfid.py document.pdf
# Standard human-readable output
JSON Output (if available)
pdfid.py -j document.pdf
# Machine-readable JSON format; confirm with pdfid.py --help that your version provides -j
CSV Export
# Generate CSV from multiple scans (pdfid prints no verdict, so export counts)
echo "file,js,launch"
for pdf in *.pdf; do
    result=$(pdfid.py "$pdf")
    js=$(echo "$result" | awk '$1 == "/JS" {print $2 + 0}')
    launch=$(echo "$result" | awk '$1 == "/Launch" {print $2 + 0}')
    echo "\"$pdf\",${js:-0},${launch:-0}"
done > results.csv
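File names that contain a double quote would break the CSV produced by the loop above. A small, hypothetical helper (`csv_field` is an assumed name) can quote each field by wrapping it in double quotes and doubling any quote inside it, which is the usual CSV convention.

```shell
#!/bin/sh
# csv_field VALUE: print VALUE as one quoted CSV field.
# Hypothetical helper: wraps the value in double quotes and doubles
# any embedded double quote so the field stays intact.
csv_field() {
    printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}
```

In the export loop, the file name column could then be written with `csv_field "$pdf"` instead of hand-built quoting.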
Comparison with Alternatives
| Tool | PDF Scan | Malware Detection | Speed | Platform |
|---|---|---|---|---|
| pdfid | Yes | Yes | Fast | Cross-platform |
| pdf-parser | Limited | No | Moderate | Cross-platform |
| peepdf | Yes | Limited | Slow | Python |
| exiftool | Limited | No | Fast | Cross-platform |
Resources
- Official Site: https://blog.didierstevens.com/programs/pdf-tools/
- GitHub: https://github.com/DidierStevens/DidierStevensSuite
- PDF Specs: https://www.adobe.io/open/standards/PDFRM.html
- Malware Samples: https://www.malware-traffic-analysis.net/
Legal and Ethical Considerations
Proper Use
- Analyze PDFs from known sources only
- Use in isolated/sandbox environments for suspicious files
- Document analysis methodology
- Comply with local regulations
Caution
- Do not open suspicious PDFs in standard viewers
- Use virtual machines for high-risk analysis
- Implement proper logging and documentation
- Follow organizational security policies