pdfid

Overview

pdfid is a triage tool by Didier Stevens that scans a PDF file for a fixed set of keywords and names (such as /JS, /OpenAction, and /Launch) and reports how often each one occurs. It does not parse or render the document, which could trigger exploits. Because it works as a string scanner, it also copes with malformed files and with hex-obfuscated names. The counts tell an analyst which files deserve a closer look with a full parser such as pdf-parser, which makes pdfid a common first step for security researchers, incident responders, and malware analysts.

Key Features:

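Count JavaScript and auto-action names (/JS, /JavaScript, /AA, /OpenAction)
Identify suspicious keywords and objects
Flag embedded files and rich media such as Flash (/EmbeddedFile, /RichMedia)
Report encryption (/Encrypt)
Score documents through plugins such as plugin_triage
Detect hex-obfuscated names and optionally disarm JavaScript and auto-launch (-d)
Cross-platform compatibility (any system with Python)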

Installation

Linux/Debian

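# Install from repository (packaged on Kali and some other Debian-based
# distributions; the command may be installed as pdfid rather than pdfid.py)
sudo apt-get install pdfid

# Or from source
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite
# Copy pdfid.py to a directory on your PATH
chmod +x pdfid.py
sudo cp pdfid.py /usr/local/bin/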

# Verify installation
pdfid.py --version

macOS

# Using Homebrew (confirm that a formula is available first: brew search pdfid)
brew install pdfid

# Or from source
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite
chmod +x pdfid.py
sudo cp pdfid.py /usr/local/bin/

Windows

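# Download the suite from the official repository:
# https://github.com/DidierStevens/DidierStevensSuite

# Requires Python 3
python pdfid.py document.pdf

# Add to PATH for convenience (setx truncates values longer than 1024
# characters, so check the length of PATH first)
setx PATH "%PATH%;C:\path\to\DidierStevensSuite"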

Python Direct Installation

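# Using pip (installs a PyPI package that bundles pdfid; the command name
# it provides may differ from pdfid.py)
pip install pdfid

# Or clone and run the script directly (the suite is a collection of
# standalone scripts, so there is no package to install with pip)
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite
python3 pdfid.py document.pdf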

Basic Usage

Quick PDF Scan

# Basic scan of PDF file
pdfid.py document.pdf

# Scan with verbose output
pdfid.py -v document.pdf

# Scan multiple files
pdfid.py *.pdf

# Output to file
pdfid.py document.pdf -o results.txt

Essential Commands

Command	Purpose
`pdfid.py file.pdf`	Basic PDF analysis
`pdfid.py -a file.pdf`	Display all names found, not only the default set
`pdfid.py -e file.pdf`	Extra data (dates, entropy, %%EOF counts)
`pdfid.py -p plugin_triage file.pdf`	Run a plugin
`pdfid.py -n file.pdf`	Hide names with a zero count
`pdfid.py -v file.pdf`	Verbose output

Output Interpretation

Standard Output Analysis

# Illustrative output; the counts are example values
$ pdfid.py suspicious.pdf
PDFiD 0.2.8 suspicious.pdf
 PDF Header: %PDF-1.5
 obj                   24
 endobj                24
 stream                 6
 endstream              6
 xref                   1
 trailer                1
 startxref              1
 /Page                  2
 /Encrypt               0
 /ObjStm                2
 /JS                    3
 /JavaScript            3
 /AA                    1
 /OpenAction            1
 /AcroForm              1
 /JBIG2Decode           0
 /RichMedia             1
 /Launch                2
 /EmbeddedFile          1
 /XFA                   0
 /URI                   0
 /Colors > 2^24         0

# pdfid prints counts only; it does not print a verdict. Each line is a
# name followed by its count, and the analyst interprets the combination.

Element Dictionary

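Element	Risk Level	Description
/JS	Critical	JavaScript code
/JavaScript	Critical	JavaScript action type (usually appears together with /JS)
/Launch	Critical	Launches an external program or command
/SubmitForm	High	Form submission (possible data exfiltration); listed only with -a
/EmbeddedFile	High	Embedded file stream (any file type, including executables)
/OpenAction	High	Action run automatically when the document opens
/AA	High	Additional actions triggered by events such as page open
/JBIG2Decode	Critical	JBIG2 compression filter (exploited in the past, e.g. CVE-2009-0658)
/RichMedia	High	Rich media content, including embedded Flash
/XFA	Medium	XML Forms Architecture forms
/AcroForm	Medium	Interactive forms
/ObjStm	Medium	Object streams (can hide other objects from a keyword scan)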
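
Since pdfid prints each name followed by its count, a single count can be pulled out with a small helper. This is a minimal sketch that assumes the default "name count" layout; pdfid_count is a hypothetical helper name, not part of pdfid.

```shell
# pdfid_count NAME: read pdfid output on stdin and print the count for NAME.
# Assumes the default "name count" layout. A count such as "1(1)" (one
# occurrence, which was hex-obfuscated) is reduced to its leading integer.
# Prints 0 when NAME is not listed.
pdfid_count() {
    awk -v name="$1" '$1 == name { c = $2 + 0 } END { print c + 0 }'
}
```

Usage: pdfid.py document.pdf | pdfid_count /JS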

Detailed Analysis

Comprehensive Scanning

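# List every name found, not only the default set
pdfid.py -a document.pdf

# Include extra data (dates, entropy, %%EOF counts)
pdfid.py -e document.pdf

# Run a plugin (plugin_triage ships with the suite; keep it next to pdfid.py)
pdfid.py -p plugin_triage document.pdf

# pdfid only counts names; inspect the objects themselves with pdf-parser
pdf-parser.py --search openaction document.pdf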

JavaScript Detection

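# Scan for JavaScript (pdfid reports /JS and /JavaScript on separate lines)
pdfid.py document.pdf | grep -E "/JS|/JavaScript"

# Locate the objects that reference JavaScript (pdfid reports counts only)
pdf-parser.py --search javascript document.pdf

# Dump one of those objects with its stream decompressed
# (replace 5 with an object ID found by the search above)
pdf-parser.py --object 5 --filter --raw document.pdf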

Suspicious Object Analysis

# Check for launch/execution objects
pdfid.py document.pdf | grep -E "/Launch|/OpenAction|/AA"

# Detect embedded files
pdfid.py document.pdf | grep -E "/EmbeddedFile|/ObjStm"

# Check encryption status (a nonzero /Encrypt count means the document is encrypted)
pdfid.py document.pdf | grep "/Encrypt"

Advanced Scanning Options

Entropy Analysis

# Show extra data, including entropy figures for the file
pdfid.py -e document.pdf

# Entropy is reported in bits per byte (0 to 8) for the whole file,
# for data inside streams, and for data outside streams
# High entropy: compressed or encrypted data (normal inside streams)
# Low entropy: plain text or repetitive data

# Show only the entropy lines
pdfid.py -e document.pdf | grep -i "entropy"
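
Data outside streams is mostly plain PDF syntax, so an unusually high value there can be worth a second look. The sketch below checks that one figure against a threshold. It assumes the -e output contains a line of the form "Entropy outside streams: <value> (<n> bytes)"; the function name and any threshold you pass are hypothetical choices, not pdfid features.

```shell
# entropy_outside_exceeds THRESHOLD: read `pdfid.py -e` output on stdin and
# succeed (exit 0) when the "Entropy outside streams" value is greater than
# THRESHOLD. Fails when the value is lower or the line is absent.
# Assumes the line layout "Entropy outside streams: <value> (<n> bytes)".
entropy_outside_exceeds() {
    awk -F: -v t="$1" '
        /Entropy outside streams/ { if ($2 + 0 > t + 0) found = 1 }
        END { exit !found }
    '
}
```

Usage: pdfid.py -e document.pdf | entropy_outside_exceeds 6 && echo "Review data outside streams"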

Object Stream Analysis

# Analyze object streams (pdfid does not decompress them, so names stored
# inside are not counted)
pdfid.py document.pdf | grep "/ObjStm"

# Print the number of object streams
pdfid.py document.pdf | awk '$1 == "/ObjStm" { print $2 + 0 }'

# Detailed object stream inspection (pdf-parser can parse /ObjStm contents)
pdf-parser.py --objstm --stats document.pdf

Obfuscation Detection

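# Names that can hide content from a keyword scan
pdfid.py document.pdf | grep -E "/ObjStm|/Encrypt"

# List stream filters in use (filter names appear only with -a)
pdfid.py -a document.pdf | grep -E "/Filter|Decode"

# Hex-obfuscated names (such as /J#53) show a second count in parentheses, e.g. 1(1)
pdfid.py document.pdf | grep "("

# Score name obfuscation with a plugin from the suite
pdfid.py -p plugin_nameobfuscation document.pdf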
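
When a name was written with hex escapes, pdfid shows the obfuscated occurrences in parentheses after the total, for example "/JavaScript 1(1)". The following sketch lists only those names; obfuscated_names is a hypothetical helper, and it assumes the default "name count" layout.

```shell
# obfuscated_names: read pdfid output on stdin and print one line per name
# whose count carries an obfuscation suffix, as "name total obfuscated".
obfuscated_names() {
    awk 'match($2, /^[0-9]+\([0-9]+\)$/) {
        # "12(3)" splits on the parentheses into "12", "3", ""
        split($2, parts, /[()]/)
        print $1, parts[1], parts[2]
    }'
}
```

Usage: pdfid.py document.pdf | obfuscated_names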

Batch Analysis

Process Multiple Files

#!/bin/bash
# Scan all PDFs in directory and show the header plus key counts

for pdf in *.pdf; do
    echo "=== $pdf ==="
    pdfid.py "$pdf" | grep -E "PDF Header|/Encrypt|/JS|/JavaScript|/Launch"
done
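
For a large directory, one summary line per file is easier to skim than the full listing. The sketch below condenses pdfid output to the names with a nonzero count; pdfid_nonzero is a hypothetical helper and assumes the default "name count" layout. The "/Colors > 2^24" line has a different shape and is deliberately ignored here.

```shell
# pdfid_nonzero: read pdfid output on stdin and print every slash-prefixed
# name with a nonzero count as "name=count", all on one line.
# Prints "none" when no such name is found.
pdfid_nonzero() {
    awk '
        /^ *\// && $2 + 0 > 0 { out = out (out == "" ? "" : " ") $1 "=" ($2 + 0) }
        END { print (out == "" ? "none" : out) }
    '
}
```

Usage inside the loop above: echo "$pdf: $(pdfid.py "$pdf" | pdfid_nonzero)"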

Generate Risk Report

#!/bin/bash
# Create risk assessment report

OUTPUT="risk_report.txt"
echo "PDF Risk Assessment Report" > "$OUTPUT"
echo "Generated: $(date)" >> "$OUTPUT"
echo "" >> "$OUTPUT"

for pdf in *.pdf; do
    result=$(pdfid.py "$pdf" 2>&1)

    # pdfid prints no verdict, so flag the file when any high-risk name
    # has a nonzero count (output lines are "name count")
    hits=$(echo "$result" | awk '$1 ~ /^\/(JS|JavaScript|Launch|AA)$/ && $2 + 0 > 0')
    if [ -n "$hits" ]; then
        echo "HIGH RISK: $pdf" >> "$OUTPUT"
        echo "$hits" >> "$OUTPUT"
        echo "" >> "$OUTPUT"
    fi
done

cat "$OUTPUT"

Automated Threat Classification

#!/bin/bash
# Classify PDFs by threat level

SAFE_DIR="safe"
SUSPICIOUS_DIR="suspicious"
MALICIOUS_DIR="malicious"

mkdir -p $SAFE_DIR $SUSPICIOUS_DIR $MALICIOUS_DIR

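for pdf in *.pdf; do
    result=$(pdfid.py "$pdf")

    # count NAME: print the count for NAME (output lines are "name count";
    # a value like "1(1)" is reduced to 1, and a missing name gives 0)
    count() { echo "$result" | awk -v n="$1" '$1 == n { c = $2 + 0 } END { print c + 0 }'; }
    js=$(count /JS)
    launch=$(count /Launch)
    objstm=$(count /ObjStm)
    encrypt=$(count /Encrypt)

    if [ "$js" -gt 0 ] || [ "$launch" -gt 0 ]; then
        mv "$pdf" "$MALICIOUS_DIR/"
    elif [ "$objstm" -gt 0 ] || [ "$encrypt" -gt 0 ]; then
        mv "$pdf" "$SUSPICIOUS_DIR/"
    else
        # No indicators found; this is not proof that the file is harmless
        mv "$pdf" "$SAFE_DIR/"
    fi
done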

Integration with Other Tools

Combine with pdf-parser

# Use pdfid to identify issues
pdfid.py suspicious.pdf

# Then use pdf-parser to summarize the object types in the file
pdf-parser.py -a suspicious.pdf

# Extract specific objects
pdf-parser.py -o 5 suspicious.pdf

Multi-Stage Triage Workflow

#!/bin/bash
# Multi-stage PDF security analysis

PDF="$1"

echo "1. Initial scan with pdfid"
pdfid.py "$PDF"

echo "2. Detailed structure analysis"
pdf-parser.py "$PDF" | head -50

echo "3. JavaScript objects (if present)"
pdf-parser.py --search javascript "$PDF"

VirusTotal Integration

#!/bin/bash
# Check PDF against VirusTotal

PDF="$1"

# First, scan locally with pdfid
echo "Local Analysis:"
pdfid.py "$PDF"

# Look up the file's SHA-256 on VirusTotal (the file itself is not uploaded; requires API key)
hash=$(sha256sum "$PDF" | awk '{print $1}')
curl "https://www.virustotal.com/api/v3/files/$hash" \
  -H "x-apikey: YOUR_API_KEY"

Risk Assessment Framework

Scoring System

#!/bin/bash
# Risk assessment based on elements found

PDF="$1"
SCORE=0

# Run pdfid once; each output line is "name count"
RESULT=$(pdfid.py "$PDF")
# count NAME: print the count for NAME (0 if absent; "1(1)" becomes 1)
count() { echo "$RESULT" | awk -v n="$1" '$1 == n { c = $2 + 0 } END { print c + 0 }'; }

# Critical elements (50 points each)
JS=$(count /JS)
LAUNCH=$(count /Launch)
JBIG=$(count /JBIG2Decode)

SCORE=$((SCORE + JS*50 + LAUNCH*50 + JBIG*50))

# High risk elements (25 points each)
OPENACTION=$(count /OpenAction)
AA=$(count /AA)

SCORE=$((SCORE + OPENACTION*25 + AA*25))

echo "Risk Score: $SCORE"
if [ "$SCORE" -gt 100 ]; then
    echo "Status: MALICIOUS"
elif [ "$SCORE" -gt 50 ]; then
    echo "Status: SUSPICIOUS"
else
    echo "Status: LOW RISK"
fi
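
The same 50-point and 25-point weights can be applied in a single awk pass over saved pdfid output, which avoids one subshell per name. This is a sketch under the same assumptions (default "name count" layout); pdfid_score is a hypothetical helper and the weights are this document's example values, not something pdfid defines.

```shell
# pdfid_score: read pdfid output on stdin and print a weighted score.
# Weights mirror the example scoring above: 50 for critical names,
# 25 for high-risk names. Names without a weight are ignored.
pdfid_score() {
    awk '
        BEGIN {
            w["/JS"] = 50; w["/Launch"] = 50; w["/JBIG2Decode"] = 50
            w["/OpenAction"] = 25; w["/AA"] = 25
        }
        ($1 in w) { score += ($2 + 0) * w[$1] }
        END { print score + 0 }
    '
}
```

Usage: pdfid.py document.pdf | pdfid_score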

Common Threat Patterns

Malware Indicators

# Check for typical malware patterns
echo "=== Malware Detection Signatures ==="

result=$(pdfid.py document.pdf)
# has NAME: succeeds only when NAME has a nonzero count. A plain grep is not
# enough, because pdfid lists the default names even when their count is 0.
has() { echo "$result" | awk -v n="$1" '$1 == n && $2 + 0 > 0 { f = 1 } END { exit !f }'; }

# Pattern 1: JavaScript + OpenAction (auto-execute)
{ has /JS || has /JavaScript; } && has /OpenAction && \
echo "THREAT: Auto-executing JavaScript detected"

# Pattern 2: Embedded file combined with a launch action
has /EmbeddedFile && has /Launch && \
echo "THREAT: Possible embedded payload with launch action"

# Pattern 3: JBIG2 filter (used in exploits such as CVE-2009-0658)
has /JBIG2Decode && \
echo "THREAT: JBIG2 filter present (see CVE-2009-0658)"

# Pattern 4: Many object streams
echo "$result" | awk '$1 == "/ObjStm" && $2 + 0 > 5 { print "THREAT: Excessive object streams (obfuscation)" }'

Ransomware Delivery

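# Delivery documents often rely on links and form actions; check for these

# Suspicious form submission (/SubmitForm is listed only with -a)
pdfid.py -a document.pdf | grep "/SubmitForm"

# External URL/callback (pdfid shows name counts, not the URLs themselves)
pdfid.py -a document.pdf | grep -E "/URI|/GoToR|/GoToE"
strings document.pdf | grep -Eio "(https?|ftp)://[^ )>]+"

# Literal keyword search (weak indicator; real payloads are usually compressed or encoded)
strings document.pdf | grep -iE "shellcode|payload|exploit"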

Troubleshooting

Common Issues

Permission Denied:

# Check file permissions
ls -la document.pdf

# Fix permissions
chmod 644 document.pdf

Corrupted PDF Detection:

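# pdfid scans for strings rather than parsing, so it usually still works on damaged files
pdfid.py corrupted.pdf

# Force a scan when the %PDF header is missing
pdfid.py -f corrupted.pdf

# Repair if possible (qpdf attempts recovery automatically when it rewrites
# a file; the rewritten copy no longer matches the original sample's hash)
qpdf corrupted.pdf fixed.pdf
pdfid.py fixed.pdf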

False Positives:

# Some legitimate PDFs trigger alerts
# Verify with manual inspection
pdfid.py document.pdf -v

# Extract and review suspicious objects
pdf-parser.py document.pdf | less

Performance Optimization

Batch Processing

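# Process files in parallel (null-delimited so unusual file names survive;
# output from parallel runs may interleave)
find . -name "*.pdf" -type f -print0 | xargs -0 -P 4 -n 1 pdfid.py

# With output to individual files
find . -name "*.pdf" -type f -print0 | while IFS= read -r -d '' pdf; do
    pdfid.py "$pdf" > "${pdf%.pdf}.analysis"
done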

Large-Scale Scanning

#!/bin/bash
# Efficient large-scale PDF scanning
# -n hides zero counts, so only actual hits remain under each file header

time pdfid.py -n *.pdf 2>/dev/null | \
  grep -E "^PDFiD|/Encrypt|/JS|/JavaScript|/Launch" | \
  tee scan_results.txt

echo "Scan complete: $(date)"

Output Formats

Text Output

pdfid.py document.pdf
# Standard human-readable output

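Machine-Readable Output

# pdfid has no documented JSON switch; check pdfid.py --help for your version
# Plugin results can be written as CSV with -c
pdfid.py -c -p plugin_triage document.pdf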

CSV Export

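# Generate CSV from multiple scans (columns: file, /JS count, /Launch count)
echo "file,js,launch" > results.csv
for pdf in *.pdf; do
    result=$(pdfid.py "$pdf")
    js=$(echo "$result" | awk '$1 == "/JS" { print $2 + 0 }')
    launch=$(echo "$result" | awk '$1 == "/Launch" { print $2 + 0 }')
    echo "\"$pdf\",$js,$launch"
done >> results.csv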
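
File names that contain double quotes or commas will break a naive CSV row. The sketch below quotes the file name by doubling embedded quotes, the usual CSV convention, and builds the row from pdfid output in one pass. Both function names and the choice of columns are hypothetical.

```shell
# csv_quote VALUE: wrap VALUE in double quotes, doubling any embedded quotes.
csv_quote() {
    printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# pdfid_csv_row FILE: read pdfid output for FILE on stdin and print one row:
# "file",js,launch,openaction. Assumes the default "name count" layout;
# names that are missing from the output count as 0.
pdfid_csv_row() {
    counts=$(awk '
        $1 == "/JS"         { js = $2 + 0 }
        $1 == "/Launch"     { la = $2 + 0 }
        $1 == "/OpenAction" { oa = $2 + 0 }
        END { printf "%d,%d,%d", js, la, oa }
    ')
    printf '%s,%s\n' "$(csv_quote "$1")" "$counts"
}
```

Usage: pdfid.py "$pdf" | pdfid_csv_row "$pdf" >> results.csv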

Comparison with Alternatives

Tool	PDF Scan	Malware Detection	Speed	Platform
pdfid	Yes (keyword counts)	Indicators only	Fast	Cross-platform
pdf-parser	Yes (object-level)	No	Moderate	Cross-platform
peepdf	Yes	Limited	Slow	Cross-platform (Python)
exiftool	Metadata only	No	Fast	Cross-platform

Resources

Official Site: https://blog.didierstevens.com/programs/pdf-tools/
GitHub: https://github.com/DidierStevens/DidierStevensSuite
PDF Specs: https://www.adobe.io/open/standards/PDFRM.html
Malware Samples: https://www.malware-traffic-analysis.net/

Legal and Ethical Considerations

Proper Use

Analyze only PDFs you are authorized to examine
Use in isolated/sandbox environments for suspicious files
Document analysis methodology
Comply with local regulations

Caution

Do not open suspicious PDFs in standard viewers
Use virtual machines for high-risk analysis
Implement proper logging and documentation
Follow organizational security policies