Metagoofil
Metagoofil is an OSINT reconnaissance tool that finds public documents (PDF, Word, Excel, PowerPoint) on a target domain and extracts their hidden metadata to reveal usernames, software versions, and email addresses.
Installation
Linux/Ubuntu
# Clone repository
git clone https://github.com/laramies/metagoofil.git
cd metagoofil
# Install dependencies
pip3 install -r requirements.txt
# Install exiftool (for metadata parsing)
sudo apt-get install libimage-exiftool-perl
# Make executable
chmod +x metagoofil.py
sudo ln -s "$(pwd)/metagoofil.py" /usr/local/bin/metagoofil
macOS
# Install via Homebrew
brew install exiftool
# Clone and install
git clone https://github.com/laramies/metagoofil.git
cd metagoofil
pip3 install -r requirements.txt
Windows
# Install exiftool
choco install exiftool
# Or from Scoop
scoop install exiftool
# Clone and install
git clone https://github.com/laramies/metagoofil.git
cd metagoofil
pip3 install -r requirements.txt
Command-Line Options
| Option | Description |
|---|---|
| `-d, --domain <DOMAIN>` | Target domain to search |
| `-t, --file-type <TYPE>` | File type to search (pdf, doc, docx, xls, xlsx, ppt, pptx) |
| `-l, --limit <NUM>` | Maximum results per file type (default: 100) |
| `-n, --threads <NUM>` | Number of threads for downloading |
| `-o, --output <FILE>` | Output file for results |
| `-f, --format <FORMAT>` | Output format (html, pdf, txt) |
| `-s, --search <ENGINE>` | Search engine (google, bing, yahoo) |
| `-p, --proxy` | HTTP proxy address |
| `-u, --user-agent <UA>` | Custom User-Agent |
| `-v, --verbose` | Verbose output |
| `-r, --report` | Generate HTML report |
Installation via Package Managers
Linux/Ubuntu
# Package manager installation
sudo apt update
sudo apt install metagoofil
# Alternative installation
chmod +x metagoofil-linux
sudo mv metagoofil-linux /usr/local/bin/metagoofil
# Run from source (Metagoofil is a Python script, so no build step is needed)
cd metagoofil
pip3 install -r requirements.txt
macOS
# Homebrew installation
brew install metagoofil
# MacPorts installation
sudo port install metagoofil
# Manual installation
chmod +x metagoofil
sudo mv metagoofil /usr/local/bin/
Windows
# Chocolatey installation
choco install metagoofil
# Scoop installation
scoop install metagoofil
# Winget installation
winget install metagoofil
# Manual installation
# Extract and add to PATH
Basic Usage
Simple Domain Scan
# Scan target domain for all document types
python3 metagoofil.py -d target.com -t pdf,doc,docx,xls,xlsx,ppt,pptx
# Scan with limited results
python3 metagoofil.py -d target.com -t pdf -l 50
# Scan with custom threads
python3 metagoofil.py -d target.com -t docx -n 10
# Save results to file
python3 metagoofil.py -d target.com -t pdf -o results.html -r
Specific File Type Searches
# PDF documents only
python3 metagoofil.py -d target.com -t pdf
# Microsoft Word documents
python3 metagoofil.py -d target.com -t docx
# Excel spreadsheets
python3 metagoofil.py -d target.com -t xlsx
# PowerPoint presentations
python3 metagoofil.py -d target.com -t pptx
# All Microsoft Office formats
python3 metagoofil.py -d target.com -t "doc,docx,xls,xlsx,ppt,pptx"
Advanced Techniques
Multiple File Type Scanning
# Scan for multiple document types with limits
python3 metagoofil.py -d target.com -t pdf,doc,docx -l 100 -n 10
# Verbose output with custom thread count
python3 metagoofil.py -d target.com -t "xls,xlsx,ppt,pptx" -v -n 15
# Multiple domains
for domain in target1.com target2.com target3.com; do
  python3 metagoofil.py -d "$domain" -t pdf -o "${domain}_results.html"
done
Metadata Extraction
# Extract metadata from downloaded documents
exiftool *.pdf
# Extract specific metadata fields
exiftool -Author -Creator -Company *.pdf
# Extract from Word documents
exiftool *.docx
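# Find all files by a specific author (replace "Jane Doe" with the name of interest)
exiftool -if '$Author =~ /Jane Doe/i' -FileName -Author *.pdf
# List all metadata, including duplicate tags, prefixed with group names
exiftool -a -G1 *.pdf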
Advanced Search and Analysis
# Search with custom User-Agent to bypass filtering
python3 metagoofil.py -d target.com -t pdf --user-agent "Mozilla/5.0"
# Use proxy for anonymous searching
python3 metagoofil.py -d target.com -t docx --proxy http://127.0.0.1:8080
# Search using different search engine
python3 metagoofil.py -d target.com -t pdf -s bing
# High-thread count for faster completion
python3 metagoofil.py -d target.com -t "pdf,doc,docx,xls,xlsx" -n 50
Report Generation
# Generate comprehensive HTML report
python3 metagoofil.py -d target.com -t pdf,doc,docx -r -o report.html
# Generate report with verbose output
python3 metagoofil.py -d target.com -t "xls,xlsx,ppt,pptx" -r -v
# Single-domain report with an explicit HTML output format
python3 metagoofil.py -d company.com -t pdf -r -f html -o company_osint.html
Document Type Reference
Supported Formats
# PDF documents (common for reports, whitepapers)
# Contains: Creator software, Author, Creation date, Subject
python3 metagoofil.py -d target.com -t pdf
# Word documents (DOC, DOCX)
# Contains: Author, Company, Last saved by, Software version
python3 metagoofil.py -d target.com -t "doc,docx"
# Excel spreadsheets (XLS, XLSX)
# Contains: Author, Company, Last saved by, Comments
python3 metagoofil.py -d target.com -t "xls,xlsx"
# PowerPoint presentations (PPT, PPTX)
# Contains: Author, Company, Software, Creation tools
python3 metagoofil.py -d target.com -t "ppt,pptx"
Metadata Extraction Details
# Extract all metadata from PDFs
exiftool -a *.pdf
# Extract specific PDF metadata
exiftool -Title -Author -Creator -Subject -CreateDate *.pdf
# Extract Word document metadata (DOCX stores the author under Creator)
exiftool -Title -Author -Creator -Company -LastModifiedBy -Application -AppVersion *.docx
# Find creation software (version info)
exiftool -Producer -Creator -Software *.pdf
# Export metadata to CSV
exiftool -csv *.pdf > metadata.csv
# Filter for usernames and authors
exiftool -Author -LastSavedBy -Creator *.pdf | grep -v "^$"
Email and Username Harvesting
Harvesting Emails from Documents
# Extract emails from PDF metadata
exiftool -a *.pdf | grep -i "email\|@"
# Find all owner information in documents
exiftool -Author -Owner -Company *.pdf *.docx
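# Extract from Word documents (DOCX is a ZIP archive, so read the XML inside it)
for f in *.docx; do unzip -p "$f" 'docProps/*.xml' 'word/document.xml'; done | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" | sort -u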
# Combine all metadata into single file
exiftool -csv *.pdf *.docx *.xlsx > all_metadata.csv
# Parse for usernames (the part before @)
exiftool -a *.pdf | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+" | cut -d'@' -f1 | sort -u
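The email pipelines above can be wrapped in one reusable filter. This is a minimal sketch and not part of Metagoofil or exiftool: `extract_emails` is a hypothetical helper name, and it assumes the metadata has already been dumped as plain text (for example by `exiftool -a`).

```shell
#!/bin/sh
# Hypothetical helper (not part of Metagoofil or exiftool).
# Reads any text on stdin and prints each distinct email address once,
# lowercased so that Jane@X.com and jane@x.com are not counted twice.
extract_emails() {
  grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' \
    | tr '[:upper:]' '[:lower:]' \
    | LC_ALL=C sort -u
}
```

A typical call would be `exiftool -a *.pdf | extract_emails > emails.txt`. Requiring a top-level domain in the pattern avoids matching fragments such as `user@host` that appear in some producer strings.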
Username and Software Discovery
# Find all authors (potential usernames); -s3 prints values only, -q hides per-file headers
exiftool -q -s3 -Author *.pdf *.docx | grep -v "^$" | sort -u
# Discover software versions
exiftool -Creator -Producer *.pdf | grep -i "adobe\|microsoft\|openoffice"
# Find last editor information
exiftool -LastSavedBy *.docx
# Identify company names
exiftool -Company *.docx *.xlsx
# Extract creation timestamps
exiftool -CreateDate -ModifyDate *.pdf
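A bare version number is hard to act on without the product it belongs to. The sketch below keeps the product name next to its dotted version and counts how often each pair occurs. `product_versions` is a hypothetical helper name, and the pattern is an assumption: it only recognizes a run of ASCII words followed by a dotted number, such as `LibreOffice 7.3`.

```shell
#!/bin/sh
# Hypothetical helper (not part of exiftool).
# Reads exiftool-style "Tag : value" text on stdin and prints
# "<count> <product> <version>" lines, most frequent first.
# Only ASCII product names followed by a dotted version are recognized.
product_versions() {
  LC_ALL=C grep -oE '[A-Za-z][A-Za-z ]*[A-Za-z] [0-9]+(\.[0-9]+)+' \
    | LC_ALL=C sort \
    | uniq -c \
    | sort -rn
}
```

For example, `exiftool -Creator -Producer *.pdf | product_versions` would summarize which tools produced the collected PDFs.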
Data Analysis and Filtering
# Export structured metadata
exiftool -csv *.pdf > documents.csv
# Filter high-value metadata
exiftool -a *.pdf | grep -E "Author:|Creator:|Producer:|Subject:"
# Find internal file paths (may reveal usernames/structure)
exiftool -a *.docx | grep -i "path\|directory"
# Look for version information
exiftool -All *.pdf | grep -i "version"
# Create summary report (-q hides per-file headers so they are not counted)
exiftool -q -s -Author -Creator -Company *.pdf *.docx | sort | uniq -c | sort -rn
Real-World OSINT Scenarios
Complete Company Reconnaissance
# 1. Search for all document types
python3 metagoofil.py -d company.com -t pdf,doc,docx,xls,xlsx,ppt,pptx -l 100
# 2. Extract all metadata from results
exiftool -a *.pdf *.docx *.xlsx > company_metadata.txt
# 3. Parse for employees
grep -i "author\|creator\|lastmodifiedby" company_metadata.txt | sort | uniq
# 4. Look for internal structure
exiftool -a *.docx | grep -E "template|path|directory"
# 5. Identify software versions
exiftool -Creator -Producer *.pdf | grep -oE "[0-9]+\.[0-9]+" | sort | uniq
# 6. Generate final report
echo "Employees found:" && \
exiftool -q -s3 -Author *.docx *.pdf | grep -v "^$" | sort | uniq -c | sort -rn
Internal Network Mapping via Documents
# Search for technical documents
python3 metagoofil.py -d target.com -t pdf,docx -l 50
# Extract internal paths from documents
strings *.docx *.pdf | grep -E "\\\\Users\\\\|C:\\\\|/home/|/opt/" | sort | uniq
# Find references to internal systems
exiftool -a *.pdf | grep -i "server\|domain\|network"
# Identify software tools in use
exiftool -All *.docx | grep -i "software\|product\|version"
# Extract email patterns
exiftool -a *.pdf *.docx | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" | sort | uniq
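Internal paths often embed the account name of whoever saved the file. As a rough sketch, the filter below pulls the path component that follows `\Users\`, `/Users/`, or `/home/`. `users_from_paths` is a hypothetical helper name, and the three prefixes are assumptions about common home-directory layouts.

```shell
#!/bin/sh
# Hypothetical helper: extract account names from home-directory paths.
# Reads text on stdin (for example the output of strings or exiftool)
# and prints each distinct name found after \Users\, /Users/ or /home/.
users_from_paths() {
  LC_ALL=C grep -oE '(\\Users\\|/Users/|/home/)[^\\/[:space:]]+' \
    | sed -E 's#^(\\Users\\|/Users/|/home/)##' \
    | LC_ALL=C sort -u
}
```

Generic entries such as `Public` or `Default` can show up in the result and are worth filtering out by hand.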
Vulnerability Disclosure Research
# Find technical documentation
python3 metagoofil.py -d target.com -t pdf -l 200
# Search for version information in PDFs
exiftool -Producer -Creator *.pdf | grep -oE "Adobe|Microsoft|OpenOffice" | sort | uniq -c
# Look for creation dates (indicates age of documentation)
exiftool -CreateDate *.pdf | sort
# Find modification timestamps (active development indicators)
exiftool -ModifyDate *.docx | grep "$(date +%Y)" | wc -l
# Identify deprecated software versions
exiftool -All *.pdf | grep -i "version" | grep -E "2010|2012|2013" | wc -l
Employee Discovery and Profiling
# Comprehensive employee search
python3 metagoofil.py -d company.com -t "doc,docx,pdf" -l 100 -v
# Extract unique authors (-s3 prints values only, -q hides per-file headers)
exiftool -q -s3 -Author *.pdf *.docx | grep -v "^$" | sort -u > employees.txt
# Create username list from authors
tr ' ' '\n' < employees.txt | sort -u > potential_usernames.txt
# Identify departments by document type and author
exiftool -csv -FileType -Author -Company *.pdf *.docx | tail -n +2 | sort -u
# Find email addresses in document metadata
exiftool -All *.pdf *.docx | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" | sort -u > emails.txt
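Splitting author names on spaces yields single words, not login-style usernames. The sketch below turns "First Last" names into a few common candidate formats. `name_to_usernames` is a hypothetical helper name, and the three formats it emits (first.last, flast, firstl) are illustrative assumptions, not a list confirmed for any particular organization.

```shell
#!/bin/sh
# Hypothetical helper: derive username candidates from full names.
# stdin:  one name per line, e.g. "Jane Doe"
# stdout: first.last, flast and firstl candidates, deduplicated.
# Single-word names are skipped; non-ASCII letters are dropped.
name_to_usernames() {
  tr '[:upper:]' '[:lower:]' \
    | LC_ALL=C awk 'NF >= 2 {
        first = $1; last = $NF
        gsub(/[^a-z]/, "", first)
        gsub(/[^a-z]/, "", last)
        if (first == "" || last == "") next
        print first "." last
        print substr(first, 1, 1) last
        print first substr(last, 1, 1)
      }' \
    | LC_ALL=C sort -u
}
```

Something like `name_to_usernames < employees.txt > username_candidates.txt` would feed on the author list produced above.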
Troubleshooting
Common Issues
Issue: No results found
# Verify domain syntax
python3 metagoofil.py -d target.com -t pdf -v
# Check if domain has public documents (opens the search in the default browser)
xdg-open "https://www.google.com/search?q=site:target.com+filetype:pdf"
# Try different search engines
python3 metagoofil.py -d target.com -t pdf -s bing
python3 metagoofil.py -d target.com -t pdf -s yahoo
# Increase result limit
python3 metagoofil.py -d target.com -t pdf -l 200
Issue: Slow download speed
# Increase thread count
python3 metagoofil.py -d target.com -t pdf -n 50
# Reduce file type scope
python3 metagoofil.py -d target.com -t pdf # Single type is faster
# Use proxy for distributed access
python3 metagoofil.py -d target.com -t pdf --proxy http://proxy:8080
Issue: Exiftool not finding metadata
# Verify exiftool installation
exiftool -ver
# Check file format
file *.pdf
# Try alternative metadata tools
strings *.pdf | grep -i "author"
# Use alternative tool: pdfinfo
pdfinfo *.pdf | grep -i "author"
Issue: Permission errors
# Run with proper permissions
chmod +x metagoofil.py
# Create output directory if missing
mkdir -p ./documents
# Fix exiftool permissions (resolve the installed path instead of assuming /usr/bin)
sudo chmod +x "$(command -v exiftool)"
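Several of the issues above come down to a missing program. A quick pre-flight check can list what is absent before a run starts. This is a minimal sketch: `missing_tools` is a hypothetical helper name, and the tool names passed to it are whatever a given workflow needs.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed tool
# that cannot be found on PATH. Prints nothing if all are present.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For the commands used in this guide, a call might look like `missing_tools python3 exiftool unzip pdfinfo`.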
Advanced Workflows
Automated OSINT Collection
#!/bin/bash
# Complete metadata extraction workflow
TARGET="target.com"
OUTPUT_DIR="osint_results_$(date +%Y%m%d)"
mkdir -p "$OUTPUT_DIR"
# 1. Search for all document types
echo "[*] Searching for documents on $TARGET..."
python3 metagoofil.py \
  -d "$TARGET" \
  -t pdf,doc,docx,xls,xlsx,ppt,pptx \
  -l 100 \
  -n 10 \
  -o "$OUTPUT_DIR/metagoofil_report.html" \
  -r
# 2. Extract metadata from all files
echo "[*] Extracting metadata..."
exiftool -csv documents/* > "$OUTPUT_DIR/all_metadata.csv"
# 3. Extract unique authors/creators
echo "[*] Harvesting users..."
exiftool -q -s3 -Author documents/* | grep -v "^$" | sort -u > "$OUTPUT_DIR/users.txt"
# 4. Find emails
echo "[*] Harvesting emails..."
exiftool -All documents/* | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+" | sort | uniq > "$OUTPUT_DIR/emails.txt"
# 5. Find software versions
echo "[*] Identifying software..."
exiftool -Creator -Producer documents/* | grep -oE "[0-9]+\.[0-9]+" | sort | uniq > "$OUTPUT_DIR/versions.txt"
# 6. Generate summary
echo "[*] Creating summary report..."
cat > "$OUTPUT_DIR/summary.txt" << EOF
OSINT Summary for $TARGET
Generated: $(date)
Total Users Found: $(wc -l < "$OUTPUT_DIR/users.txt")
Total Emails Found: $(wc -l < "$OUTPUT_DIR/emails.txt")
Software Versions: $(cat "$OUTPUT_DIR/versions.txt")
See metagoofil_report.html for full details
EOF
echo "[+] Results saved to $OUTPUT_DIR/"
Integration with Password Lists
# Extract usernames for brute force attacks
python3 metagoofil.py -d target.com -t pdf,docx
# Create wordlist from discovered users
exiftool -Author *.pdf *.docx | grep -v "^$" | sed 's/ /_/g' | sort | uniq > discovered_users.txt
# Create email list for spraying
exiftool -All *.pdf *.docx | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+" | sort | uniq > emails_for_spray.txt
# Combine with password generator (e.g., CUPP)
python3 cupp.py -i -o passwords.txt
# Use with hydra for credential testing
hydra -L discovered_users.txt -P passwords.txt ssh://target.com
Batch Processing Multiple Domains
#!/bin/bash
# Process multiple targets
DOMAINS="target1.com target2.com target3.com"
DATE=$(date +%Y%m%d)
RESULTS_DIR="batch_osint_${DATE}"
mkdir -p "$RESULTS_DIR"
for domain in $DOMAINS; do
  echo "[*] Processing $domain..."
  # Run metagoofil
  python3 metagoofil.py \
    -d "$domain" \
    -t pdf,docx,xlsx \
    -l 50 \
    -n 10 \
    -o "$RESULTS_DIR/${domain}_report.html" \
    -r
  # Extract results (one CSV per domain, so header rows are not repeated mid-file)
  exiftool -csv documents/* > "$RESULTS_DIR/${domain}_metadata.csv"
  exiftool -q -s3 -Author documents/* >> "$RESULTS_DIR/all_users.txt"
done
# Create consolidated summary
sort "$RESULTS_DIR/all_users.txt" | uniq | tee "$RESULTS_DIR/summary_users.txt"
Comparison with Similar Tools
| Tool | Purpose | Strengths |
|---|---|---|
| Metagoofil | Automated metadata extraction from public docs | Fast, multi-format, easy search |
| Exiftool | Manual metadata parsing | Detailed, cross-platform, comprehensive |
| pdfinfo | PDF-specific metadata | PDF-focused, lightweight |
| strings | Binary analysis of documents | Raw content extraction |
| fuxploider | File upload vulnerability scanning | Automates upload testing; does not extract metadata |
Best Practices
OSINT Collection
- Start with limited results (-l 50) and increase if needed
- Use multiple file types to capture different document creation tools
- Re-run extraction periodically, since published documents change and extraction tools are updated frequently
- Save results with timestamps for tracking changes
- Cross-reference multiple sources for validation
- Use verbose mode during reconnaissance
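Saving timestamped results pays off when two runs can be compared. The sketch below prints entries that appear in a newer list but not in an older one. `new_entries` is a hypothetical helper name, and both arguments are assumed to be plain text files with one entry per line, such as the `users.txt` files written by the collection script.

```shell
#!/bin/sh
# Hypothetical helper: show lines present in NEW_FILE but not in OLD_FILE.
# Usage: new_entries OLD_FILE NEW_FILE
# Both files are sorted into temporary copies because comm needs sorted input.
new_entries() {
  _old_sorted=$(mktemp) || return 1
  _new_sorted=$(mktemp) || { rm -f "$_old_sorted"; return 1; }
  LC_ALL=C sort -u "$1" > "$_old_sorted"
  LC_ALL=C sort -u "$2" > "$_new_sorted"
  # -13 hides lines unique to the old file and lines common to both
  LC_ALL=C comm -13 "$_old_sorted" "$_new_sorted"
  rm -f "$_old_sorted" "$_new_sorted"
}
```

For instance, `new_entries osint_results_20250301/users.txt osint_results_20250330/users.txt` would list authors that surfaced only in the later run; the directory names here are illustrative.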
Email and Username Harvesting
- Focus on recent documents (within 1 year)
- Verify usernames against company directory
- Look for patterns (firstname.lastname, flastname)
- Check for internal email domains
- Validate emails against known corporate domains
- Track software versions for potential vulnerabilities
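Validating emails against a corporate domain can be done with a small filter. This sketch keeps addresses whose domain is the given one or a subdomain of it. `filter_by_domain` is a hypothetical helper name, and `example.com` in the usage line stands in for the real corporate domain.

```shell
#!/bin/sh
# Hypothetical helper: keep only emails under a given domain.
# Usage: filter_by_domain DOMAIN < emails.txt
# Matches DOMAIN itself and any subdomain (mail.DOMAIN), case-insensitively.
filter_by_domain() {
  awk -v d="$1" '
    BEGIN { d = tolower(d) }
    {
      e = tolower($0)
      n = split(e, parts, "@")
      if (n != 2) next
      host = parts[2]
      # suffix holds the last length(d)+1 characters of host
      suffix = substr(host, length(host) - length(d))
      if (host == d || suffix == "." d) print e
    }'
}
```

A run such as `filter_by_domain example.com < emails.txt` drops third-party addresses (vendors, PDF tool authors) that often appear in metadata. Comparing the suffix with a leading dot keeps look-alike domains such as `notexample.com` out of the result.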
Legal and Ethical Considerations
- Only search public, indexed documents
- Respect robots.txt and site policies
- Do not access documents requiring authentication
- Obtain authorization before reconnaissance
- Document all findings with timestamps
- Follow responsible disclosure practices
- Do not exploit discovered information without consent
Performance Optimization
- Use threading appropriate for target (10-20 typical)
- Search for file types separately if results are limited
- Cache results for repeated queries
- Use proxy rotation for large-scale searches
- Monitor rate limiting indicators
- Implement delays between domain searches
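Delays and rate-limit handling can be combined in a retry wrapper that waits longer after each failure. The following is a sketch under stated assumptions: `retry_with_backoff` is a hypothetical helper name, it treats any non-zero exit status as a failure worth retrying, and it assumes delays are whole seconds.

```shell
#!/bin/sh
# Hypothetical helper: run a command, retrying with a doubling delay.
# Usage: retry_with_backoff MAX_TRIES BASE_DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 once MAX_TRIES attempts have failed.
retry_with_backoff() {
  _max_tries=$1
  _delay=$2
  shift 2
  _attempt=1
  while :; do
    "$@" && return 0
    [ "$_attempt" -ge "$_max_tries" ] && return 1
    sleep "$_delay"
    _delay=$((_delay * 2))
    _attempt=$((_attempt + 1))
  done
}
```

With these assumptions, `retry_with_backoff 4 30 python3 metagoofil.py -d target.com -t pdf` would wait 30, 60, and then 120 seconds between attempts. Whether a blocked search actually produces a non-zero exit status depends on the tool version, so that behavior should be checked before relying on the wrapper.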
Reporting
HTML Report Generation
# Generate comprehensive HTML report
python3 metagoofil.py -d target.com -t pdf,docx -r -o report.html
# View report
firefox report.html
# or
open report.html # macOS
# or
start report.html # Windows
CSV Export Analysis
# Export all metadata to CSV
exiftool -csv *.pdf *.docx *.xlsx > metadata.csv
# View in spreadsheet
libreoffice metadata.csv
# Analyze with command line
awk -F',' '{print $2}' metadata.csv | sort | uniq -c | sort -rn
Creating Summary Reports
# Count documents by author (-s3 prints values only, -q hides per-file headers)
echo "=== Author Summary ===" && \
exiftool -q -s3 -Author *.pdf *.docx | grep -v "^$" | sort | uniq -c | sort -rn
# Count by creation year
echo "=== Document Age ===" && \
exiftool -q -s3 -d %Y -CreateDate *.pdf | sort | uniq -c
# Software identification
echo "=== Software Used ===" && \
exiftool -q -s3 -Creator *.pdf | sort | uniq -c | sort -rn
# Combined report
cat > report.txt << EOF
OSINT Report for $(date)
=======================
Total Documents: $(ls *.pdf *.docx 2>/dev/null | wc -l)
Unique Authors: $(exiftool -Author *.pdf *.docx 2>/dev/null | grep -v "^$" | sort | uniq | wc -l)
Email Addresses: $(exiftool -All *.pdf *.docx 2>/dev/null | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+" | sort | uniq | wc -l)
Software Versions Found: $(exiftool -Creator *.pdf 2>/dev/null | sort | uniq)
EOF
cat report.txt
Resources and References
Official Documentation
- Metagoofil GitHub repository: https://github.com/laramies/metagoofil
Related Tools
- Exiftool - Detailed metadata extraction
- Google Dorking - Document discovery
- Shodan - Service discovery
Last updated: 2025-03-30 | Metagoofil GitHub