Metagoofil

Metagoofil is an OSINT reconnaissance tool that finds public documents (PDF, Word, Excel, PowerPoint) on a target domain and extracts their hidden metadata to reveal usernames, software versions, and email addresses.

Installation

Linux/Ubuntu

# Clone repository
git clone https://github.com/laramies/metagoofil.git
cd metagoofil

# Install dependencies
pip3 install -r requirements.txt

# Install exiftool for metadata parsing (packaged as libimage-exiftool-perl on Debian/Ubuntu)
sudo apt-get install libimage-exiftool-perl

# Make executable
chmod +x metagoofil.py
sudo ln -s "$(pwd)/metagoofil.py" /usr/local/bin/metagoofil

macOS

# Install via Homebrew
brew install exiftool

# Clone and install
git clone https://github.com/laramies/metagoofil.git
cd metagoofil
pip3 install -r requirements.txt

Windows

# Install exiftool
choco install exiftool

# Or from Scoop
scoop install exiftool

# Clone and install
git clone https://github.com/laramies/metagoofil.git
cd metagoofil
pip3 install -r requirements.txt

Command-Line Options

Option                     Description
-d, --domain <DOMAIN>      Target domain to search
-t, --file-type <TYPE>     File type to search (pdf,doc,docx,xls,xlsx,ppt,pptx)
-l, --limit <NUM>          Maximum results per file type (default: 100)
-n, --threads <NUM>        Number of threads for downloading
-o, --output <FILE>        Output file for results
-f, --format <FORMAT>      Output format (html,pdf,txt)
-s, --search <ENGINE>      Search engine (google, bing, yahoo)
-p, --proxy <PROXY>        HTTP proxy address
-u, --user-agent <UA>      Custom User-Agent
-v, --verbose              Verbose output
-r, --report               Generate HTML report

Option names differ between Metagoofil versions and forks; run python3 metagoofil.py -h to confirm the flags your copy supports.

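Alternative Installation Methods

Availability of the packages below varies by platform and repository; if a package is not found, use the git clone method above.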

Linux/Ubuntu

# Package manager installation
sudo apt update
sudo apt install metagoofil

# Alternative installation
chmod +x metagoofil-linux
sudo mv metagoofil-linux /usr/local/bin/metagoofil

# Build from source
cd metagoofil
make && sudo make install

macOS

# Homebrew installation
brew install metagoofil

# MacPorts installation
sudo port install metagoofil

# Manual installation
chmod +x metagoofil
sudo mv metagoofil /usr/local/bin/

Windows

# Chocolatey installation
choco install metagoofil

# Scoop installation
scoop install metagoofil

# Winget installation
winget install metagoofil

# Manual installation
# Extract and add to PATH

Basic Usage

Simple Domain Scan

# Scan target domain for all document types
python3 metagoofil.py -d target.com -t pdf,doc,docx,xls,xlsx,ppt,pptx

# Scan with limited results
python3 metagoofil.py -d target.com -t pdf -l 50

# Scan with custom threads
python3 metagoofil.py -d target.com -t docx -n 10

# Save results to file
python3 metagoofil.py -d target.com -t pdf -o results.html -r

Specific File Type Searches

# PDF documents only
python3 metagoofil.py -d target.com -t pdf

# Microsoft Word documents
python3 metagoofil.py -d target.com -t docx

# Excel spreadsheets
python3 metagoofil.py -d target.com -t xlsx

# PowerPoint presentations
python3 metagoofil.py -d target.com -t pptx

# All Microsoft Office formats
python3 metagoofil.py -d target.com -t "doc,docx,xls,xlsx,ppt,pptx"
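
A typo in the -t list is easy to miss before a long scan. The sketch below checks a comma-separated list against the file types named in the options table; the function name invalid_types and the SUPPORTED_TYPES variable are hypothetical helpers, not part of Metagoofil.

```shell
# Supported types as listed in the options table of this cheat sheet.
SUPPORTED_TYPES="pdf doc docx xls xlsx ppt pptx"

# Print each unsupported entry from a comma-separated list.
# Returns non-zero if any entry is unsupported.
invalid_types() (
    status=0
    for t in $(printf '%s' "$1" | tr ',' ' '); do
        case " $SUPPORTED_TYPES " in
            *" $t "*) ;;                            # known type, nothing to report
            *) printf '%s\n' "$t"; status=1 ;;      # unknown type
        esac
    done
    return "$status"
)
```

One possible use: invalid_types "pdf,docx" && python3 metagoofil.py -d target.com -t "pdf,docx"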

Advanced Techniques

Multiple File Type Scanning

# Scan for multiple document types with limits
python3 metagoofil.py -d target.com -t pdf,doc,docx -l 100 -n 10

# Verbose output with custom thread count
python3 metagoofil.py -d target.com -t "xls,xlsx,ppt,pptx" -v -n 15

# Multiple domains
for domain in target1.com target2.com target3.com; do
  python3 metagoofil.py -d "$domain" -t pdf -o "${domain}_results.html"
done
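
When the domain list grows, it may help to read it from a file and pause between targets. This is a minimal sketch; run_per_domain is a hypothetical helper that appends each domain as the last argument of whatever command it is given.

```shell
# Run a command once per domain listed on stdin, pausing between domains.
# Usage: run_per_domain DELAY_SECONDS COMMAND [ARGS...]
# The domain is appended as the final argument to COMMAND.
run_per_domain() (
    delay=$1
    shift
    first=1
    while IFS= read -r domain || [ -n "$domain" ]; do
        # Skip blank lines and comment lines in the domain list.
        case "$domain" in ''|'#'*) continue ;; esac
        [ "$first" -eq 1 ] || sleep "$delay"
        first=0
        # Detach stdin so the command cannot consume the rest of the list.
        "$@" "$domain" </dev/null
    done
)
```

For example, run_per_domain 30 python3 metagoofil.py -t pdf -d < domains.txt places each domain after -d and waits 30 seconds between runs (domains.txt is an assumed file name).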

Metadata Extraction

# Extract metadata from downloaded documents
exiftool *.pdf

# Extract specific metadata fields
exiftool -Author -Creator -Company *.pdf

# Extract from Word documents
exiftool *.docx

# Find all files by a specific author (replace AUTHOR_NAME)
exiftool -if '$Author =~ /AUTHOR_NAME/i' -FileName -Author *.pdf

# List unique metadata lines across all PDFs (-a includes duplicate tags)
exiftool -a *.pdf | sort | uniq
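
By default exiftool prints each tag as a padded "Name : value" line, with a "========" header per file when several files are given. The sketch below pulls out just the values for one tag so they can be sorted or counted cleanly; tag_values is a hypothetical helper name.

```shell
# Print the values of one tag from exiftool's default "Name : value" output on stdin.
# Usage: exiftool -Author *.pdf | tag_values "Author"
tag_values() {
    awk -v tag="$1" '
        {
            sep = index($0, ": ")
            if (sep == 0) next                      # not a "Name : value" line
            name = substr($0, 1, sep - 1)
            sub(/[ \t]+$/, "", name)                # trim padding after the tag name
            if (tolower(name) == tolower(tag)) print substr($0, sep + 2)
        }'
}
```

The match is against the printed description (for example "Last Modified By"), so pass the name as it appears in the output.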

Advanced Search and Analysis

# Search with custom User-Agent to bypass filtering
python3 metagoofil.py -d target.com -t pdf --user-agent "Mozilla/5.0"

# Use proxy for anonymous searching
python3 metagoofil.py -d target.com -t docx --proxy http://127.0.0.1:8080

# Search using different search engine
python3 metagoofil.py -d target.com -t pdf -s bing

# High-thread count for faster completion
python3 metagoofil.py -d target.com -t "pdf,doc,docx,xls,xlsx" -n 50

Report Generation

# Generate comprehensive HTML report
python3 metagoofil.py -d target.com -t pdf,doc,docx -r -o report.html

# Generate report with verbose output
python3 metagoofil.py -d target.com -t "xls,xlsx,ppt,pptx" -r -v

# Report for a single domain with an explicit HTML output format
python3 metagoofil.py -d company.com -t pdf -r -f html -o company_osint.html

Document Type Reference

Supported Formats

# PDF documents (common for reports, whitepapers)
# Contains: Creator software, Author, Creation date, Subject
python3 metagoofil.py -d target.com -t pdf

# Word documents (DOC, DOCX)
# Contains: Author, Company, Last saved by, Software version
python3 metagoofil.py -d target.com -t "doc,docx"

# Excel spreadsheets (XLS, XLSX)
# Contains: Author, Company, Last saved by, Comments
python3 metagoofil.py -d target.com -t "xls,xlsx"

# PowerPoint presentations (PPT, PPTX)
# Contains: Author, Company, Software, Creation tools
python3 metagoofil.py -d target.com -t "ppt,pptx"

Metadata Extraction Details

# Extract all metadata from PDFs
exiftool -a *.pdf

# Extract specific PDF metadata
exiftool -Title -Author -Creator -Subject -CreateDate *.pdf

# Extract Word document metadata (DOCX stores the author as Creator)
exiftool -Title -Creator -Company -LastModifiedBy -Application -AppVersion *.docx

# Find creation software (version info)
exiftool -Producer -Creator -Software *.pdf

# Export metadata to CSV
exiftool -csv *.pdf > metadata.csv

# Filter for usernames and authors
exiftool -Author -LastSavedBy -Creator *.pdf | grep -v "^$"

Email and Username Harvesting

Harvesting Emails from Documents

# Extract emails from PDF metadata
exiftool -a *.pdf | grep -i "email\|@"

# Find all owner information in documents
exiftool -Author -Owner -Company *.pdf *.docx

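# Extract from Word documents (DOCX is a ZIP archive, so unpack it before searching)
for f in *.docx; do unzip -p "$f" | grep -aoE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"; done | sort -u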

# Combine all metadata into single file
exiftool -csv *.pdf *.docx *.xlsx > all_metadata.csv

# Parse for usernames (text before @ in any email address)
exiftool -a *.pdf | grep -oE "[a-zA-Z0-9._%+-]+@" | tr -d '@' | sort -u
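
The email-harvesting one-liners above repeat the same pattern, so it can be convenient to wrap it once. This is a sketch under simple assumptions: the regex is a pragmatic approximation of an email address, and extract_emails and email_local_parts are hypothetical helper names.

```shell
# Print unique, lowercased email addresses found anywhere in stdin.
extract_emails() {
    grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' \
        | tr '[:upper:]' '[:lower:]' \
        | LC_ALL=C sort -u
}

# Print the unique local parts (text before "@") of addresses read from stdin.
email_local_parts() {
    extract_emails | cut -d'@' -f1 | LC_ALL=C sort -u
}
```

Typical use would be exiftool -a *.pdf | extract_emails > emails.txt. LC_ALL=C keeps the sort order identical across machines.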

Username and Software Discovery

# Find all authors (potential usernames)
exiftool -Author *.pdf *.docx | grep -v "^$" | sort | uniq

# Discover software versions
exiftool -Creator -Producer *.pdf | grep -i "adobe\|microsoft\|openoffice"

# Find last editor information
exiftool -LastModifiedBy *.docx

# Identify company names
exiftool -Company *.docx *.xlsx

# Extract creation timestamps
exiftool -CreateDate -ModifyDate *.pdf

Data Analysis and Filtering

# Export structured metadata
exiftool -csv *.pdf > documents.csv

# Filter high-value metadata
exiftool -a *.pdf | grep -E "Author:|Creator:|Producer:|Subject:"

# Find internal file paths (may reveal usernames/structure)
exiftool -a *.docx | grep -i "path\|directory"

# Look for version information
exiftool -All *.pdf | grep -i "version"

# Create summary report
exiftool -s -Author -Creator -Company *.pdf *.docx | sort | uniq -c | sort -rn

Real-World OSINT Scenarios

Complete Company Reconnaissance

# 1. Search for all document types
python3 metagoofil.py -d company.com -t pdf,doc,docx,xls,xlsx,ppt,pptx -l 100

# 2. Extract all metadata from results
exiftool -a *.pdf *.docx *.xlsx > company_metadata.txt

# 3. Parse for employees
grep -iE "author|creator|last modified by" company_metadata.txt | sort -u

# 4. Look for internal structure
exiftool -a *.docx | grep -iE "template|path|directory"

# 5. Identify software versions
exiftool -Creator -Producer *.pdf | grep -oE "[0-9]+\.[0-9]+" | sort | uniq

# 6. Generate final report
echo "Employees found:" && \
exiftool -Author *.docx *.pdf | grep -v "^$" | sort | uniq -c | sort -rn

Internal Network Mapping via Documents

# Search for technical documents
python3 metagoofil.py -d target.com -t pdf,docx -l 50

# Extract internal paths from documents
strings *.docx *.pdf | grep -E "\\\\Users\\\\|C:\\\\|/home/|/opt/" | sort | uniq

# Find references to internal systems
exiftool -a *.pdf | grep -i "server\|domain\|network"

# Identify software tools in use
exiftool -All *.docx | grep -i "software\|product\|version"

# Extract email patterns
exiftool -a *.pdf *.docx | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" | sort | uniq
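
Home-directory paths leaked in templates or hyperlinks often contain the account name as the path segment after Users or home. The sketch below extracts that segment; users_from_paths is a hypothetical helper, and it assumes account names contain no spaces.

```shell
# Print unique account names taken from Windows and Unix home-directory paths on stdin.
# Matches C:\Users\NAME, C:\Documents and Settings\NAME, /home/NAME and /Users/NAME.
users_from_paths() {
    grep -oE '([A-Za-z]:\\(Users|Documents and Settings)\\[^\\/:*?"<>|[:space:]]+|/(home|Users)/[^/[:space:]]+)' \
        | sed -E 's#.*[\\/]##' \
        | LC_ALL=C sort -u
}
```

It reads any text stream, for example exiftool -a *.docx | users_from_paths.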

Vulnerability Disclosure Research

# Find technical documentation
python3 metagoofil.py -d target.com -t pdf -l 200

# Tally vendor names appearing in PDF Producer/Creator fields
exiftool -Producer -Creator *.pdf | grep -oE "Adobe|Microsoft|OpenOffice" | sort | uniq -c

# Look for creation dates (indicates age of documentation)
exiftool -CreateDate *.pdf | sort

# Find modification timestamps (active development indicators)
exiftool -ModifyDate *.docx | grep "$(date +%Y)" | wc -l

# Identify deprecated software versions
exiftool -All *.pdf | grep -i "version" | grep -E "2010|2012|2013" | wc -l
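
To single out old documents by date rather than by grepping for year strings, exiftool's table mode (-T) gives one tab-separated row per file, with "-" for missing values. The sketch below filters those rows by year; created_before is a hypothetical helper name.

```shell
# Read "FileName<TAB>CreateDate" rows (as printed by: exiftool -T -FileName -CreateDate)
# and print the names of files created before the given year.
# Rows without a parsable date (for example "-") are skipped.
created_before() {
    awk -F'\t' -v year="$1" '
        $2 ~ /^[0-9][0-9][0-9][0-9]:/ && substr($2, 1, 4) + 0 < year + 0 { print $1 }'
}
```

For example: exiftool -T -FileName -CreateDate *.pdf | created_before 2015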

Employee Discovery and Profiling

# Comprehensive employee search
python3 metagoofil.py -d company.com -t "doc,docx,pdf" -l 100 -v

# Extract unique authors (-T prints bare values, "-" when the tag is missing)
exiftool -T -Author *.pdf *.docx | grep -v "^-$" | sort -u > employees.txt

# Create username list from authors
tr ' ' '\n' < employees.txt | sort -u > potential_usernames.txt

# Group documents by file type and author
exiftool -T -FileType -Author *.pdf *.docx | sort | uniq -c | sort -rn

# Find email addresses in document metadata
exiftool -All *.pdf *.docx | grep "@" | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+" | sort | uniq > emails.txt
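
Splitting author names on spaces yields single words, not usernames. A sketch that instead builds the two common formats, firstname.lastname and flastname, from full names follows; username_candidates is a hypothetical helper and it assumes one "First [Middle] Last" name per line.

```shell
# Read full names on stdin and print candidate usernames in two formats:
# first.last and flast. Single-word lines are ignored.
username_candidates() {
    tr '[:upper:]' '[:lower:]' \
        | awk 'NF >= 2 {
            first = $1; last = $NF
            print first "." last                 # firstname.lastname
            print substr(first, 1, 1) last       # flastname
        }' \
        | LC_ALL=C sort -u
}
```

It expects bare names, such as those in an employees.txt file produced with exiftool -T -Author.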

Troubleshooting

Common Issues

Issue: No results found

# Verify domain syntax
python3 metagoofil.py -d target.com -t pdf -v

# Check if domain has public documents
google-chrome "https://www.google.com/search?q=site:target.com+filetype:pdf"

# Try different search engines
python3 metagoofil.py -d target.com -t pdf -s bing
python3 metagoofil.py -d target.com -t pdf -s yahoo

# Increase result limit
python3 metagoofil.py -d target.com -t pdf -l 200

Issue: Slow download speed

# Increase thread count
python3 metagoofil.py -d target.com -t pdf -n 50

# Reduce file type scope
python3 metagoofil.py -d target.com -t pdf  # Single type is faster

# Use proxy for distributed access
python3 metagoofil.py -d target.com -t pdf --proxy http://proxy:8080

Issue: Exiftool not finding metadata

# Verify exiftool installation
exiftool -ver

# Check file format
file *.pdf

# Try alternative metadata tools
strings *.pdf | grep -i "author"

# Use alternative tool: pdfinfo
pdfinfo *.pdf | grep -i "author"

Issue: Permission errors

# Run with proper permissions
chmod +x metagoofil.py

# Create output directory if missing
mkdir -p ./documents

# Fix exiftool permissions (resolve the real install path instead of assuming /usr/bin)
sudo chmod +x "$(command -v exiftool)"

Advanced Workflows

Automated OSINT Collection

#!/bin/bash
# Complete metadata extraction workflow

TARGET="target.com"
OUTPUT_DIR="osint_results_$(date +%Y%m%d)"
mkdir -p "$OUTPUT_DIR"

# 1. Search for all document types
echo "[*] Searching for documents on $TARGET..."
python3 metagoofil.py \
    -d "$TARGET" \
    -t pdf,doc,docx,xls,xlsx,ppt,pptx \
    -l 100 \
    -n 10 \
    -o "$OUTPUT_DIR/metagoofil_report.html" \
    -r

# 2. Extract metadata from all files
echo "[*] Extracting metadata..."
exiftool -csv documents/* > "$OUTPUT_DIR/all_metadata.csv"

# 3. Extract unique authors/creators
echo "[*] Harvesting users..."
exiftool -Author documents/* | grep -v "^$" | sort | uniq > "$OUTPUT_DIR/users.txt"

# 4. Find emails
echo "[*] Harvesting emails..."
exiftool -All documents/* | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+" | sort | uniq > "$OUTPUT_DIR/emails.txt"

# 5. Find software versions
echo "[*] Identifying software..."
exiftool -Creator -Producer documents/* | grep -oE "[0-9]+\.[0-9]+" | sort | uniq > "$OUTPUT_DIR/versions.txt"

# 6. Generate summary
echo "[*] Creating summary report..."
cat > "$OUTPUT_DIR/summary.txt" << EOF
OSINT Summary for $TARGET
Generated: $(date)

Total Users Found: $(wc -l < "$OUTPUT_DIR/users.txt")
Total Emails Found: $(wc -l < "$OUTPUT_DIR/emails.txt")
Software Versions: $(cat "$OUTPUT_DIR/versions.txt")

See metagoofil_report.html for full details
EOF

echo "[+] Results saved to $OUTPUT_DIR/"
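
The workflow above assumes python3 and exiftool are already on PATH and fails midway if they are not. A small pre-flight check, sketched here with a hypothetical missing_tools helper, can stop the script before any searching starts.

```shell
# Print the name of each required tool that is not on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() (
    status=0
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || { printf '%s\n' "$tool"; status=1; }
    done
    return "$status"
)
```

Placed near the top of the script, it could be called as: missing_tools python3 exiftool || exit 1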

Integration with Password Lists

# Extract usernames for brute force attacks
python3 metagoofil.py -d target.com -t pdf,docx

# Create wordlist from discovered users
exiftool -Author *.pdf *.docx | grep -v "^$" | sed 's/ /_/g' | sort | uniq > discovered_users.txt

# Create email list for spraying
exiftool -All *.pdf *.docx | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+" | sort | uniq > emails_for_spray.txt

# Combine with password generator (e.g., CUPP)
python3 cupp.py -i -o passwords.txt

# Use with hydra for credential testing
hydra -L discovered_users.txt -P passwords.txt ssh://target.com

Batch Processing Multiple Domains

#!/bin/bash
# Process multiple targets

DOMAINS="target1.com target2.com target3.com"
DATE=$(date +%Y%m%d)
RESULTS_DIR="batch_osint_${DATE}"
mkdir -p "$RESULTS_DIR"

for domain in $DOMAINS; do
    echo "[*] Processing $domain..."

    # Run metagoofil
    python3 metagoofil.py \
        -d "$domain" \
        -t pdf,docx,xlsx \
        -l 50 \
        -n 10 \
        -o "$RESULTS_DIR/${domain}_report.html" \
        -r

    # Extract and compile results
    exiftool -csv documents/* >> "$RESULTS_DIR/all_metadata.csv"
    exiftool -Author documents/* >> "$RESULTS_DIR/all_users.txt"
done

# Create consolidated summary
sort "$RESULTS_DIR/all_users.txt" | uniq | tee "$RESULTS_DIR/summary_users.txt"
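
Appending exiftool -csv output with >> repeats the header row once per domain. If each domain is written to its own CSV instead, the files can be merged with a single header, as in this sketch; merge_csv is a hypothetical helper and it assumes every file has the same columns in the same order, which exiftool only produces when the same tags are present in each run.

```shell
# Concatenate CSV files, keeping only the first file's header row.
# Assumes every file has identical columns in identical order.
merge_csv() {
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}
```

For example: merge_csv "$RESULTS_DIR"/*_metadata.csv > "$RESULTS_DIR/all_metadata.csv", where the per-domain file names are an assumption.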

Comparison with Similar Tools

Tool          Purpose                                          Strengths
Metagoofil    Automated metadata extraction from public docs   Fast, multi-format, easy search
ExifTool      Manual metadata parsing                          Detailed, cross-platform, comprehensive
pdfinfo       PDF-specific metadata                            PDF-focused, lightweight
strings       Binary analysis of documents                     Raw content extraction
fuxploider    Web path + metadata                              Integrated approach

Best Practices

OSINT Collection

  • Start with limited results (-l 50) and increase if needed
  • Use multiple file types to capture different document creation tools
  • Re-run metadata extraction regularly, since the tools are updated frequently
  • Save results with timestamps for tracking changes
  • Cross-reference multiple sources for validation
  • Use verbose mode during reconnaissance

Email and Username Harvesting

  • Focus on recent documents (within 1 year)
  • Verify usernames against company directory
  • Look for patterns (firstname.lastname, flastname)
  • Check for internal email domains
  • Validate emails against known corporate domains
  • Track software versions for potential vulnerabilities

Legal and Ethical Considerations

  • Only search public, indexed documents
  • Respect robots.txt and site policies
  • Do not access documents requiring authentication
  • Obtain authorization before reconnaissance
  • Document all findings with timestamps
  • Follow responsible disclosure practices
  • Do not exploit discovered information without consent

Performance Optimization

  • Use a thread count appropriate for the target (10-20 is typical)
  • Search for file types separately if results are limited
  • Cache results for repeated queries
  • Use proxy rotation for large-scale searches
  • Monitor rate limiting indicators
  • Implement delays between domain searches
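
One way to act on rate-limiting signals is to retry a failed step with growing pauses. The sketch below is a generic wrapper, not a Metagoofil feature; retry_with_backoff is a hypothetical helper, and it assumes the wrapped command exits non-zero on failure, which may not hold for every Metagoofil version.

```shell
# Run a command up to MAX_TRIES times, doubling the pause after each failure.
# Usage: retry_with_backoff MAX_TRIES FIRST_DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 if every attempt fails.
retry_with_backoff() (
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$max_tries" ] && return 1
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
)
```

For example, retry_with_backoff 4 30 python3 metagoofil.py -d target.com -t pdf would wait 30, 60, then 120 seconds between attempts.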

Reporting

HTML Report Generation

# Generate comprehensive HTML report
python3 metagoofil.py -d target.com -t pdf,docx -r -o report.html

# View report
firefox report.html
# or
open report.html  # macOS
# or
start report.html  # Windows

CSV Export Analysis

# Export all metadata to CSV
exiftool -csv *.pdf *.docx *.xlsx > metadata.csv

# View in spreadsheet
libreoffice metadata.csv

# Analyze with command line
awk -F',' '{print $2}' metadata.csv | sort | uniq -c | sort -rn
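
Hard-coding $2 breaks as soon as the column order changes, which happens whenever exiftool finds a different set of tags. A sketch that looks the column up by its header name instead follows; csv_count_column is a hypothetical helper, and it splits on every comma, so quoted fields containing commas will misalign.

```shell
# Count occurrences of each value in a named CSV column, most frequent first.
# Usage: csv_count_column FILE COLUMN_NAME
# Output: COUNT<TAB>VALUE per line. Empty cells are ignored.
csv_count_column() {
    awk -F',' -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        idx && $idx != "" { count[$idx]++ }
        END { for (v in count) printf "%d\t%s\n", count[v], v }' "$1" \
        | LC_ALL=C sort -k1,1nr -k2
}
```

For example: csv_count_column metadata.csv Author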

Creating Summary Reports

# Count documents by author
echo "=== Author Summary ===" && \
exiftool -Author *.pdf *.docx | grep -v "^$" | sort | uniq -c | sort -rn

# Count by creation year (-T prints the bare date, so the first ":" field is the year)
echo "=== Document Age ===" && \
exiftool -T -CreateDate *.pdf | cut -d':' -f1 | sort | uniq -c

# Software identification
echo "=== Software Used ===" && \
exiftool -Creator *.pdf | sort | uniq -c | sort -rn

# Combined report
cat > report.txt << EOF
OSINT Report for $(date)
=======================
Total Documents: $(ls *.pdf *.docx 2>/dev/null | wc -l)
Unique Authors: $(exiftool -Author *.pdf *.docx 2>/dev/null | grep -v "^$" | sort | uniq | wc -l)
Email Addresses: $(exiftool -All *.pdf *.docx 2>/dev/null | grep -oE "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+" | sort | uniq | wc -l)
Software Versions Found: $(exiftool -Creator *.pdf 2>/dev/null | sort | uniq)
EOF
cat report.txt

Last updated: 2025-03-30 | Metagoofil GitHub