theHarvester Email and Subdomain Enumeration Tool Cheat Sheet
Overview
theHarvester is a powerful OSINT (Open Source Intelligence) tool designed for gathering email addresses, subdomain names, virtual hosts, open ports, banners, and employee names from different public sources. It's widely used by penetration testers, bug bounty hunters, and security researchers for reconnaissance and information gathering during the initial phases of security assessments.
⚠️ Legal Notice: Only use theHarvester on domains you own or have explicit permission to test. Unauthorized reconnaissance may violate terms of service and local laws.
Installation
Kali Linux Installation
bash
# theHarvester is pre-installed on Kali Linux
# (recent releases install the command as theHarvester; older ones as theharvester)
theHarvester --help
# Update to latest version
sudo apt update
sudo apt install theharvester
# Alternative: Install from GitHub
git clone https://github.com/laramies/theHarvester.git
cd theHarvester
sudo python3 -m pip install -r requirements.txt
Ubuntu/Debian Installation
bash
# Install dependencies
sudo apt update
sudo apt install python3 python3-pip git
# Clone repository
git clone https://github.com/laramies/theHarvester.git
cd theHarvester
# Install Python dependencies
python3 -m pip install -r requirements.txt
# Make executable
chmod +x theHarvester.py
# Create symlink for global access
sudo ln -s $(pwd)/theHarvester.py /usr/local/bin/theharvester
Docker Installation
bash
# Pull a Docker image (verify the current image name on Docker Hub;
# building from source below is the most reliable option)
docker pull theharvester/theharvester
# Run with Docker (-d takes the domain, -b the data source)
docker run --rm theharvester/theharvester -d example.com -l 100 -b google
# Build from source
git clone https://github.com/laramies/theHarvester.git
cd theHarvester
docker build -t theharvester .
# Run custom build
docker run --rm theharvester -d example.com -l 100 -b google
Python Virtual Environment
bash
# Create virtual environment
python3 -m venv theharvester-env
source theharvester-env/bin/activate
# Clone and install
git clone https://github.com/laramies/theHarvester.git
cd theHarvester
pip install -r requirements.txt
# Run theHarvester
python3 theHarvester.py --help
Basic Usage
Command Structure
bash
# Basic syntax
theharvester -d <domain> -l <limit> -b <source>
# Common usage pattern
theharvester -d example.com -l 500 -b google
# Multiple sources
theharvester -d example.com -l 500 -b google,bing,yahoo
# Save results to file
theharvester -d example.com -l 500 -b google -f results.html
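Recent theHarvester versions write machine-readable XML and JSON alongside the `-f` output, which is easier to post-process than the HTML report. A minimal sketch, assuming the JSON file contains top-level lists such as `emails` and `hosts` (exact key names vary by version, so the script simply summarizes whatever lists it finds):

```python
#!/usr/bin/env python3
# parse-harvester-json.py -- summarize a theHarvester JSON result file.
# Assumption: the JSON holds top-level lists such as "emails" and "hosts";
# key names vary by release, so any list-valued key is reported.
import json
import sys


def summarize(path):
    """Return {key: sorted unique entries} for every list in the JSON file."""
    with open(path) as f:
        data = json.load(f)
    return {
        key: sorted(set(value))
        for key, value in data.items()
        if isinstance(value, list)
    }


if __name__ == "__main__" and len(sys.argv) == 2:
    for key, items in summarize(sys.argv[1]).items():
        print(f"{key}: {len(items)}")
        for item in items:
            print(f"  {item}")
```

Run it as `python3 parse-harvester-json.py results.json` after a `-f results` scan.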
Essential Parameters
bash
# NOTE: flags differ between theHarvester releases; confirm with --help
# Domain to search
-d, --domain DOMAIN
# Limit the number of search results per source
-l, --limit LIMIT
# Data source(s) to use, comma-separated
-b, --source SOURCE
# Output file (recent versions write XML and JSON)
-f, --filename FILENAME
# Start with result number N (-s in 2.x, -S in 4.x)
-S, --start START
# Enable DNS brute force
-c, --dns-brute
# DNS TLD expansion (2.x only; in 4.x, -t checks for subdomain takeover)
-t, --dns-tld
# Port scan detected hosts (2.x only; in 4.x, -p enables proxies)
-p, --port-scan
# Take screenshots of resolved domains (4.x; expects an output directory)
--screenshot SCREENSHOT_DIR
Data Sources
Search Engines
bash
# Google search
theharvester -d example.com -l 500 -b google
# Bing search
theharvester -d example.com -l 500 -b bing
# Yahoo search
theharvester -d example.com -l 500 -b yahoo
# DuckDuckGo search
theharvester -d example.com -l 500 -b duckduckgo
# Yandex search
theharvester -d example.com -l 500 -b yandex
Social Networks
bash
# LinkedIn search
theharvester -d example.com -l 500 -b linkedin
# Twitter search
theharvester -d example.com -l 500 -b twitter
# Instagram search
theharvester -d example.com -l 500 -b instagram
# Facebook search
theharvester -d example.com -l 500 -b facebook
Professional Databases
bash
# Hunter.io (requires API key)
theharvester -d example.com -l 500 -b hunter
# SecurityTrails (requires API key)
theharvester -d example.com -l 500 -b securitytrails
# Shodan (requires API key)
theharvester -d example.com -l 500 -b shodan
# VirusTotal (requires API key)
theharvester -d example.com -l 500 -b virustotal
Certificate Transparency
bash
# Certificate Transparency logs
theharvester -d example.com -l 500 -b crtsh
# Censys (requires API key)
theharvester -d example.com -l 500 -b censys
# Certificate Spotter
theharvester -d example.com -l 500 -b certspotter
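Certificate transparency results are easy to cross-check by querying crt.sh directly: its public JSON endpoint (`https://crt.sh/?q=%.<domain>&output=json`) returns one record per certificate, with possibly several newline-separated names in each `name_value` field. A hedged sketch (the endpoint may rate-limit or change format):

```python
#!/usr/bin/env python3
# crtsh-crosscheck.py -- query crt.sh directly to cross-check theHarvester's
# crtsh source. Standard library only; network access needed for fetching.
import json
import urllib.parse
import urllib.request


def crtsh_url(domain):
    """Build the crt.sh JSON query URL for all certificates under a domain."""
    return "https://crt.sh/?" + urllib.parse.urlencode(
        {"q": f"%.{domain}", "output": "json"}
    )


def extract_names(records, domain):
    """Pull unique in-scope subdomains out of crt.sh JSON records.

    name_value may hold several newline-separated names per certificate.
    """
    names = set()
    for record in records:
        for name in record.get("name_value", "").splitlines():
            name = name.strip().lstrip("*.").lower()
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)


def fetch_subdomains(domain, timeout=30):
    """Fetch and parse crt.sh results (performs the network request)."""
    with urllib.request.urlopen(crtsh_url(domain), timeout=timeout) as resp:
        return extract_names(json.load(resp), domain)
```

Names found here but missing from the theHarvester report are good candidates for manual follow-up.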
DNS Sources
bash
# DNS dumpster
theharvester -d example.com -l 500 -b dnsdumpster
# Threat Crowd
theharvester -d example.com -l 500 -b threatcrowd
# DNS brute force
theharvester -d example.com -l 500 -b google -c
# TLD expansion
theharvester -d example.com -l 500 -b google -t
Advanced Techniques
Comprehensive Reconnaissance
bash
#!/bin/bash
# comprehensive-recon.sh
DOMAIN="$1"
OUTPUT_DIR="theharvester_results_$(date +%Y%m%d_%H%M%S)"
if [ $# -ne 1 ]; then
    echo "Usage: $0 <domain>"
    exit 1
fi
mkdir -p "$OUTPUT_DIR"
echo "Starting comprehensive reconnaissance for $DOMAIN"
# Search engines
echo "=== Search Engines ==="
theharvester -d "$DOMAIN" -l 500 -b google -f "$OUTPUT_DIR/google.html"
theharvester -d "$DOMAIN" -l 500 -b bing -f "$OUTPUT_DIR/bing.html"
theharvester -d "$DOMAIN" -l 500 -b yahoo -f "$OUTPUT_DIR/yahoo.html"
# Social networks
echo "=== Social Networks ==="
theharvester -d "$DOMAIN" -l 500 -b linkedin -f "$OUTPUT_DIR/linkedin.html"
theharvester -d "$DOMAIN" -l 500 -b twitter -f "$OUTPUT_DIR/twitter.html"
# Certificate transparency
echo "=== Certificate Transparency ==="
theharvester -d "$DOMAIN" -l 500 -b crtsh -f "$OUTPUT_DIR/crtsh.html"
# DNS sources
echo "=== DNS Sources ==="
theharvester -d "$DOMAIN" -l 500 -b dnsdumpster -f "$OUTPUT_DIR/dnsdumpster.html"
# DNS brute force
echo "=== DNS Brute Force ==="
theharvester -d "$DOMAIN" -l 500 -b google -c -f "$OUTPUT_DIR/dns_brute.html"
# All sources combined
echo "=== All Sources Combined ==="
theharvester -d "$DOMAIN" -l 1000 -b all -f "$OUTPUT_DIR/all_sources.html"
echo "Reconnaissance complete. Results saved in $OUTPUT_DIR"
API Key Configuration
bash
# Create the API keys configuration file in the location theHarvester reads
# automatically (~/.theHarvester/ on most installs; the exact YAML layout
# varies by version -- compare with the api-keys.yaml template in the repo)
mkdir -p ~/.theHarvester
cat > ~/.theHarvester/api-keys.yaml << 'EOF'
apikeys:
  hunter:
    key: your_hunter_api_key
  securitytrails:
    key: your_securitytrails_api_key
  shodan:
    key: your_shodan_api_key
  virustotal:
    key: your_virustotal_api_key
  censys:
    id: your_censys_id
    secret: your_censys_secret
  binaryedge:
    key: your_binaryedge_api_key
  fullhunt:
    key: your_fullhunt_api_key
  github:
    key: your_github_token
EOF
# theHarvester picks up the file on its own; no CLI flag is needed
theharvester -d example.com -l 500 -b hunter
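A common failure mode is leaving `your_*` placeholders in the config and wondering why a source returns nothing. A small sketch that scans the file with plain text parsing (no YAML library needed); the path is the conventional location and may differ on your install:

```python
#!/usr/bin/env python3
# check-api-keys.py -- flag placeholder values left in api-keys.yaml.
# Plain-text scan: any value still starting with "your_" was never filled in.
import os


def find_placeholders(lines):
    """Return (key_name, value) pairs whose value is still a your_* placeholder."""
    unfilled = []
    for line in lines:
        if ":" not in line or line.lstrip().startswith("#"):
            continue
        name, _, value = line.partition(":")
        if value.strip().startswith("your_"):
            unfilled.append((name.strip(), value.strip()))
    return unfilled


if __name__ == "__main__":
    # Conventional config location; adjust if your install reads from elsewhere
    path = os.path.expanduser("~/.theHarvester/api-keys.yaml")
    if os.path.exists(path):
        with open(path) as f:
            for name, value in find_placeholders(f):
                print(f"unfilled key: {name} = {value}")
```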
Email Pattern Analysis
python
#!/usr/bin/env python3
# email-pattern-analyzer.py
import re
import sys
from collections import Counter

def analyze_email_patterns(emails):
    """Analyze email patterns to identify naming conventions"""
    patterns = []
    domains = []
    for email in emails:
        if '@' in email:
            local, domain = email.split('@', 1)
            domains.append(domain.lower())
            # Analyze local part patterns
            if '.' in local:
                if len(local.split('.')) == 2:
                    patterns.append('firstname.lastname')
                else:
                    patterns.append('complex.pattern')
            elif '_' in local:
                patterns.append('firstname_lastname')
            elif any(char.isdigit() for char in local):
                patterns.append('name_with_numbers')
            else:
                patterns.append('single_name')
    return patterns, domains

def extract_names_from_emails(emails):
    """Extract potential names from email addresses"""
    names = []
    for email in emails:
        if '@' in email:
            local = email.split('@')[0]
            # Remove numbers and special characters
            clean_local = re.sub(r'[0-9_.-]', ' ', local)
            # Split into potential name parts
            parts = clean_local.split()
            if len(parts) >= 2:
                names.extend(parts)
    return names

def main():
    if len(sys.argv) != 2:
        print("Usage: python3 email-pattern-analyzer.py <email_list_file>")
        sys.exit(1)
    email_file = sys.argv[1]
    try:
        with open(email_file, 'r') as f:
            content = f.read()
        # Extract emails using regex
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        emails = re.findall(email_pattern, content)
        print(f"Found {len(emails)} email addresses")
        print("\n=== Email Addresses ===")
        for email in sorted(set(emails)):
            print(email)
        # Analyze patterns
        patterns, domains = analyze_email_patterns(emails)
        print("\n=== Email Patterns ===")
        pattern_counts = Counter(patterns)
        for pattern, count in pattern_counts.most_common():
            print(f"{pattern}: {count}")
        print("\n=== Domains ===")
        domain_counts = Counter(domains)
        for domain, count in domain_counts.most_common():
            print(f"{domain}: {count}")
        # Extract names
        names = extract_names_from_emails(emails)
        if names:
            print("\n=== Potential Names ===")
            name_counts = Counter(names)
            for name, count in name_counts.most_common(20):
                if len(name) > 2:  # Filter out short strings
                    print(f"{name}: {count}")
    except FileNotFoundError:
        print(f"Error: File {email_file} not found")
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()
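Once the analyzer has identified a naming convention (say, `firstname.lastname`), it can be inverted: given employee names harvested from LinkedIn or similar sources, generate candidate addresses for verification. An illustrative helper, not theHarvester functionality; the pattern names mirror those reported above:

```python
#!/usr/bin/env python3
# email-candidate-generator.py -- turn a detected naming convention into
# candidate addresses for known employee names (illustrative helper).

# Templates keyed by the pattern names the analyzer reports
PATTERNS = {
    "firstname.lastname": "{first}.{last}@{domain}",
    "firstname_lastname": "{first}_{last}@{domain}",
    "single_name": "{first}@{domain}",
}


def generate_candidates(names, domain, pattern="firstname.lastname"):
    """Build candidate emails for (first, last) name pairs."""
    template = PATTERNS[pattern]
    return [
        template.format(first=first.lower(), last=last.lower(), domain=domain)
        for first, last in names
    ]
```

Candidates should be validated (e.g. via an email verification service you are authorized to use) before being treated as real.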
Subdomain Validation
bash
#!/bin/bash
# subdomain-validator.sh
DOMAIN="$1"
SUBDOMAIN_FILE="$2"
if [ $# -ne 2 ]; then
    echo "Usage: $0 <domain> <subdomain_file>"
    exit 1
fi
echo "Validating subdomains for $DOMAIN"
# Extract subdomains from theHarvester results
grep -oE "[a-zA-Z0-9.-]+\.$DOMAIN" "$SUBDOMAIN_FILE" | sort -u > temp_subdomains.txt
# Validate each subdomain
while read -r subdomain; do
    if [ -n "$subdomain" ]; then
        echo -n "Checking $subdomain: "
        # DNS resolution check
        if nslookup "$subdomain" >/dev/null 2>&1; then
            echo -n "DNS✓ "
            # HTTP check
            if curl -s --connect-timeout 5 "http://$subdomain" >/dev/null 2>&1; then
            echo "HTTP✓"
            elif curl -s --connect-timeout 5 "https://$subdomain" >/dev/null 2>&1; then
                echo "HTTPS✓"
            else
                echo "No HTTP"
            fi
        else
            echo "DNS✗"
        fi
    fi
done < temp_subdomains.txt
rm temp_subdomains.txt
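The sequential `nslookup` loop above gets slow for large subdomain lists. A faster alternative, resolving names concurrently with only the Python standard library:

```python
#!/usr/bin/env python3
# resolve-subdomains.py -- concurrent DNS check, a faster alternative to a
# sequential nslookup loop. Standard library only.
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve(host):
    """Return (host, ip) or (host, None) if the name does not resolve."""
    try:
        return host, socket.gethostbyname(host)
    except socket.gaierror:
        return host, None


def resolve_all(hosts, workers=20):
    """Resolve many hostnames in parallel; returns {host: ip_or_None}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(resolve, hosts))


if __name__ == "__main__":
    import sys
    for host, ip in sorted(resolve_all(sys.argv[1:]).items()):
        print(f"{host}: {ip or 'DNS FAIL'}")
```

Note that some resolvers return wildcard answers for any name under a domain; spot-check a deliberately bogus subdomain to detect that.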
Integration with Other Tools
Integration with Nmap
bash
#!/bin/bash
# theharvester-nmap-integration.sh
DOMAIN="$1"
if [ $# -ne 1 ]; then
    echo "Usage: $0 <domain>"
    exit 1
fi
# Gather subdomains with theHarvester
echo "Gathering subdomains with theHarvester..."
theharvester -d "$DOMAIN" -l 500 -b all -f harvester_results.html
# Extract IP addresses and subdomains
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' harvester_results.html | sort -u > ips.txt
grep -oE "[a-zA-Z0-9.-]+\.$DOMAIN" harvester_results.html | sort -u > subdomains.txt
# Scan discovered IPs with Nmap
if [ -s ips.txt ]; then
    echo "Scanning discovered IPs with Nmap..."
    nmap -sS -O -sV -oA nmap_ips -iL ips.txt
fi
# Resolve subdomains and scan
if [ -s subdomains.txt ]; then
    echo "Resolving and scanning subdomains..."
    while read -r subdomain; do
        ip=$(dig +short "$subdomain" | head -1)
        if [ -n "$ip" ] && [[ "$ip" =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
            echo "$ip $subdomain" >> resolved_hosts.txt
            echo "$ip" >> resolved_ips.txt
        fi
    done < subdomains.txt
    if [ -s resolved_ips.txt ]; then
        # nmap -iL expects one target per line, so feed it the IP-only list
        nmap -sS -sV -oA nmap_subdomains -iL resolved_ips.txt
    fi
fi
echo "Integration complete. Check nmap_*.xml files for results."
Integration with Metasploit
bash
#!/bin/bash
# theharvester-metasploit-integration.sh
DOMAIN="$1"
WORKSPACE="$2"
if [ $# -ne 2 ]; then
    echo "Usage: $0 <domain> <workspace>"
    exit 1
fi
# Run theHarvester
theharvester -d "$DOMAIN" -l 500 -b all -f harvester_results.html
# Extract emails and hosts
grep -oE '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b' harvester_results.html > emails.txt
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' harvester_results.html | sort -u > hosts.txt
# Create Metasploit resource script
cat > metasploit_import.rc << EOF
workspace -a $WORKSPACE
workspace $WORKSPACE
# Import hosts
$(while read host; do echo "hosts -a $host"; done < hosts.txt)
# Import emails as notes
$(while read email; do echo "notes -a -t email -d \"$email\" -H $DOMAIN"; done < emails.txt)
# Run auxiliary modules
use auxiliary/gather/dns_enum
set DOMAIN $DOMAIN
run
use auxiliary/scanner/http/http_version
set RHOSTS file:hosts.txt
run
workspace
hosts
notes
EOF
echo "Metasploit resource script created: metasploit_import.rc"
echo "Run with: msfconsole -r metasploit_import.rc"
Integration with Recon-ng
python
#!/usr/bin/env python3
# theharvester-recon-ng-integration.py
import subprocess
import re
import json

class TheHarvesterReconIntegration:
    def __init__(self, domain):
        self.domain = domain
        self.results = {
            'emails': [],
            'subdomains': [],
            'ips': [],
            'social_profiles': []
        }

    def run_theharvester(self):
        """Run theHarvester and parse results"""
        try:
            # Run theHarvester with multiple sources
            cmd = ['theharvester', '-d', self.domain, '-l', '500', '-b', 'all']
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode == 0:
                self.parse_results(result.stdout)
            else:
                print(f"theHarvester error: {result.stderr}")
        except Exception as e:
            print(f"Error running theHarvester: {e}")

    def parse_results(self, output):
        """Parse theHarvester output"""
        # Extract emails
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        self.results['emails'] = list(set(re.findall(email_pattern, output)))
        # Extract IPs (non-capturing group so findall returns whole addresses)
        ip_pattern = r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}'
        self.results['ips'] = list(set(re.findall(ip_pattern, output)))
        # Extract subdomains
        subdomain_pattern = rf'[a-zA-Z0-9.-]+\.{re.escape(self.domain)}'
        self.results['subdomains'] = list(set(re.findall(subdomain_pattern, output)))

    def generate_recon_ng_commands(self):
        """Generate Recon-ng commands"""
        commands = [
            f"workspaces create {self.domain}",
            f"workspaces select {self.domain}",
        ]
        # Add domains
        commands.append(f"db insert domains {self.domain}")
        for subdomain in self.results['subdomains']:
            commands.append(f"db insert domains {subdomain}")
        # Add hosts
        for ip in self.results['ips']:
            commands.append(f"db insert hosts {ip}")
        # Add contacts (emails)
        for email in self.results['emails']:
            local, domain = email.split('@', 1)
            commands.extend([
                f"db insert contacts {local} {local} {email}",
                f"db insert domains {domain}"
            ])
        # Add reconnaissance modules
        commands.extend([
            "modules load recon/domains-hosts/hackertarget",
            "run",
            "modules load recon/domains-hosts/threatcrowd",
            "run",
            "modules load recon/hosts-ports/shodan_hostname",
            "run"
        ])
        return commands

    def save_recon_ng_script(self, filename="recon_ng_commands.txt"):
        """Save Recon-ng commands to file"""
        commands = self.generate_recon_ng_commands()
        with open(filename, 'w') as f:
            for cmd in commands:
                f.write(cmd + '\n')
        print(f"Recon-ng commands saved to {filename}")
        print(f"Run with: recon-ng -r {filename}")

    def export_json(self, filename="theharvester_results.json"):
        """Export results to JSON"""
        with open(filename, 'w') as f:
            json.dump(self.results, f, indent=2)
        print(f"Results exported to {filename}")

def main():
    import sys
    if len(sys.argv) != 2:
        print("Usage: python3 theharvester-recon-ng-integration.py <domain>")
        sys.exit(1)
    domain = sys.argv[1]
    integration = TheHarvesterReconIntegration(domain)
    integration.run_theharvester()
    integration.save_recon_ng_script()
    integration.export_json()
    print("\nResults Summary:")
    print(f"Emails: {len(integration.results['emails'])}")
    print(f"Subdomains: {len(integration.results['subdomains'])}")
    print(f"IPs: {len(integration.results['ips'])}")

if __name__ == "__main__":
    main()
Automation and Scripting
Automated Monitoring
bash
#!/bin/bash
# theharvester-monitor.sh
DOMAIN="$1"
INTERVAL="$2" # in hours
ALERT_EMAIL="$3"
if [ $# -ne 3 ]; then
    echo "Usage: $0 <domain> <interval_hours> <alert_email>"
    exit 1
fi
BASELINE_FILE="baseline_${DOMAIN}.txt"
CURRENT_FILE="current_${DOMAIN}.txt"
# Create baseline if it doesn't exist
if [ ! -f "$BASELINE_FILE" ]; then
    echo "Creating baseline for $DOMAIN"
    theharvester -d "$DOMAIN" -l 500 -b all > "$BASELINE_FILE"
fi
while true; do
    echo "$(date): Monitoring $DOMAIN"
    # Run current scan
    theharvester -d "$DOMAIN" -l 500 -b all > "$CURRENT_FILE"
    # Compare with baseline
    if ! diff -q "$BASELINE_FILE" "$CURRENT_FILE" >/dev/null; then
        echo "Changes detected for $DOMAIN"
        # Generate diff report
        diff "$BASELINE_FILE" "$CURRENT_FILE" > "changes_${DOMAIN}_$(date +%Y%m%d_%H%M%S).txt"
        # Send alert email
        if command -v mail >/dev/null; then
            echo "New information discovered for $DOMAIN" | mail -s "theHarvester Alert: $DOMAIN" "$ALERT_EMAIL"
        fi
        # Update baseline
        cp "$CURRENT_FILE" "$BASELINE_FILE"
    fi
    # Wait for next interval
    sleep $((INTERVAL * 3600))
done
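A raw `diff` of two runs produces false alerts whenever theHarvester merely reorders its output. Comparing extracted sets instead reports only genuinely new findings; a sketch for emails (the same idea extends to hosts and IPs):

```python
#!/usr/bin/env python3
# harvest-diff.py -- report findings present in the current run but absent
# from the baseline, by comparing extracted sets rather than raw text.
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """All unique email addresses found in a theHarvester output dump."""
    return set(EMAIL_RE.findall(text))


def new_findings(baseline_text, current_text):
    """Emails in the current run that the baseline did not contain."""
    return sorted(extract_emails(current_text) - extract_emails(baseline_text))


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as base, open(sys.argv[2]) as cur:
            for email in new_findings(base.read(), cur.read()):
                print(f"NEW: {email}")
```

Usage: `python3 harvest-diff.py baseline_example.com.txt current_example.com.txt`.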
Batch Domain Processing
python
#!/usr/bin/env python3
# batch-domain-processor.py
import subprocess
import time
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

class BatchDomainProcessor:
    def __init__(self, max_workers=5):
        self.max_workers = max_workers
        self.results = {}

    def process_domain(self, domain, sources=('google', 'bing', 'crtsh')):
        """Process a single domain"""
        try:
            print(f"Processing {domain}...")
            # Create output directory
            output_dir = f"results_{domain}_{int(time.time())}"
            os.makedirs(output_dir, exist_ok=True)
            results = {}
            for source in sources:
                try:
                    output_file = f"{output_dir}/{source}.html"
                    cmd = [
                        'theharvester',
                        '-d', domain,
                        '-l', '500',
                        '-b', source,
                        '-f', output_file
                    ]
                    result = subprocess.run(
                        cmd,
                        capture_output=True,
                        text=True,
                        timeout=300  # 5 minute timeout
                    )
                    if result.returncode == 0:
                        results[source] = {
                            'status': 'success',
                            'output_file': output_file
                        }
                    else:
                        results[source] = {
                            'status': 'error',
                            'error': result.stderr
                        }
                except subprocess.TimeoutExpired:
                    results[source] = {
                        'status': 'timeout',
                        'error': 'Command timed out'
                    }
                except Exception as e:
                    results[source] = {
                        'status': 'error',
                        'error': str(e)
                    }
            self.results[domain] = results
            print(f"Completed {domain}")
        except Exception as e:
            print(f"Error processing {domain}: {e}")
            self.results[domain] = {'error': str(e)}

    def process_domains(self, domains, sources=('google', 'bing', 'crtsh')):
        """Process multiple domains concurrently"""
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            futures = {
                executor.submit(self.process_domain, domain, sources): domain
                for domain in domains
            }
            for future in as_completed(futures):
                domain = futures[future]
                try:
                    future.result()
                except Exception as e:
                    print(f"Error processing {domain}: {e}")

    def generate_summary_report(self, output_file="batch_summary.txt"):
        """Generate summary report"""
        with open(output_file, 'w') as f:
            f.write("theHarvester Batch Processing Summary\n")
            f.write("=" * 40 + "\n\n")
            for domain, results in self.results.items():
                f.write(f"Domain: {domain}\n")
                if 'error' in results:
                    f.write(f"  Error: {results['error']}\n")
                else:
                    for source, result in results.items():
                        f.write(f"  {source}: {result['status']}\n")
                        if result['status'] == 'error':
                            f.write(f"    Error: {result['error']}\n")
                f.write("\n")
        print(f"Summary report saved to {output_file}")

def main():
    import sys
    if len(sys.argv) != 2:
        print("Usage: python3 batch-domain-processor.py <domain_list_file>")
        sys.exit(1)
    domain_file = sys.argv[1]
    try:
        with open(domain_file, 'r') as f:
            domains = [line.strip() for line in f if line.strip()]
        processor = BatchDomainProcessor(max_workers=3)
        print(f"Processing {len(domains)} domains...")
        processor.process_domains(domains)
        processor.generate_summary_report()
        print("Batch processing complete!")
    except FileNotFoundError:
        print(f"Error: File {domain_file} not found")
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()
Best Practices
Reconnaissance Methodology
text
1. Passive Information Gathering:
   - Start with search engines (Google, Bing)
   - Use certificate transparency logs
   - Check social media platforms
   - Avoid direct contact with target
2. Source Diversification:
   - Use multiple data sources
   - Cross-reference findings
   - Validate discovered information
   - Document source reliability
3. Rate Limiting:
   - Respect API rate limits
   - Use delays between requests
   - Rotate IP addresses if needed
   - Monitor for blocking
4. Data Validation:
   - Verify email addresses exist
   - Check subdomain resolution
   - Validate IP address ownership
   - Confirm social media profiles
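Since theHarvester itself has no delay flag, the rate-limiting guidance above is easiest to follow from a wrapper that queries one source at a time with a pause in between. A sketch; the delay value is an arbitrary example, tune it to the sources and API plans you use:

```python
#!/usr/bin/env python3
# rate-limited-run.py -- invoke theHarvester per-source with a pause between
# runs, per the rate-limiting guidance above.
import subprocess
import time


def build_commands(domain, sources, limit=200):
    """One theHarvester invocation per source."""
    return [
        ["theharvester", "-d", domain, "-l", str(limit), "-b", source]
        for source in sources
    ]


def run_spaced(commands, delay=10, runner=subprocess.run):
    """Run commands sequentially, sleeping `delay` seconds between them."""
    for i, cmd in enumerate(commands):
        if i:
            time.sleep(delay)
        runner(cmd, check=False)


if __name__ == "__main__":
    import sys
    if len(sys.argv) >= 3:
        # e.g. rate-limited-run.py example.com google bing crtsh
        run_spaced(build_commands(sys.argv[1], sys.argv[2:]))
```

The injectable `runner` keeps the scheduling logic testable without actually shelling out.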
Operational Security
bash
#!/bin/bash
# opsec-checklist.sh
echo "theHarvester OPSEC Checklist"
echo "============================"
echo "1. Network Security:"
echo " □ Use VPN or proxy"
echo " □ Rotate IP addresses"
echo " □ Monitor for rate limiting"
echo " □ Use different user agents"
echo -e "\n2. Data Handling:"
echo " □ Encrypt stored results"
echo " □ Use secure file permissions"
echo " □ Delete temporary files"
echo " □ Secure API keys"
echo -e "\n3. Legal Compliance:"
echo " □ Verify authorization scope"
echo " □ Respect terms of service"
echo " □ Document activities"
echo " □ Follow local laws"
echo -e "\n4. Technical Measures:"
echo " □ Use isolated environment"
echo " □ Monitor system logs"
echo " □ Validate SSL certificates"
echo " □ Check for detection"
Troubleshooting
Common Issues
bash
# Issue: API rate limiting
# Solution: use API keys and lower result limits; theHarvester has no
# built-in delay flag, so space out runs yourself (e.g. sleep between them)
theharvester -d example.com -l 100 -b google
# Issue: No results from certain sources
# Check whether the source still works with a small query
theharvester -d example.com -l 10 -b google
# Issue: SSL certificate errors
# Disable SSL verification (use with caution)
export PYTHONHTTPSVERIFY=0
# Issue: Timeout errors
# Increase timeout values in source code
# Or use smaller result limits
theharvester -d example.com -l 50 -b google
Debug Mode
bash
# Verify hosts via DNS resolution and search for virtual hosts (-v in 4.x)
theharvester -d example.com -l 100 -b google -v
# Check available sources
theharvester -h | grep -A 20 "sources:"
# Test specific source
theharvester -d google.com -l 10 -b google
# Check API key configuration
cat ~/.theHarvester/api-keys.yaml
Performance Optimization
bash
# Use specific sources instead of 'all'
theharvester -d example.com -l 500 -b google,bing,crtsh
# Limit results for faster execution
theharvester -d example.com -l 100 -b google
# Use parallel processing for multiple domains
parallel -j 3 theharvester -d {} -l 500 -b google ::: domain1.com domain2.com domain3.com
# Skip writing .pyc files when running from a source checkout
export PYTHONDONTWRITEBYTECODE=1
Resources
- theHarvester GitHub Repository
- theHarvester Documentation
- OSINT Framework
- OWASP Testing Guide
- Penetration Testing Execution Standard
This cheat sheet provides comprehensive guidance for using theHarvester for OSINT and reconnaissance activities. Always ensure proper authorization and legal compliance before conducting any information gathering activities.