Robotstxt
Overview
Robotstxt is a security reconnaissance tool that analyzes robots.txt files to identify sensitive paths, hidden directories, and restricted resources that websites attempt to hide from search engines and crawlers. During penetration testing and security assessments, robots.txt files often contain valuable intelligence about application structure, administrative paths, staging environments, and API endpoints.
Installation
# Install via pip
pip install robotstxt
# Install via apt (Debian/Ubuntu)
sudo apt-get install robotstxt
# From source (GitHub)
git clone https://github.com/roycewilson/robotstxt.git
cd robotstxt
pip install .
# Verify installation
robotstxt --version
which robotstxt
Core Concepts
Why Robots.txt Matters in Security:
- Website owners disclose restricted directories they want crawlers to avoid
- Often contains paths to administrative panels, staging servers, and internal tools
- May reveal API endpoints, internal APIs, and sensitive data locations
- Can expose technology stack and backend structure
- Commonly forgotten and outdated - may reference long-deleted systems
Standard Robots.txt Structure:
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /staging/
Allow: /public/
Crawl-delay: 5
Request-rate: 30/1m
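The same rules can be evaluated programmatically without any external tool. A minimal sketch using Python's standard-library `urllib.robotparser` (the sample is trimmed to the directives the parser handles; note the parser only accepts `Request-rate` in numeric `requests/seconds` form, so `30/1m` would be ignored):

```python
from urllib.robotparser import RobotFileParser

# Trimmed version of the sample robots.txt above, as raw text
ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /staging/
Allow: /public/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Check whether the default agent may fetch a given path
print(rp.can_fetch("*", "/admin/"))   # False: disallowed
print(rp.can_fetch("*", "/public/"))  # True: explicitly allowed
print(rp.crawl_delay("*"))            # 5
```

`can_fetch()` is the same check a well-behaved crawler performs, which makes it a convenient oracle when verifying what a target's rules actually permit.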
Basic Usage
| Command | Purpose |
|---|---|
| robotstxt http://example.com | Fetch and display robots.txt |
| robotstxt https://example.com | Fetch robots.txt over HTTPS |
| robotstxt example.com | Auto-detect protocol and fetch |
| robotstxt --help | Display help and options |
| robotstxt --version | Show tool version |
Fetching Robots.txt Files
# Single URL
robotstxt http://example.com
# With verbose output
robotstxt -v http://example.com
# With timeout (seconds)
robotstxt --timeout 10 http://example.com
# From multiple domains
for domain in example.com test.com demo.com; do
echo "=== $domain ==="
robotstxt http://$domain
done
Parsing and Analysis
| Command | Purpose |
|---|---|
| robotstxt http://example.com --parse | Parse and display structured data |
| robotstxt http://example.com --json | Output in JSON format |
| robotstxt http://example.com --csv | Output in CSV format |
| robotstxt http://example.com --xml | Output in XML format |
Extracting Disallowed Paths
# Display all disallowed paths
robotstxt http://example.com --disallow
# Extract disallowed paths only (for further analysis)
robotstxt http://example.com | grep -i "^Disallow"
# Get unique disallowed paths
robotstxt http://example.com --disallow | sort -u
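The grep-and-sort pipeline above can also be reproduced in plain Python when you already have the robots.txt body as text, which is handy inside larger scripts. A sketch (the sample input is illustrative):

```python
# Plain-Python equivalent of grepping "^Disallow" and de-duplicating
def disallowed_paths(robots_text: str) -> list:
    """Return unique Disallow paths in order of first appearance."""
    seen = []
    for line in robots_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow":
            path = value.strip()
            if path and path not in seen:
                seen.append(path)
    return seen

sample = """User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /admin/
Allow: /public/"""
print(disallowed_paths(sample))  # ['/admin/', '/private/']
```

Preserving first-seen order (rather than `sort -u`) keeps paths grouped the way the site author wrote them, which often hints at related functionality.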
User-Agent Specific Analysis
# Check specific user-agent rules
robotstxt http://example.com --user-agent "Googlebot"
robotstxt http://example.com --user-agent "Bingbot"
# Test multiple user-agents
for ua in "Googlebot" "Bingbot" "Yahoo! Slurp" "*"; do
echo "User-Agent: $ua"
robotstxt http://example.com --user-agent "$ua"
echo "---"
done
Crawl Delay and Request Rate Analysis
# Display crawl delays
robotstxt http://example.com --crawl-delay
# Show request rates
robotstxt http://example.com --request-rate
# Full details including timing
robotstxt http://example.com --verbose
URL Checking
# Test if URL is allowed by robots.txt
robotstxt http://example.com --test /admin/
robotstxt http://example.com --test /api/users
# Test multiple URLs
robotstxt http://example.com --test /admin/,/private/,/staging/
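The same batch check can be sketched with the standard library, independent of the tool's `--test` flag (rules and paths below are illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = """User-agent: *
Disallow: /admin/
Disallow: /staging/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

# Batch-test several paths, mirroring --test /admin/,/private/,/staging/
for path in ["/admin/", "/private/", "/staging/"]:
    verdict = "allowed" if rp.can_fetch("*", path) else "disallowed"
    print(f"{path}: {verdict}")
```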
Output Formatting
# JSON output for programmatic parsing
robotstxt http://example.com --json | jq -r '.user_agents[].disallow[]'
# CSV for spreadsheet analysis
robotstxt http://example.com --csv > paths.csv
# Plain text for reports
robotstxt http://example.com > report.txt
# Combined output to file
robotstxt http://example.com --json --output results.json
Batch Processing Multiple Domains
# Create domain list
cat > domains.txt << EOF
example.com
example.net
example.org
test.com
demo.com
EOF
# Process all domains
while read -r domain; do
echo "=== Analyzing $domain ==="
robotstxt "http://$domain" --json >> results.json
done < domains.txt
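The per-domain aggregation above can be sketched in Python as well. The robots.txt bodies here are hypothetical stand-ins for responses you would fetch over HTTP:

```python
import json

# Hypothetical per-domain robots.txt bodies (in practice, fetched over HTTP)
FETCHED = {
    "example.com": "User-agent: *\nDisallow: /admin/\n",
    "test.com": "User-agent: *\nDisallow: /backup/\nDisallow: /api/internal/\n",
}

results = []
for domain, body in FETCHED.items():
    # Collect every Disallow path for this domain
    disallow = [line.split(":", 1)[1].strip()
                for line in body.splitlines()
                if line.lower().startswith("disallow:")]
    results.append({"domain": domain, "disallow": disallow})

print(json.dumps({"results": results}, indent=2))
```

Emitting one well-formed JSON document at the end avoids the concatenated-fragments problem that naive `>>` appends can produce.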
Automation Script
#!/bin/bash
# Multi-domain robots.txt analyzer
DOMAINS="$1"
OUTPUT="robotstxt_results.json"
echo '{"results": [' > "$OUTPUT"
first=true
while read -r domain; do
if [ -z "$domain" ]; then continue; fi
if [ "$first" = true ]; then
first=false
else
echo "," >> "$OUTPUT"
fi
echo "Processing: $domain"
# Merge the domain name into the tool's JSON so the output stays valid JSON
robotstxt "http://$domain" --json | jq --arg d "$domain" '{domain: $d} + .' >> "$OUTPUT"
done < "$DOMAINS"
echo ']}' >> "$OUTPUT"
echo "Results saved to $OUTPUT"
Integration with Other Tools
Combine with curl
# Fetch robots.txt directly with curl
curl -s http://example.com/robots.txt | robotstxt --parse
# Store and analyze
curl -s http://example.com/robots.txt > robots.txt
robotstxt --file robots.txt --json
With Wget
# Download robots.txt
wget http://example.com/robots.txt
# Analyze
robotstxt --file robots.txt
Pipeline with Grep
# Find paths containing "admin"
robotstxt http://example.com --disallow | grep -i admin
# Extract API endpoints
robotstxt http://example.com --disallow | grep -i api
# Find staging/test paths
robotstxt http://example.com --disallow | grep -iE 'staging|test|dev|debug'
Advanced Reconnaissance
Abschnitt betitelt „Advanced Reconnaissance“Finding Hidden Admin Panels
# Common admin paths often in robots.txt
robotstxt http://example.com --disallow | grep -iE 'admin|panel|console|dashboard'
Discovering API Endpoints
# API paths frequently disclosed
robotstxt http://example.com --disallow | grep -iE '/api/|/v[0-9]+/|/service/|/rest/'
Staging Environment Discovery
# Staging and development environments
robotstxt http://example.com --disallow | grep -iE 'staging|dev|test|qa|sandbox'
Private Data Discovery
# Potentially sensitive directories
robotstxt http://example.com --disallow | grep -iE 'private|confidential|internal|secret|backup'
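The grep patterns from the sections above can be combined into a single classifier when triaging a long path list. A sketch with illustrative keyword sets (tune them to your target):

```python
import re

# Categories mirroring the grep patterns above (keywords are illustrative)
CATEGORIES = {
    "admin": re.compile(r"admin|panel|console|dashboard", re.I),
    "api": re.compile(r"/api/|/v\d+/|/rest/", re.I),
    "staging": re.compile(r"staging|dev|test|qa|sandbox", re.I),
    "sensitive": re.compile(r"private|internal|secret|backup", re.I),
}

def classify(paths):
    """Map each category to the disallowed paths matching it."""
    return {name: [p for p in paths if rx.search(p)]
            for name, rx in CATEGORIES.items()}

paths = ["/admin/login", "/api/v2/users", "/staging/", "/backup/db/", "/blog/"]
print(classify(paths))
```

A path can land in several categories at once (e.g. `/staging/api/`), which is usually more informative than forcing a single label.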
Data Analysis and Reporting
# Count disallowed paths
robotstxt http://example.com --disallow | wc -l
# Identify most common path prefixes
robotstxt http://example.com --disallow | awk -F'/' '{print $2}' | sort | uniq -c | sort -rn
# Path depth analysis
robotstxt http://example.com --disallow | awk -F'/' '{print NF-1}' | sort | uniq -c
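The awk prefix count above has a straightforward Python equivalent using `collections.Counter`, useful when the analysis is part of a larger script (sample paths are illustrative):

```python
from collections import Counter

# Equivalent of the awk first-segment count above
paths = ["/admin/users", "/admin/logs", "/api/v1/", "/api/v2/", "/private/"]
prefixes = Counter(p.split("/")[1] for p in paths if p.startswith("/"))
for prefix, count in prefixes.most_common():
    print(f"{count:4d} {prefix}")
```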
File-Based Analysis
# Analyze local robots.txt file
robotstxt --file ./robots.txt
# Compare two robots.txt files
diff <(robotstxt http://example.com --disallow) <(robotstxt http://backup.example.com --disallow)
# Historical comparison between two saved copies
diff <(robotstxt --file robots.txt.old --disallow) <(robotstxt --file robots.txt.new --disallow)
Output Examples
JSON Output Structure
{
"domain": "example.com",
"user_agents": [
{
"pattern": "*",
"disallow": [
"/admin/",
"/private/",
"/staging/"
],
"allow": [
"/public/api/"
],
"crawl_delay": 5,
"request_rate": "30/1m"
}
],
"sitemaps": [
"https://example.com/sitemap.xml"
]
}
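Assuming `--json` emits the structure shown above, consuming it programmatically is a few lines of standard-library Python (the document is embedded as a string here purely for illustration):

```python
import json

# The JSON structure shown above, embedded as a string for illustration
DOC = """
{
  "domain": "example.com",
  "user_agents": [
    {"pattern": "*",
     "disallow": ["/admin/", "/private/", "/staging/"],
     "allow": ["/public/api/"],
     "crawl_delay": 5,
     "request_rate": "30/1m"}
  ],
  "sitemaps": ["https://example.com/sitemap.xml"]
}
"""

data = json.loads(DOC)
# Flatten every disallowed path across all user-agent blocks
disallow = [p for ua in data["user_agents"] for p in ua["disallow"]]
print(disallow)  # ['/admin/', '/private/', '/staging/']
```

Flattening across `user_agents` matters because bot-specific blocks (e.g. a Googlebot-only section) sometimes disclose paths the wildcard block does not.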
Practical Reconnaissance Scenarios
Target Enumeration
#!/bin/bash
# Enumerate target's robots.txt for sensitive paths
TARGET="$1"
OUTPUT="recon_${TARGET}.txt"
echo "=== Robots.txt Reconnaissance ===" > "$OUTPUT"
echo "Target: $TARGET" >> "$OUTPUT"
echo "Date: $(date)" >> "$OUTPUT"
echo "" >> "$OUTPUT"
echo "[+] Full robots.txt:" >> "$OUTPUT"
robotstxt http://$TARGET >> "$OUTPUT"
echo "" >> "$OUTPUT"
echo "[+] Disallowed Paths:" >> "$OUTPUT"
robotstxt http://$TARGET --disallow >> "$OUTPUT"
echo "" >> "$OUTPUT"
echo "[+] Potential Admin Paths:" >> "$OUTPUT"
robotstxt http://$TARGET --disallow | grep -iE 'admin|manage|control' >> "$OUTPUT"
echo "" >> "$OUTPUT"
echo "[+] API Endpoints:" >> "$OUTPUT"
robotstxt http://$TARGET --disallow | grep -i api >> "$OUTPUT"
cat "$OUTPUT"
Comparative Analysis
# Monitor robots.txt changes over time
robotstxt http://example.com --json > robots_$(date +%Y%m%d).json
# Compare with previous day
diff <(jq '[.user_agents[].disallow[]]' robots_$(date -d yesterday +%Y%m%d).json) \
<(jq '[.user_agents[].disallow[]]' robots_$(date +%Y%m%d).json)
Vulnerability Discovery
# Look for potentially misconfigured paths
robotstxt http://example.com --disallow | grep -iE \
'backup|cache|temp|tmp|log|debug|test|dev|staging|sandbox'
Integration with Security Tools
With Burp Suite
# Extract paths for URL rewriting in Burp
robotstxt http://example.com --disallow | \
sed 's|^|http://example.com|' > burp_urls.txt
With OWASP ZAP
# Export for ZAP spider
robotstxt http://example.com --json | jq -r '.user_agents[].disallow[]' | \
sed 's|^|http://example.com|' > zap_urls.txt
Filtering and Refinement
# Case-insensitive filtering
robotstxt http://example.com --disallow | grep -i pattern
# Exclude false positives
robotstxt http://example.com --disallow | grep -v '\.js\|\.css\|\.png'
# Combine conditions
robotstxt http://example.com --disallow | grep -iE 'admin|api' | grep -v robots
Performance and Optimization
# Parallel processing multiple domains
cat domains.txt | xargs -P 5 -I {} robotstxt http://{} --json
# Timeout handling for slow servers
robotstxt --timeout 5 http://example.com
# Retry logic
for attempt in {1..3}; do
robotstxt http://example.com && break
sleep 2
done
Troubleshooting
# Connection issues
robotstxt http://example.com --verbose
# Timeout errors
robotstxt http://example.com --timeout 30
# Protocol selection
robotstxt http://example.com # HTTP
robotstxt https://example.com # HTTPS
# Port specification
robotstxt http://example.com:8080
robotstxt https://example.com:8443
Legal and Ethical Considerations
- Use robots.txt analysis only on targets you own or have explicit authorization to test
- Respect website terms of service and legal boundaries
- Document findings responsibly and securely
- Follow responsible disclosure practices
- Use information gathering for legitimate security assessments only
- Some jurisdictions may restrict this activity without authorization
Best Practices
- Always verify authorization before reconnaissance
- Document all findings with timestamps
- Compare against known sensitive paths in your industry
- Monitor robots.txt changes over time for new disclosures
- Combine with other reconnaissance tools (WhoIs, DNS, Nmap)
- Cross-reference findings with sitemaps and other sources
- Use results to inform vulnerability scanning scope
Related Tools
- Wget - HTTP client with robots.txt support
- Curl - Download and analyze robots.txt directly
- Burp Suite - Web application testing with path discovery
- OWASP ZAP - Security scanning and reconnaissance
- Sitemap Analyzers - Parse XML sitemaps
- Google Search Console - View indexed paths and disallowed URLs