Parsero
Parsero is a specialized tool for parsing and analyzing robots.txt files from web applications. It extracts the hidden paths, disallowed directories, and restricted endpoints that site administrators intended to keep out of search engines, and checks which of those entries actually respond, revealing potential attack surface during security assessments.
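Conceptually, Parsero automates a check you could sketch by hand. A minimal illustration with curl and awk, using example.com as a placeholder target (the pattern is naive and ignores wildcards and comments):
# Fetch robots.txt and pull out the paths crawlers were asked to avoid
curl -s http://example.com/robots.txt | awk '/^Disallow:/ {print $2}'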
Installation
Linux Installation
# Clone the repository
git clone https://github.com/behindthefirewalls/Parsero.git
cd Parsero
# Install dependencies
pip3 install -r requirements.txt
# Verify installation
python3 parsero.py --help
macOS Installation
# Install from the cloned repository (the project ships a setup.py)
git clone https://github.com/behindthefirewalls/Parsero.git
cd Parsero
sudo python3 setup.py install
# Verify installation
parsero --help
Kali Linux Installation
# Parsero is packaged in the Kali repositories
sudo apt update
sudo apt install parsero
# Upgrade to the latest packaged version
sudo apt install --only-upgrade parsero
# Verify installation
parsero -h
Manual Installation
# Parsero requires Python 3; verify it is available
python3 --version
# Install required packages
pip3 install urllib3
pip3 install beautifulsoup4
# Download and setup
git clone https://github.com/behindthefirewalls/Parsero.git
cd Parsero
chmod +x parsero.py
Core Concepts
robots.txt Structure
User-agent: Googlebot
Disallow: /admin
Disallow: /private
Allow: /public
User-agent: *
Disallow: /tmp
Crawl-delay: 5
User-agent: BadBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
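To see how these blocks are interpreted, the sketch below lists the Disallow rules that bind a given user agent, assuming a local copy saved as robots.txt. It matches blocks naively and ignores wildcard and Allow precedence:
# Print the Disallow rules that apply to a specific crawler
awk -v ua="Googlebot" '
  /^User-agent:/ { current = $2 }
  /^Disallow:/ && (current == ua || current == "*") { print $2 }
' robots.txt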
Information Types
- Disallowed paths: Directories forbidden to search engines
- Allowed paths: Explicitly allowed despite parent restrictions
- User-agent rules: Specific bot directives
- Crawl delays: Rate-limiting hints
- Sitemaps: References to site-structure files (see the sitemap sketch below)
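Sitemaps in particular often expose more structure than the rules themselves. A sketch that follows Sitemap references, assuming a plain XML sitemap rather than a gzipped index (example.com is a placeholder):
# Follow Sitemap references and list the URLs they declare
curl -s http://example.com/robots.txt | awk '/^Sitemap:/ {print $2}' | \
while read -r sm; do
  curl -s "$sm" | grep -oE '<loc>[^<]+' | sed 's/<loc>//'
done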
Security Implications
- Reveals structure of sensitive areas (each of these can be spot-checked; see the sketch after this list)
- Indicates hidden admin panels
- Shows private directories
- May expose test/staging environments
- Hints at API endpoints
- Reveals backup locations
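An entry that still answers HTTP 200 is live attack surface rather than history, which is why status codes matter here. The same check sketched with curl, for authorized targets only (example.com is a placeholder):
# Probe each disallowed path and report its status code
curl -s http://example.com/robots.txt | awk '/^Disallow:/ {print $2}' | \
while read -r p; do
  printf '%s %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "http://example.com$p")" "$p"
done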
Basic Usage
Parse Single URL
# Parse robots.txt from a website
python3 parsero.py -u http://example.com
# Show only the Disallow entries that answer HTTP 200
python3 parsero.py -u http://example.com -o
# Search Bing for indexed copies of the disallowed paths
python3 parsero.py -u http://example.com -sb
# Save results to a file (Parsero writes to stdout)
python3 parsero.py -u http://example.com > results.txt
Specify Different Port
# Non-standard HTTP port
python3 parsero.py -u http://example.com:8080
# HTTPS with custom port
python3 parsero.py -u https://example.com:8443
# Localhost testing
python3 parsero.py -u http://localhost:5000
Batch URL Processing
# Parse several URLs in a loop (-u takes a single URL)
for u in http://example1.com http://example2.com http://example3.com; do
  python3 parsero.py -u "$u"
done
# Read target domains from a file
python3 parsero.py -f urls.txt
# Save the combined run to a file
python3 parsero.py -f urls.txt > results.txt
Output Formats
Standard Output
# Display results in the terminal (each entry with its HTTP status)
python3 parsero.py -u http://example.com
# Restrict output to live entries (HTTP 200 only)
python3 parsero.py -u http://example.com -o
# Keep a copy alongside the terminal output
python3 parsero.py -u http://example.com | tee parsero_run.txt
File Output
# Save to a text file
python3 parsero.py -u http://example.com > robots_output.txt
# Append to an existing file
python3 parsero.py -u http://example.com >> results.txt
# Write into a specific directory
mkdir -p /tmp/parsero_results
python3 parsero.py -u http://example.com > /tmp/parsero_results/example.com.txt
CSV Export
# Parsero has no native CSV export; quote its stdout line by line instead
python3 parsero.py -u http://example.com | sed 's/.*/"&"/' > results.csv
# Multiple targets into one CSV
python3 parsero.py -f urls.txt | sed 's/.*/"&"/' > all_results.csv
JSON Output
# No native JSON export either; jq can wrap the raw lines (requires jq)
python3 parsero.py -u http://example.com | jq -R . | jq -s . > results.json
# jq pretty-prints by default; use -c for compact output
python3 parsero.py -u http://example.com | jq -R . | jq -s -c .
Advanced Scanning Options
Verify Restricted Paths
# Show only the entries that actually respond with HTTP 200 (authorized testing only)
python3 parsero.py -u http://example.com -o
# Check whether Bing has already indexed the disallowed content
python3 parsero.py -u http://example.com -sb
# Fetch a discovered path directly for closer inspection
curl -s -i http://example.com/admin | head -n 20
Timeout and Retry Configuration
# Parsero has no timeout flag; bound the run with coreutils timeout
timeout 30 python3 parsero.py -u http://example.com
# Retry a flaky target a few times
for i in 1 2 3; do
  timeout 30 python3 parsero.py -u http://example.com && break
  sleep 5
done
Proxy Configuration
# Parsero has no proxy flag; Python's HTTP stack honors the standard
# proxy environment variables
export http_proxy=http://proxy.example.com:8080
export https_proxy=http://proxy.example.com:8443
python3 parsero.py -u http://example.com
# Proxy with authentication
export http_proxy=http://user:pass@proxy.example.com:8080
python3 parsero.py -u http://example.com
Custom User-Agent
# Parsero has no user-agent option; test UA-dependent behavior with curl
curl -s -A "CustomBot/1.0" http://example.com/robots.txt
# Impersonate a specific crawler to compare responses
curl -s -A "Googlebot/2.1" http://example.com/robots.txt
# Send additional custom headers
curl -s -A "Mozilla/5.0" -H "X-Requested-With: audit" http://example.com/robots.txt
Path Discovery
Extract and List Paths
# Extract all reported paths to a file
python3 parsero.py -u http://example.com > paths.txt
# List unique paths only
python3 parsero.py -u http://example.com | sort -u
# Count total paths found
python3 parsero.py -u http://example.com | wc -l
Analyze Path Patterns
# Find paths containing a keyword
python3 parsero.py -u http://example.com | grep admin
# Find all API endpoints
python3 parsero.py -u http://example.com | grep /api
# Identify sensitive paths
python3 parsero.py -u http://example.com | grep -E "(admin|private|backup|tmp)"
Filter Results
# Show only directories
python3 parsero.py -u http://example.com | grep '/$'
# Show entries that contain a dot (likely files)
python3 parsero.py -u http://example.com | grep '\.'
# Exclude certain paths
python3 parsero.py -u http://example.com | grep -v '/search'
Reconnaissance Workflow
Multi-Target Reconnaissance
# Parse robots.txt from multiple sites
while read -r target; do
  python3 parsero.py -u "$target" > "results_${target##*/}.txt"
done < targets.txt
# Combine all results
cat results_*.txt > all_discovered_paths.txt
Competitive Intelligence
# Analyze comparable sites
python3 parsero.py -u http://competitor1.com > competitor1.txt
python3 parsero.py -u http://competitor2.com > competitor2.txt
# Compare discovered structures
diff competitor1.txt competitor2.txt
API Endpoint Discovery
# Parse robots.txt looking for APIs
python3 parsero.py -u http://api.example.com
# Filter for API paths
python3 parsero.py -u http://example.com | grep -E "/api|/v1|/v2|/rest"
# Extract endpoint patterns
python3 parsero.py -u http://example.com | grep -oP '/api/[^?]*' | sort | uniq
Subdomain Reconnaissance
# Check robots.txt on multiple subdomains
for sub in www api staging dev admin; do
  echo "=== $sub.example.com ==="
  python3 parsero.py -u http://$sub.example.com 2>/dev/null
done
# Save results
for sub in www api staging dev; do
  python3 parsero.py -u http://$sub.example.com > "$sub.txt"
done
Vulnerability Mapping
Identify Information Disclosure
# Find sensitive directories
python3 parsero.py -u http://example.com | grep -E "(backup|logs|config|private)"
# Identify admin panels
python3 parsero.py -u http://example.com | grep -i admin
# Find test/debug endpoints
python3 parsero.py -u http://example.com | grep -E "(test|debug|dev|staging)"
API Surface Mapping
# Discover API structure (live entries only)
python3 parsero.py -u http://api.example.com -o
# Map API versions
python3 parsero.py -u http://example.com | grep -oP '/api/v[0-9]+'
# Identify deprecated APIs
python3 parsero.py -u http://example.com | grep -E "(legacy|deprecated|v1)"
Authentication Points
# Find login/auth paths
python3 parsero.py -u http://example.com | grep -E "(login|auth|signin|register)"
# Identify account management
python3 parsero.py -u http://example.com | grep -E "(profile|account|user)"
# Find admin interfaces
python3 parsero.py -u http://example.com | grep -E "(admin|panel|dashboard)"
Integration with Other Tools
Combine with Directory Brute-Forcing
# Use Parsero output as a targeted wordlist (trim to bare paths if needed)
python3 parsero.py -u http://example.com > discovered.txt
# Verify the discoveries with a directory brute-forcer such as gobuster
gobuster dir -u http://example.com -w discovered.txt -o report.txt
Feed into Web Crawling
# Parse robots.txt and save URLs for a crawler
python3 parsero.py -u http://example.com > urls.txt
# Crawl discovered URLs
wget -i urls.txt --no-parent
# Or with curl
cat urls.txt | xargs -I {} curl -I {}
Correlate with Scan Results
# Compare robots.txt findings with scanner output
python3 parsero.py -u http://example.com > from_robots.txt
# Use nmap/nikto to verify access
nikto -h http://example.com -o nikto_results.txt
# Cross-reference findings
comm -12 <(sort from_robots.txt) <(sort nikto_results.txt)
Batch Operations
Process Multiple Sites
#!/bin/bash
# Batch processing script
mkdir -p results
while read -r url; do
  echo "Processing: $url"
  python3 parsero.py -u "$url" > "results/$(echo "$url" | cut -d'/' -f3).txt"
done < sites.txt
Aggregate Results
# Combine results from multiple sites
python3 parsero.py -f urls.txt > combined.txt
# Create summary statistics
echo "Total unique paths found:"
sort -u combined.txt | wc -l
# Find most common path patterns
cat combined.txt | grep -oP '^/[^/]+' | sort | uniq -c | sort -rn
Automated Reporting
# Capture a full run for reporting
python3 parsero.py -u http://example.com > report.txt
# Generate formatted output
echo "=== robots.txt Analysis ===" > full_report.txt
echo "Target: example.com" >> full_report.txt
echo "Date: $(date)" >> full_report.txt
cat report.txt >> full_report.txt
Evasion and Stealth
Rate Limiting
# Parsero has no delay option; space out invocations with sleep
python3 parsero.py -u http://example.com
# Multiple requests with delays between targets
while read -r url; do
  python3 parsero.py -u "$url"
  sleep 30
done < urls.txt
User-Agent Rotation
# Parsero has no user-agent option; rotate agents with curl instead
for ua in "Googlebot/2.1" "Mozilla/5.0" "bingbot/2.0"; do
  curl -s -A "$ua" http://example.com/robots.txt > "robots_${ua%%/*}.txt"
done
Obfuscation
# Mask the origin by routing through a proxy and pacing requests
export http_proxy=http://proxy.example.com:8080
python3 parsero.py -u http://example.com
sleep 5
Data Analysis
Path Frequency Analysis
# Collect restricted paths across many targets
python3 parsero.py -f urls.txt > results.txt
# Analyze frequency
cat results.txt | sort | uniq -c | sort -rn
# Export statistics
python3 << 'EOF'
from collections import Counter

with open('results.txt') as f:
    paths = f.readlines()

# Extract path components
components = []
for path in paths:
    parts = path.strip().split('/')
    components.extend([p for p in parts if p])

counter = Counter(components)
for comp, count in counter.most_common(20):
    print(f"{comp}: {count}")
EOF
Structural Mapping
# Count entries per parent directory to sketch the hierarchy
python3 parsero.py -u http://example.com | \
sed 's|/[^/]*$||' | \
sort | uniq -c | sort -rn
Troubleshooting
Connection Issues
# Test with a generous overall timeout
timeout 60 python3 parsero.py -u http://example.com
# Verify robots.txt exists
curl -I http://example.com/robots.txt
# Check whether the server answers a browser user agent differently
curl -s -A "Mozilla/5.0" -I http://example.com/robots.txt
Empty Results
# Verify the site has a robots.txt at all
wget -q -O - http://example.com/robots.txt
# Compare with what Parsero reports
python3 parsero.py -u http://example.com
# Verify the URL format
python3 parsero.py -u http://example.com:80 # Explicit port
Proxy Issues
# Confirm the proxy itself answers
curl -x http://127.0.0.1:8080 -I http://example.com/robots.txt
# Then run Parsero through it via the environment
export http_proxy=http://127.0.0.1:8080
python3 parsero.py -u http://example.com
# Include credentials in the proxy URL if required
export http_proxy=http://user:password@proxy:8080
Best Practices
Authorized Testing
- Obtain written authorization before analysis
- Respect robots.txt directives in production
- Document all discovered information (see the logging helper after this list)
- Follow responsible disclosure practices
- Maintain ethical standards
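The documentation habit is easy to automate. A small hedged helper for keeping a timestamped record of every run (the evidence/ directory and file naming are illustrative):
# Log each run with a timestamp for the engagement record
target=http://example.com
mkdir -p evidence
python3 parsero.py -u "$target" | tee "evidence/$(date +%F_%H%M)_${target##*/}.txt"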
Effective Analysis
# Comprehensive reconnaissance workflow
python3 parsero.py -u http://target.example.com | tee robots_analysis.txt
# Create summary
echo "=== Robots.txt Analysis Summary ===" > summary.txt
echo "Total entries: $(wc -l < robots_analysis.txt)" >> summary.txt
echo "Unique paths: $(sort robots_analysis.txt | uniq | wc -l)" >> summary.txt
Legal and Ethical Considerations
Parsero should be used:
- For authorized security assessments
- With written permission from site owners
- To improve understanding of web security
- In compliance with applicable laws
- Respecting privacy and confidentiality
Never:
- Access restricted paths without authorization
- Download sensitive files discovered through robots.txt
- Share discovered information publicly
- Violate applicable laws or regulations
Resources
- GitHub: https://github.com/behindthefirewalls/Parsero
- robots.txt specification: https://www.robotstxt.org/
- Web reconnaissance guides
- OWASP reconnaissance methodology
- Responsible disclosure practices