GoSpider is a fast, lightweight web spider written in Go for crawling websites and extracting links. It performs reconnaissance by discovering endpoints, JavaScript files, and other resources referenced in a web application. GoSpider is useful for OSINT gathering, application mapping, and authorized penetration testing.
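A first run can be as simple as the following (quick-start sketch; installation steps follow below):
# Crawl one site at depth 1 and print only the discovered URLs
gospider -s https://example.com -d 1 -q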
# Install Go (if not already installed)
sudo apt-get update
sudo apt-get install golang-go git
# Verify Go installation
go version
# Clone and build
git clone https://github.com/jaeles-project/gospider.git
cd gospider
go build -o gospider main.go
# Or use go install
go install github.com/jaeles-project/gospider@latest
# Verify installation
./gospider --help
# or if using go install
gospider --help
# With custom build flags
go build -ldflags "-s -w" -o gospider main.go
# Check binary
file gospider
./gospider -h
gospider -s <url> [options]
| Option | Description | Example |
|---|---|---|
| -s, --site | Target URL (required) | -s https://example.com |
| --cookie | Cookie string | --cookie "session=abc123" |
| -H, --header | Custom HTTP header (repeatable) | -H "Authorization: Bearer token" |
| -p, --proxy | Proxy URL | -p http://proxy:8080 |
| -o, --output | Output folder (one file per site) | -o results |
| -d, --depth | Crawl depth (0 = unlimited) | -d 3 |
| -m, --timeout | Request timeout (seconds) | -m 10 |
| -c, --concurrent | Max concurrent requests per domain | -c 50 |
| -t, --threads | Sites crawled in parallel | -t 5 |
| -u, --user-agent | User-Agent: web, mobi, or a custom string | -u "Mozilla/5.0..." |
| --blacklist | Exclude URLs matching a regex | --blacklist "logout\|signout" |
| -k, --delay | Delay between requests (seconds) | -k 2 |
| -a, --other-source | Find URLs from third-party sources (Archive.org, etc.) | -a |
| -q, --quiet | Suppress logs and print URLs only | -q |
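Several of these options combined in one hedged example (the X-Request-ID header is purely illustrative):
# Authenticated crawl through a local proxy, URLs-only output
gospider -s https://example.com \
    --cookie "session=abc123" \
    -H "X-Request-ID: recon-001" \
    -p http://127.0.0.1:8080 \
    -d 2 -c 20 -m 15 -q > crawl.txt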
# Simple crawl of a website
gospider -s https://example.com
# Crawl and save output
gospider -s https://example.com -o results.txt
# Crawl with specific depth
gospider -s https://example.com -d 2
# Basic authentication (no dedicated flag; send the header directly)
gospider -s https://example.com -H "Authorization: Basic $(printf 'admin:password123' | base64)"
# Cookie-based authentication
gospider -s https://example.com --cookie "PHPSESSID=abc123def456"
# Custom Authorization header
gospider -s https://example.com -H "Authorization: Bearer eyJhbGc..."
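For form-based logins there is no built-in flow; one pattern is to grab the session cookie with curl first (the /login path and field names here are hypothetical):
# Hypothetical form login: capture Set-Cookie, then crawl authenticated
cookie=$(curl -s -i -d 'user=admin&pass=password123' https://example.com/login \
    | grep -ioP 'set-cookie: \K[^;]+' | head -1)
gospider -s https://example.com --cookie "$cookie"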
# Note: -k is --delay (seconds between requests), not an insecure-TLS switch
# Polite crawl with a 2-second delay between requests
gospider -s https://example.com -k 2
# Delay combined with a session cookie
gospider -s https://self-signed.example.com -k 2 --cookie "sessionid=value"
# Increase concurrent requests for faster crawling
gospider -s https://example.com -c 100 -d 3
# Balance speed against server load
gospider -s https://example.com -c 50 -m 10
# Route through HTTP proxy
gospider -s https://example.com -p http://proxy.company.com:8080
# Proxy with authentication
gospider -s https://example.com -p http://user:pass@proxy:8080
# SOCKS5 proxy
# Note: GoSpider primarily supports HTTP proxies
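Go's standard HTTP transport does accept socks5:// proxy URLs, so passing one to -p may work, but treat this as unverified rather than a documented feature:
gospider -s https://example.com -p socks5://127.0.0.1:1080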
# Multiple custom headers
gospider -s https://example.com \
    -H "Authorization: Bearer token123" \
    -H "X-Custom-Header: value" \
    --cookie "session=abc123; lang=en"
# User-Agent spoofing
gospider -s https://example.com \
    -u "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
# Exclude paths matching a regex
gospider -s https://example.com \
    --blacklist "logout|signout|search|download"
# Depth-limited crawl
gospider -s https://example.com -d 1 # Only links from homepage
gospider -s https://example.com -d 2 # One level deep
gospider -s https://example.com -d 5 # Very deep crawl
# Default output to console
gospider -s https://example.com
# Save to file
gospider -s https://example.com > results.txt
# Save with timestamp
gospider -s https://example.com > results_$(date +%Y%m%d_%H%M%S).txt
# Extract unique URLs (-q keeps the pipeline to bare URLs)
gospider -s https://example.com -q | sort -u > unique_urls.txt
# Count discovered URLs
gospider -s https://example.com -q | wc -l
# Filter by file type
gospider -s https://example.com -q | grep -E "\.(js|json|xml|pdf)$"
# Find JavaScript files
gospider -s https://example.com -q | grep "\.js$" > js_files.txt
# Find API endpoints
gospider -s https://example.com -q | grep -E "/api/" > api_endpoints.txt
# Extract host only
gospider -s https://example.com -q | cut -d'/' -f3 | sort -u
# Extract paths only
gospider -s https://example.com -q | sed 's|https\?://[^/]*||' | sort -u
# Find parameterized URLs
gospider -s https://example.com -q | grep "?" > parameterized_urls.txt
# Group by file extension
gospider -s https://example.com -q | \
    awk -F'.' '{print $NF}' | sort | uniq -c | sort -rn
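The crawl-then-filter pattern above repeats often enough to be worth a small helper; in this sketch the function name and summary layout are made up, and the crawl runs once against a saved file:
crawl_summary() {
    local url=$1 out=$2
    # One quiet crawl, deduplicated to a file
    gospider -s "$url" -q | sort -u > "$out"
    echo "URLs:          $(wc -l < "$out")"
    echo "JS files:      $(grep -c '\.js$' "$out")"
    echo "Parameterized: $(grep -c '?' "$out")"
}
crawl_summary https://example.com crawl.txt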
#!/usr/bin/env python3
"""Crawl a target with GoSpider and summarize the discovered URLs."""
import subprocess
from urllib.parse import urlparse
from collections import defaultdict

def crawl_and_analyze(url):
    """Crawl URL with gospider -q (URLs only) and analyze the results."""
    cmd = ['gospider', '-s', url, '-q']
    try:
        output = subprocess.check_output(cmd, text=True, stderr=subprocess.DEVNULL)
    except (subprocess.CalledProcessError, FileNotFoundError) as e:
        print(f"Error: {e}")
        return None, None
    urls = [u for u in output.strip().split('\n') if u]
    analysis = {
        'total': len(urls),
        'unique': len(set(urls)),
        'by_ext': defaultdict(int),
        'by_path': defaultdict(int),
        'with_params': 0,
        'by_domain': defaultdict(int),
    }
    for url_str in urls:
        parsed = urlparse(url_str)
        # Count by extension (last dot-separated component of the path)
        ext = parsed.path.split('.')[-1] if '.' in parsed.path else 'no_ext'
        analysis['by_ext'][ext] += 1
        # Count by domain
        analysis['by_domain'][parsed.netloc] += 1
        # Count URLs carrying query parameters
        if parsed.query:
            analysis['with_params'] += 1
        # Path analysis (truncate long paths to 50 chars)
        analysis['by_path'][parsed.path[:50]] += 1
    return analysis, urls

def print_report(analysis, urls):
    """Print a detailed analysis report."""
    print("\n[+] Crawl Results Summary")
    print(f"    Total URLs: {analysis['total']}")
    print(f"    Unique URLs: {analysis['unique']}")
    print(f"    URLs with parameters: {analysis['with_params']}")
    print("\n[+] Top file types:")
    for ext, count in sorted(analysis['by_ext'].items(),
                             key=lambda x: x[1], reverse=True)[:10]:
        print(f"    .{ext}: {count}")
    print("\n[+] Domains found:")
    for domain, count in sorted(analysis['by_domain'].items(),
                                key=lambda x: x[1], reverse=True):
        print(f"    {domain}: {count} URLs")

# Example usage
if __name__ == '__main__':
    target = "https://example.com"
    analysis, urls = crawl_and_analyze(target)
    if analysis:
        print_report(analysis, urls)
        # Save results
        with open('crawl_results.txt', 'w') as f:
            f.write('\n'.join(urls))
        print("\n[+] Results saved to crawl_results.txt")
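Saved as, say, crawl_analyze.py (filename assumed), the script needs only the Python standard library plus gospider on PATH:
python3 crawl_analyze.py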
# Basic crawl of target (-q keeps the output to bare URLs)
gospider -s https://example.com -d 1 -q > step1_homepage.txt
# Check results
head step1_homepage.txt
# Deeper crawl with more concurrency
gospider -s https://example.com -d 3 -c 50 -q > step2_deeper.txt
# Find JavaScript files for analysis
grep "\.js$" step2_deeper.txt > javascript_files.txt
# Look for subdomains in crawled URLs
grep -oP '(?:https?://)?(?:\w+\.)*\K[\w-]+(?=\.example\.com)' \
step2_deeper.txt | sort -u > subdomains.txt
# Crawl each subdomain
while read -r subdomain; do
    echo "[*] Crawling $subdomain.example.com"
    gospider -s "https://$subdomain.example.com" -d 2 -q >> all_subdomains.txt
done < subdomains.txt
# Find API endpoints
grep -E "/api/|/rest/|/v[0-9]/" step2_deeper.txt > api_endpoints.txt
# Find endpoints with parameters
grep "?" step2_deeper.txt > parameterized_endpoints.txt
# Analyze parameter names
grep -oP '[?&][^=&]+' step2_deeper.txt | sort -u
#!/bin/bash
# Crawl multiple domains
targets=(
"https://example1.com"
"https://example2.com"
"https://example3.com"
)
output_dir="crawl_results"
mkdir -p "$output_dir"
for target in "${targets[@]}"; do
    echo "[*] Crawling $target"
    domain=$(echo "$target" | cut -d'/' -f3)
    gospider -s "$target" -d 2 -c 50 -q > "$output_dir/$domain.txt"
done
echo "[+] All crawls complete"
echo "Results in $output_dir/"
# Crawl and discover endpoints, then test with Nuclei
gospider -s https://example.com -d 2 -q > endpoints.txt
# Use the endpoint list with Nuclei
nuclei -l endpoints.txt -t cves/
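Nuclei's own filters can narrow the run further; for example, limiting templates by severity (template paths depend on your local install):
nuclei -l endpoints.txt -t cves/ -severity critical,high -o nuclei_findings.txt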
#!/bin/bash
# Comprehensive reconnaissance
target=$1
echo "[*] Starting reconnaissance on $target"
# Crawl
gospider -s "$target" -d 3 -c 100 > crawl_results.txt
# Extract JavaScript files
grep "\.js$" crawl_results.txt > js_files.txt
echo "[+] Found $(wc -l < js_files.txt) JavaScript files"
# Extract API endpoints
grep "/api/" crawl_results.txt > api_endpoints.txt
echo "[+] Found $(wc -l < api_endpoints.txt) API endpoints"
# Extract unique domains
grep -oP 'https?://[^/]+' crawl_results.txt | \
    sort -u > discovered_domains.txt
echo "[+] Found $(wc -l < discovered_domains.txt) unique domains"
# Extract URLs with parameters
grep "?" crawl_results.txt > parameterized_urls.txt
echo "[+] Found $(wc -l < parameterized_urls.txt) parameterized URLs"
# Maximum speed (aggressive)
gospider -s https://example.com \
    -d 3 \
    -c 200 \
    -m 5
# Balanced approach
gospider -s https://example.com \
    -d 2 \
    -c 50 \
    -m 10
# Stealth approach (slow, less likely to trip rate limits)
gospider -s https://example.com \
    -d 1 \
    -c 10 \
    -m 30 \
    -k 2
| Depth | Typical URLs | Speed | Memory |
|---|---|---|---|
| 1 | 50-200 | Very fast | Low |
| 2 | 200-1000 | Fast | Low-Med |
| 3 | 500-3000 | Medium | Medium |
| 4+ | 2000+ | Slow | High |
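These counts are rough; to measure a specific target, a quick depth sweep works (unique-URL totals vary widely by site):
for d in 1 2 3; do
    echo "depth $d: $(gospider -s https://example.com -d "$d" -q | sort -u | wc -l) unique URLs"
done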
| Problem | Solution |
|---|---|
| Connection refused | Target may be blocking; reduce concurrency (-c 10) and add a delay (-k 2) |
| Timeout errors | Increase the timeout: -m 20 or -m 30 |
| Too few results | Increase depth (-d 3); try --sitemap and third-party sources (-a) |
| Memory issues | Reduce concurrent requests: -c 20 |
| SSL errors | Isolate TLS problems with curl -vk; GoSpider has no insecure-TLS flag (-k is --delay) |
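The reduce-concurrency advice can be automated with a fallback wrapper; this sketch assumes gospider exits nonzero when the crawl fails, which is worth verifying on your build:
crawl_with_fallback() {
    local url=$1
    # First attempt with normal settings
    gospider -s "$url" -c 50 -m 10 -q && return
    # Gentler retry: low concurrency, 2s delay, longer timeout
    echo "[!] Retrying $url with conservative settings" >&2
    gospider -s "$url" -c 10 -k 2 -m 30 -q
}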
# Verbose / debug output
gospider -s https://example.com -v
gospider -s https://example.com --debug
# Test connectivity
curl -I https://example.com
# Check for redirects
curl -L -I https://example.com
# Monitor resource usage
time gospider -s https://example.com -d 2
- Only crawl websites you own or are explicitly authorized to test
- Respect robots.txt and terms of service
- Use appropriate delays and concurrency levels
- Identify yourself with appropriate user-agent if needed
- Document all scanning activities
# Respectful crawling: limited depth, low concurrency, generous timeout,
# a delay between requests, and an identifying User-Agent
gospider -s https://example.com \
    -d 2 \
    -c 10 \
    -m 20 \
    -k 2 \
    -u "Reconnaissance Bot (contact: admin@example.com)"
| Tool | Advantage |
|---|---|
| Burp Suite Spider | Commercial, full-featured GUI |
| Scrapy | Python-based, more customizable |
| OWASP ZAP Spider | Free, integrated security testing |
| Wget | Simple, recursive download |
| cURL | Lightweight, script-friendly |
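For a rough sense of the Wget row above, an equivalent depth-2 link-discovery run looks like this (spider mode; the output parsing is approximate):
wget --spider -r -l 2 -nv https://example.com 2>&1 \
    | grep -oE 'https?://[^ ]+' | sort -u > wget_urls.txt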