
GoSpider

GoSpider is a fast, lightweight web spider written in Go for crawling websites and extracting links. It performs reconnaissance by discovering endpoints, JavaScript files, and other resources referenced in a web application. GoSpider is useful for OSINT gathering, application mapping, and authorized penetration testing.

# Install Go (if not already installed)
sudo apt-get update
sudo apt-get install golang-go git

# Verify Go installation
go version
# Clone and build
git clone https://github.com/jaeles-project/gospider.git
cd gospider
go build -o gospider main.go

# Or use go install
go install github.com/jaeles-project/gospider@latest

# Verify installation
./gospider --help
# or if using go install
gospider --help
# With custom build flags
go build -ldflags "-s -w" -o gospider main.go

# Check binary
file gospider
./gospider -h
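If the binary was installed with go install, it lands in $(go env GOPATH)/bin, which may not be on your PATH yet; a minimal sketch for a bash shell (adjust the rc file for zsh or other shells):

# Add Go's bin directory to PATH (assumes bash; adapt for your shell)
echo 'export PATH="$PATH:$(go env GOPATH)/bin"' >> ~/.bashrc
source ~/.bashrc

# Confirm the binary now resolves
which gospider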
gospider -s <url> [options]
| Option | Description | Example |
| --- | --- | --- |
| -s, --site | Target URL (required) | -s https://example.com |
| --cookie | HTTP cookie string | --cookie "session=abc123" |
| -H, --header | Custom HTTP header (repeatable) | -H "Authorization: Bearer token" |
| -p, --proxy | HTTP proxy URL | -p http://proxy:8080 |
| -o, --output | Output folder | -o results |
| -d, --depth | Crawl depth | -d 3 |
| -m, --timeout | Request timeout (seconds) | -m 10 |
| -c, --concurrent | Max concurrent requests per domain | -c 50 |
| -t, --threads | Sites crawled in parallel | -t 5 |
| -u, --user-agent | Custom User-Agent | -u "Mozilla/5.0..." |
| -k, --delay | Delay between requests (seconds) | -k 1 |
| --blacklist | Exclude URLs matching a regex | --blacklist "logout" |
| -a, --other-source | Include URLs from third-party sources (Archive.org, CommonCrawl, etc.) | -a |
| -q, --quiet | Suppress logs, print only URLs | -q |

Run gospider -h for the full, authoritative flag list.
# Simple crawl of a website
gospider -s https://example.com

# Crawl and save output (-o writes results into a folder, one file per site)
gospider -s https://example.com -o results

# Crawl with specific depth
gospider -s https://example.com -d 2
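GoSpider can also pull URLs from sitemap.xml, robots.txt, and third-party archives; a hedged sketch of those switches (confirm the exact flag names with gospider -h, as they can vary between releases):

# Crawl plus sitemap.xml, robots.txt, third-party sources, and subdomains
gospider -s https://example.com -d 2 --sitemap --robots -a --subs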
# Basic authentication (GoSpider has no dedicated username/password flags;
# send the credentials as an Authorization header instead)
gospider -s https://example.com -H "Authorization: Basic $(printf 'admin:password123' | base64)"

# Cookie-based authentication
gospider -s https://example.com --cookie "PHPSESSID=abc123def456"

# Custom Authorization header
gospider -s https://example.com -H "Authorization: Bearer eyJhbGc..."
# TLS note: GoSpider exposes no dedicated insecure/skip-verify flag;
# its -k flag sets a per-request delay in seconds, not TLS behaviour
gospider -s https://example.com -k 1

# Crawling a host with a self-signed certificate, passing a session cookie
gospider -s https://self-signed.example.com --cookie "sessionid=value"
# Increase concurrent requests for faster crawling
gospider -s https://example.com -c 100 -d 3

# Balance between speed and load
gospider -s https://example.com -c 50 -m 10
# Route through HTTP proxy
gospider -s https://example.com -p http://proxy.company.com:8080

# Proxy with authentication
gospider -s https://example.com -p http://user:pass@proxy:8080

# SOCKS5 proxy
# Note: GoSpider primarily supports HTTP proxies
# Multiple custom headers
gospider -s https://example.com \
  -H "Authorization: Bearer token123" \
  -H "X-Custom-Header: value" \
  -c "session=abc123; lang=en"

# User-Agent spoofing
gospider -s https://example.com \
  --useragent "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
# Exclude certain paths or patterns
gospider -s https://example.com \
  -x "logout,signout,search,download"

# Depth-limited crawl
gospider -s https://example.com -d 1  # Only links from homepage
gospider -s https://example.com -d 2  # One level deep
gospider -s https://example.com -d 5  # Very deep crawl
# Default output to console
gospider -s https://example.com

# Save to file
gospider -s https://example.com > results.txt

# Save with timestamp
gospider -s https://example.com > results_$(date +%Y%m%d_%H%M%S).txt
# Extract unique URLs
gospider -s https://example.com | sort -u > unique_urls.txt

# Count discovered URLs
gospider -s https://example.com | wc -l

# Filter by file type
gospider -s https://example.com | grep -E "\.(js|json|xml|pdf)$"

# Find JavaScript files
gospider -s https://example.com | grep "\.js$" > js_files.txt

# Find API endpoints
gospider -s https://example.com | grep -E "/api/" > api_endpoints.txt
# Extract domain only
gospider -s https://example.com | cut -d'/' -f3 | sort -u

# Extract paths only
gospider -s https://example.com | sed 's|https\?://[^/]*||' | sort -u

# Find parameter patterns
gospider -s https://example.com | grep "?" > parameterized_urls.txt

# Group by file extension
gospider -s https://example.com | \
  awk -F'.' '{print $NF}' | sort | uniq -c | sort -rn
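For scripted post-processing, structured output is often easier than grepping raw lines; a hedged sketch using the --json flag with jq (the "output" field name is an assumption, inspect one raw JSON line first):

# Emit one JSON object per finding and pull out the discovered URL
gospider -s https://example.com --json | jq -r '.output' | sort -u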
#!/usr/bin/env python3
import subprocess
from urllib.parse import urlparse
from collections import defaultdict

def crawl_and_analyze(url):
    """Crawl URL and analyze results"""
    # -q keeps the output to bare URLs (no log prefixes), which simplifies parsing
    cmd = ['gospider', '-s', url, '-q']
    
    try:
        output = subprocess.check_output(cmd, text=True, stderr=subprocess.DEVNULL)
    except subprocess.CalledProcessError as e:
        print(f"Error: {e}")
        return None, []  # keep the return shape consistent for unpacking below
    
    urls = output.strip().split('\n')
    analysis = {
        'total': len(urls),
        'unique': len(set(urls)),
        'by_ext': defaultdict(int),
        'by_path': defaultdict(int),
        'with_params': 0,
        'by_domain': defaultdict(int)
    }
    
    for url_str in urls:
        if not url_str:
            continue
        
        # Parse URL
        parsed = urlparse(url_str)
        
        # Count by extension
        ext = parsed.path.split('.')[-1] if '.' in parsed.path else 'no_ext'
        analysis['by_ext'][ext] += 1
        
        # Count by domain
        analysis['by_domain'][parsed.netloc] += 1
        
        # Count with parameters
        if parsed.query:
            analysis['with_params'] += 1
        
        # Path analysis
        path = parsed.path[:50]  # First 50 chars
        analysis['by_path'][path] += 1
    
    return analysis, urls

def print_report(analysis, urls):
    """Print detailed analysis report"""
    print(f"\n[+] Crawl Results Summary")
    print(f"    Total URLs: {analysis['total']}")
    print(f"    Unique URLs: {analysis['unique']}")
    print(f"    URLs with parameters: {analysis['with_params']}")
    
    print(f"\n[+] Top file types:")
    for ext, count in sorted(analysis['by_ext'].items(), 
                            key=lambda x: x[1], reverse=True)[:10]:
        print(f"    .{ext}: {count}")
    
    print(f"\n[+] Domains found:")
    for domain, count in sorted(analysis['by_domain'].items(),
                               key=lambda x: x[1], reverse=True):
        print(f"    {domain}: {count} URLs")

# Example usage
if __name__ == '__main__':
    target = "https://example.com"
    analysis, urls = crawl_and_analyze(target)
    
    if analysis:
        print_report(analysis, urls)
        
        # Save results
        with open('crawl_results.txt', 'w') as f:
            f.write('\n'.join(urls))
        print(f"\n[+] Results saved to crawl_results.txt")
# Basic crawl of target
gospider -s https://example.com -d 1 > step1_homepage.txt

# Check results
head step1_homepage.txt
# Deeper crawl with more concurrency
gospider -s https://example.com -d 3 -c 50 > step2_deeper.txt

# Find JavaScript files for analysis
grep "\.js$" step2_deeper.txt > javascript_files.txt
# Look for subdomains in crawled URLs
grep -oP '(?:https?://)?(?:\w+\.)*\K[\w-]+(?=\.example\.com)' \
  step2_deeper.txt | sort -u > subdomains.txt

# Crawl each subdomain
for subdomain in $(cat subdomains.txt); do
  echo "[*] Crawling $subdomain.example.com"
  gospider -s https://$subdomain.example.com -d 2 >> all_subdomains.txt
done
# Find API endpoints
grep -E "/api/|/rest/|/v[0-9]/" step2_deeper.txt > api_endpoints.txt

# Find endpoints with parameters
grep "?" step2_deeper.txt > parameterized_endpoints.txt

# Analyze parameter names
grep -oP '[?&]\K[^=&#]+(?==)' step2_deeper.txt | sort -u
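A quick way to see which discovered API endpoints actually respond is to probe them and record status codes; a minimal curl loop (authentication headers omitted):

# Probe each endpoint and record the HTTP status code
grep -oE 'https?://[^ ]+' api_endpoints.txt | sort -u | while read -r url; do
  code=$(curl -sk -o /dev/null -w '%{http_code}' "$url")
  echo "$code $url"
done | sort > api_status.txt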
#!/bin/bash
# Crawl multiple domains

targets=(
  "https://example1.com"
  "https://example2.com"
  "https://example3.com"
)

output_dir="crawl_results"
mkdir -p $output_dir

for target in "${targets[@]}"; do
  echo "[*] Crawling $target"
  domain=$(echo "$target" | cut -d'/' -f3)
  gospider -s "$target" -d 2 -c 50 > "$output_dir/$domain.txt"
done

echo "[+] All crawls complete"
echo "Results in $output_dir/"
# Crawl and discover endpoints, then test with Nuclei
gospider -s https://example.com -d 2 -q > endpoints.txt  # -q prints bare URLs only

# Use endpoints with Nuclei
nuclei -l endpoints.txt -t cves/
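Dead or duplicate entries can optionally be filtered out with ProjectDiscovery's httpx before scanning (assumes httpx is installed; adjust the template path to your local nuclei setup):

# Keep only live URLs, then scan them
cat endpoints.txt | httpx -silent > live_endpoints.txt
nuclei -l live_endpoints.txt -t cves/ -o nuclei_findings.txt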
#!/bin/bash
# Comprehensive reconnaissance

target=$1
if [ -z "$target" ]; then
  echo "Usage: $0 <target-url>"
  exit 1
fi
echo "[*] Starting reconnaissance on $target"

# Crawl
gospider -s "$target" -d 3 -c 100 > crawl_results.txt

# Extract JavaScript files
grep "\.js$" crawl_results.txt > js_files.txt
echo "[+] Found $(wc -l < js_files.txt) JavaScript files"

# Extract API endpoints
grep "/api/" crawl_results.txt > api_endpoints.txt
echo "[+] Found $(wc -l < api_endpoints.txt) API endpoints"

# Extract unique domains
grep -oP 'https?://\K[^/]+' crawl_results.txt | \
  sort -u > discovered_domains.txt
echo "[+] Found $(wc -l < discovered_domains.txt) unique domains"

# Extract URLs with parameters
grep "?" crawl_results.txt > parameterized_urls.txt
echo "[+] Found $(wc -l < parameterized_urls.txt) parameterized URLs"
# Maximum speed (aggressive)
gospider -s https://example.com \
  -d 3 \
  -c 200 \
  -m 5

# Balanced approach
gospider -s https://example.com \
  -d 2 \
  -c 50 \
  -m 10

# Stealth approach (slow, less detection)
gospider -s https://example.com \
  -d 1 \
  -c 10 \
  -m 30
| Depth | Typical URLs | Speed | Memory |
| --- | --- | --- | --- |
| 1 | 50-200 | Very fast | Low |
| 2 | 200-1000 | Fast | Low-Med |
| 3 | 500-3000 | Medium | Medium |
| 4+ | 2000+ | Slow | High |
| Problem | Solution |
| --- | --- |
| Connection refused | Target may be blocking the crawler; try reducing concurrency |
| Timeout errors | Increase the timeout: -m 20 or -m 30 |
| Too few results | Increase depth (-d 3) and check robots.txt |
| Memory issues | Reduce concurrent requests: -c 20 |
| SSL errors | Inspect the certificate manually (e.g. curl -vk); GoSpider has no dedicated insecure flag |
# Verbose output (if available)
gospider -s https://example.com -v

# Test connectivity
curl -I https://example.com

# Check for redirects
curl -L -I https://example.com

# Monitor resource usage
time gospider -s https://example.com -d 2
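If memory is a concern, GNU time gives a quick peak-RSS reading for a crawl (the /usr/bin/time path avoids the shell builtin):

# Report wall time and peak memory for a depth-2 crawl
/usr/bin/time -v gospider -s https://example.com -d 2 -c 20 > /dev/null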
  • Only crawl websites you own or have explicit authorization to test
  • Respect robots.txt and the site's terms of service
  • Use appropriate delays and concurrency levels
  • Identify yourself with an appropriate User-Agent where needed
  • Document all scanning activities
# Respectful crawling approach: limited depth, low concurrency,
# a generous timeout, and an identifiable User-Agent
gospider -s https://example.com \
  -d 2 \
  -c 10 \
  -m 20 \
  -u "Reconnaissance Bot (contact: admin@example.com)"
| Tool | Advantage |
| --- | --- |
| Burp Suite Spider | Commercial, full-featured GUI |
| Scrapy | Python-based, more customizable |
| OWASP ZAP Spider | Free, integrated security testing |
| Wget | Simple, recursive download |
| Curl | Lightweight, script-friendly |
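For comparison, a rough link-discovery pass with Wget looks like this (spider mode follows links without saving pages; the grep is only a loose URL filter):

# Recursive spider with Wget, collecting every URL it touches
wget --spider -r -l 2 https://example.com 2>&1 | grep -oE 'https?://[^ ]+' | sort -u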