
GoSpider

GoSpider is a fast, lightweight web spider written in Go for crawling websites and extracting links. It performs reconnaissance by discovering endpoints, JavaScript files, and other resources referenced in a web application. GoSpider is useful for OSINT gathering, application mapping, and authorized penetration testing.

# Install Go (if not already installed)
sudo apt-get update
sudo apt-get install golang-go git

# Verify Go installation
go version
# Clone and build
git clone https://github.com/jaeles-project/gospider.git
cd gospider
go build -o gospider main.go

# Or use go install
go install github.com/jaeles-project/gospider@latest

# Verify installation
./gospider --help
# or if using go install
gospider --help
# With custom build flags
go build -ldflags "-s -w" -o gospider main.go

# Check binary
file gospider
./gospider -h
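If the binary was installed with go install, it lands in $(go env GOPATH)/bin, which may not be on your PATH yet; a minimal sketch for a bash shell (adjust the rc file for zsh or other shells):

# Add Go's bin directory to PATH (assumes bash; adapt for your shell)
echo 'export PATH="$PATH:$(go env GOPATH)/bin"' >> ~/.bashrc
source ~/.bashrc

# Confirm the binary now resolves
which gospider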
gospider -s <url> [options]
| Option | Description | Example |
| --- | --- | --- |
| -s, --site | Target URL (required) | -s https://example.com |
| --cookie | HTTP cookie string | --cookie "session=abc123" |
| -H, --header | Custom HTTP header (repeatable) | -H "Authorization: Bearer token" |
| -p, --proxy | HTTP proxy URL | -p http://proxy:8080 |
| -o, --output | Output folder | -o results |
| -d, --depth | Crawl depth | -d 3 |
| -m, --timeout | Request timeout (seconds) | -m 10 |
| -c, --concurrent | Max concurrent requests per domain | -c 50 |
| -t, --threads | Sites crawled in parallel | -t 5 |
| -u, --user-agent | Custom User-Agent | -u "Mozilla/5.0..." |
| -k, --delay | Delay between requests (seconds) | -k 1 |
| --blacklist | Exclude URLs matching a regex | --blacklist "logout" |
| -a, --other-source | Include URLs from third-party sources (Archive.org, CommonCrawl, etc.) | -a |
| -q, --quiet | Suppress logs, print only URLs | -q |

Run gospider -h for the full, authoritative flag list.
# Simple crawl of a website
gospider -s https://example.com

# Crawl and save output (-o writes results into a folder, one file per site)
gospider -s https://example.com -o results

# Crawl with specific depth
gospider -s https://example.com -d 2
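GoSpider can also pull URLs from sitemap.xml, robots.txt, and third-party archives; a hedged sketch of those switches (confirm the exact flag names with gospider -h, as they can vary between releases):

# Crawl plus sitemap.xml, robots.txt, third-party sources, and subdomains
gospider -s https://example.com -d 2 --sitemap --robots -a --subs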
# Basic authentication (GoSpider has no dedicated username/password flags;
# send the credentials as an Authorization header instead)
gospider -s https://example.com -H "Authorization: Basic $(printf 'admin:password123' | base64)"

# Cookie-based authentication
gospider -s https://example.com --cookie "PHPSESSID=abc123def456"

# Custom Authorization header
gospider -s https://example.com -H "Authorization: Bearer eyJhbGc..."
# TLS note: GoSpider exposes no dedicated insecure/skip-verify flag;
# its -k flag sets a per-request delay in seconds, not TLS behaviour
gospider -s https://example.com -k 1

# Crawling a host with a self-signed certificate, passing a session cookie
gospider -s https://self-signed.example.com --cookie "sessionid=value"
# Increase concurrent requests for faster crawling
gospider -s https://example.com -c 100 -d 3

# Balance between speed and load
gospider -s https://example.com -c 50 -m 10
# Route through HTTP proxy
gospider -s https://example.com -p http://proxy.company.com:8080

# Proxy with authentication
gospider -s https://example.com -p http://user:pass@proxy:8080

# SOCKS5 proxy
# Note: GoSpider primarily supports HTTP proxies
# Multiple custom headers
gospider -s https://example.com \
  -H "Authorization: Bearer token123" \
  -H "X-Custom-Header: value" \
  -c "session=abc123; lang=en"

# User-Agent spoofing
gospider -s https://example.com \
  --useragent "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
# Exclude certain paths or patterns
gospider -s https://example.com \
  -x "logout,signout,search,download"

# Depth-limited crawl
gospider -s https://example.com -d 1  # Only links from homepage
gospider -s https://example.com -d 2  # One level deep
gospider -s https://example.com -d 5  # Very deep crawl
# Default output to console
gospider -s https://example.com

# Save to file
gospider -s https://example.com > results.txt

# Save with timestamp
gospider -s https://example.com > results_$(date +%Y%m%d_%H%M%S).txt
# Extract unique URLs
gospider -s https://example.com | sort -u > unique_urls.txt

# Count discovered URLs
gospider -s https://example.com | wc -l

# Filter by file type
gospider -s https://example.com | grep -E "\.(js|json|xml|pdf)$"

# Find JavaScript files
gospider -s https://example.com | grep "\.js$" > js_files.txt

# Find API endpoints
gospider -s https://example.com | grep -E "/api/" > api_endpoints.txt
# Extract domain only
gospider -s https://example.com | cut -d'/' -f3 | sort -u

# Extract paths only
gospider -s https://example.com | sed 's|https\?://[^/]*||' | sort -u

# Find parameter patterns
gospider -s https://example.com | grep "?" > parameterized_urls.txt

# Group by file extension
gospider -s https://example.com | \
  awk -F'.' '{print $NF}' | sort | uniq -c | sort -rn
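For scripted post-processing, structured output is often easier than grepping raw lines; a hedged sketch using the --json flag with jq (the "output" field name is an assumption, inspect one raw JSON line first):

# Emit one JSON object per finding and pull out the discovered URL
gospider -s https://example.com --json | jq -r '.output' | sort -u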
#!/usr/bin/env python3
import subprocess
from urllib.parse import urlparse
from collections import defaultdict

def crawl_and_analyze(url):
    """Crawl URL and analyze results"""
    # -q keeps the output to bare URLs (no log prefixes), which simplifies parsing
    cmd = ['gospider', '-s', url, '-q']
    
    try:
        output = subprocess.check_output(cmd, text=True, stderr=subprocess.DEVNULL)
    except subprocess.CalledProcessError as e:
        print(f"Error: {e}")
        return None, []  # keep the return shape consistent for unpacking below
    
    urls = output.strip().split('\n')
    analysis = {
        'total': len(urls),
        'unique': len(set(urls)),
        'by_ext': defaultdict(int),
        'by_path': defaultdict(int),
        'with_params': 0,
        'by_domain': defaultdict(int)
    }
    
    for url_str in urls:
        if not url_str:
            continue
        
        # Parse URL
        parsed = urlparse(url_str)
        
        # Count by extension
        ext = parsed.path.split('.')[-1] if '.' in parsed.path else 'no_ext'
        analysis['by_ext'][ext] += 1
        
        # Count by domain
        analysis['by_domain'][parsed.netloc] += 1
        
        # Count with parameters
        if parsed.query:
            analysis['with_params'] += 1
        
        # Path analysis
        path = parsed.path[:50]  # First 50 chars
        analysis['by_path'][path] += 1
    
    return analysis, urls

def print_report(analysis, urls):
    """Print detailed analysis report"""
    print(f"\n[+] Crawl Results Summary")
    print(f"    Total URLs: {analysis['total']}")
    print(f"    Unique URLs: {analysis['unique']}")
    print(f"    URLs with parameters: {analysis['with_params']}")
    
    print(f"\n[+] Top file types:")
    for ext, count in sorted(analysis['by_ext'].items(), 
                            key=lambda x: x[1], reverse=True)[:10]:
        print(f"    .{ext}: {count}")
    
    print(f"\n[+] Domains found:")
    for domain, count in sorted(analysis['by_domain'].items(),
                               key=lambda x: x[1], reverse=True):
        print(f"    {domain}: {count} URLs")

# Example usage
if __name__ == '__main__':
    target = "https://example.com"
    analysis, urls = crawl_and_analyze(target)
    
    if analysis:
        print_report(analysis, urls)
        
        # Save results
        with open('crawl_results.txt', 'w') as f:
            f.write('\n'.join(urls))
        print(f"\n[+] Results saved to crawl_results.txt")
# Basic crawl of target
gospider -s https://example.com -d 1 > step1_homepage.txt

# Check results
head step1_homepage.txt
# Deeper crawl with more concurrency
gospider -s https://example.com -d 3 -c 50 > step2_deeper.txt

# Find JavaScript files for analysis
grep "\.js$" step2_deeper.txt > javascript_files.txt
# Look for subdomains in crawled URLs
grep -oP '(?:https?://)?(?:\w+\.)*\K[\w-]+(?=\.example\.com)' \
  step2_deeper.txt | sort -u > subdomains.txt

# Crawl each subdomain
for subdomain in $(cat subdomains.txt); do
  echo "[*] Crawling $subdomain.example.com"
  gospider -s https://$subdomain.example.com -d 2 >> all_subdomains.txt
done
# Find API endpoints
grep -E "/api/|/rest/|/v[0-9]/" step2_deeper.txt > api_endpoints.txt

# Find endpoints with parameters
grep "?" step2_deeper.txt > parameterized_endpoints.txt

# Analyze parameter names
grep -oP '[?&]\K[^=&#]+(?==)' step2_deeper.txt | sort -u
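A quick way to see which discovered API endpoints actually respond is to probe them and record status codes; a minimal curl loop (authentication headers omitted):

# Probe each endpoint and record the HTTP status code
grep -oE 'https?://[^ ]+' api_endpoints.txt | sort -u | while read -r url; do
  code=$(curl -sk -o /dev/null -w '%{http_code}' "$url")
  echo "$code $url"
done | sort > api_status.txt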
#!/bin/bash
# Crawl multiple domains

targets=(
  "https://example1.com"
  "https://example2.com"
  "https://example3.com"
)

output_dir="crawl_results"
mkdir -p $output_dir

for target in "${targets[@]}"; do
  echo "[*] Crawling $target"
  domain=$(echo "$target" | cut -d'/' -f3)
  gospider -s "$target" -d 2 -c 50 > "$output_dir/$domain.txt"
done

echo "[+] All crawls complete"
echo "Results in $output_dir/"
# Crawl and discover endpoints, then test with Nuclei
gospider -s https://example.com -d 2 -q > endpoints.txt  # -q prints bare URLs only

# Use endpoints with Nuclei
nuclei -l endpoints.txt -t cves/
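Dead or duplicate entries can optionally be filtered out with ProjectDiscovery's httpx before scanning (assumes httpx is installed; adjust the template path to your local nuclei setup):

# Keep only live URLs, then scan them
cat endpoints.txt | httpx -silent > live_endpoints.txt
nuclei -l live_endpoints.txt -t cves/ -o nuclei_findings.txt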
#!/bin/bash
# Comprehensive reconnaissance

target=$1
if [ -z "$target" ]; then
  echo "Usage: $0 <target-url>"
  exit 1
fi
echo "[*] Starting reconnaissance on $target"

# Crawl
gospider -s "$target" -d 3 -c 100 > crawl_results.txt

# Extract JavaScript files
grep "\.js$" crawl_results.txt > js_files.txt
echo "[+] Found $(wc -l < js_files.txt) JavaScript files"

# Extract API endpoints
grep "/api/" crawl_results.txt > api_endpoints.txt
echo "[+] Found $(wc -l < api_endpoints.txt) API endpoints"

# Extract unique domains
grep -oP 'https?://\K[^/]+' crawl_results.txt | \
  sort -u > discovered_domains.txt
echo "[+] Found $(wc -l < discovered_domains.txt) unique domains"

# Extract URLs with parameters
grep "?" crawl_results.txt > parameterized_urls.txt
echo "[+] Found $(wc -l < parameterized_urls.txt) parameterized URLs"
# Maximum speed (aggressive)
gospider -s https://example.com \
  -d 3 \
  -c 200 \
  -m 5

# Balanced approach
gospider -s https://example.com \
  -d 2 \
  -c 50 \
  -m 10

# Stealth approach (slow, less detection)
gospider -s https://example.com \
  -d 1 \
  -c 10 \
  -m 30
| Depth | Typical URLs | Speed | Memory |
| --- | --- | --- | --- |
| 1 | 50-200 | Very fast | Low |
| 2 | 200-1000 | Fast | Low-Med |
| 3 | 500-3000 | Medium | Medium |
| 4+ | 2000+ | Slow | High |
| Problem | Solution |
| --- | --- |
| Connection refused | Target may be blocking the crawler; try reducing concurrency |
| Timeout errors | Increase the timeout: -m 20 or -m 30 |
| Too few results | Increase depth (-d 3) and check robots.txt |
| Memory issues | Reduce concurrent requests: -c 20 |
| SSL errors | Inspect the certificate manually (e.g. curl -vk); GoSpider has no dedicated insecure flag |
# Verbose output (if available)
gospider -s https://example.com -v

# Test connectivity
curl -I https://example.com

# Check for redirects
curl -L -I https://example.com

# Monitor resource usage
time gospider -s https://example.com -d 2
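If memory is a concern, GNU time gives a quick peak-RSS reading for a crawl (the /usr/bin/time path avoids the shell builtin):

# Report wall time and peak memory for a depth-2 crawl
/usr/bin/time -v gospider -s https://example.com -d 2 -c 20 > /dev/null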
  • Only crawl websites you own or have explicit authorization to test
  • Respect robots.txt and the site's terms of service
  • Use appropriate delays and concurrency levels
  • Identify yourself with an appropriate User-Agent where needed
  • Document all scanning activities
# Respectful crawling approach: limited depth, low concurrency,
# a generous timeout, and an identifiable User-Agent
gospider -s https://example.com \
  -d 2 \
  -c 10 \
  -m 20 \
  -u "Reconnaissance Bot (contact: admin@example.com)"
| Tool | Advantage |
| --- | --- |
| Burp Suite Spider | Commercial, full-featured GUI |
| Scrapy | Python-based, more customizable |
| OWASP ZAP Spider | Free, integrated security testing |
| Wget | Simple, recursive download |
| Curl | Lightweight, script-friendly |
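For comparison, a rough link-discovery pass with Wget looks like this (spider mode follows links without saving pages; the grep is only a loose URL filter):

# Recursive spider with Wget, collecting every URL it touches
wget --spider -r -l 2 https://example.com 2>&1 | grep -oE 'https?://[^ ]+' | sort -u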