HTTrack

HTTrack Website Copier is a free, portable utility that downloads entire websites to your computer, creating a complete offline mirror. It’s invaluable during security assessments for analyzing web applications, discovering hidden directories and files, identifying server configurations, and understanding application architecture. HTTrack is available in Kali Linux and supports multiple platforms.

The tool handles cookies and HTTP authentication, parses HTML and JavaScript for links, and can filter content by MIME type, URL pattern, and file extension. It’s particularly useful for analyzing large web applications offline and spotting security misconfigurations. Note that HTTrack parses JavaScript for links but does not execute it.

# Kali Linux (pre-installed)
httrack --version

# Debian/Ubuntu
sudo apt-get install httrack webhttrack

# macOS
brew install httrack

# From source
git clone https://github.com/xroche/httrack
cd httrack
./configure && make
sudo make install
httrack <URLs> [-option] [+<URL_FILTER>] [-<URL_FILTER>] [-O <path>]
| Command | Description |
| --- | --- |
| `httrack http://example.com` | Mirror entire website locally |
| `httrack https://example.com -O ./mirror` | Save to custom directory |
| `httrack http://example.com/path` | Mirror specific path only |
| `httrack --help` | Display help information |
| `httrack --version` | Show version information |
# Mirror a simple website
httrack http://example.com

# Mirror with custom output directory
httrack http://example.com -O ./website_mirror

# Mirror HTTPS website
httrack https://example.com -O ./secure_mirror

# Mirror multiple URLs
httrack http://example.com http://subdomain.example.com -O ./multi_mirror
| Option | Description | Example |
| --- | --- | --- |
| `-rN` | Set recursion (mirror) depth | `httrack -r5 http://example.com` |
| `-mN` | Maximum size of non-HTML files, in bytes | `httrack -m50000 http://example.com` |
| `-cN` | Number of simultaneous connections | `httrack -c8 http://example.com` |
| `-%P` | Extended parsing (find links inside JavaScript) | `httrack -%P http://example.com` |
| Rule/Option | Description | Example |
| --- | --- | --- |
| `+pattern` | Accept URLs matching a scan rule | `httrack http://example.com "+*.html" "+*.css"` |
| `-pattern` | Reject URLs matching a scan rule | `httrack http://example.com "-*.exe" "-*.zip"` |
| `+mime:type` | Accept content by MIME type | `httrack http://example.com "+mime:text/html"` |
| `--spider` | Spider mode (test links, no download) | `httrack --spider http://example.com` |
| Option | Description | Example |
| --- | --- | --- |
| `--update` | Update an existing mirror in place | `httrack --update http://example.com -O ./mirror` |
| `-n` | Also get non-HTML files "near" an HTML file | `httrack -n http://example.com` |
| `-TN` | Connection timeout (seconds) | `httrack -T60 http://example.com` |
| `-F "agent"` | Identify as a specific browser (User-Agent) | `httrack -F "Mozilla/5.0" http://example.com` |
# Provide HTTP Basic credentials in the URL
httrack http://user:password@example.com -O ./authenticated

# Accept and store cookies (saved to cookies.txt)
httrack http://example.com -b1 -O ./with_cookies

# Custom User-Agent
httrack http://example.com -F "Mozilla/5.0" -O ./custom_ua
# Include only specific paths (scan rules are quoted to avoid shell globbing)
httrack http://example.com "+example.com/app/*" "+example.com/api/*" -O ./filtered

# Download only images (reject everything, then re-accept images)
httrack http://example.com "-*" "+*.jpg" "+*.png" -O ./images_only

# Mirror external links one level deep
httrack http://example.com -%e1 -O ./external

# Mirror subdomains via a scan rule
httrack http://example.com "+*.example.com/*" -O ./subdomains
# Deep recursive mirror (depth 10)
httrack -r10 http://example.com -O ./deep_mirror

# Allow large non-HTML files (-m takes a byte limit; 10 MB here)
httrack -m10000000 http://example.com -O ./large_files

# Multiple connections (faster, but harder on the server)
httrack -c16 http://example.com -O ./fast

# Restrict the mirror to common web file types
httrack http://example.com "-*" "+*.html" "+*.htm" "+*.css" "+*.js" "+*.jpg" "+*.gif" "+*.png"
# Mirror target application
httrack https://target-app.com -r8 -O ./target_mirror

# Analyze directory structure
find ./target_mirror -type d | head -20

# Identify file types
find ./target_mirror -type f | sed 's/.*\.//' | sort | uniq -c

# Extract all URLs
grep -roP 'href="[^"]*"' ./target_mirror | cut -d'"' -f2 | sort -u
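Beyond internal URLs, the same mirrored HTML reveals which external hosts the target depends on (CDNs, partners, analytics). A small sketch, using a throwaway sample directory and hypothetical hosts in place of `./target_mirror`:

```shell
# Sample mirrored page standing in for ./target_mirror (hosts are made up).
mkdir -p /tmp/host_demo
cat > /tmp/host_demo/index.html <<'EOF'
<script src="https://cdn.example.net/app.js"></script>
<a href="https://partner.example.org/login">partner</a>
EOF

# Enumerate external hosts referenced by href/src attributes.
grep -rhoP '(href|src)="https?://\K[^/"]+' /tmp/host_demo | sort -u
```

Each unique host prints once; point the grep at the real mirror directory instead of `/tmp/host_demo`.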
# Mirror API documentation
httrack https://api.example.com -r6 -O ./api_mirror

# Extract API endpoints
grep -roP '/(api|v[0-9]+)/?[a-zA-Z0-9/_-]*' ./api_mirror | sort -u

# Find parameter patterns
grep -roP '\?[a-zA-Z0-9_&=]*' ./api_mirror | sort -u
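Raw query strings are noisy; ranking the parameter names by frequency often surfaces the interesting ones first. A sketch over a stand-in directory (swap in `./api_mirror` for real use):

```shell
# Sample mirrored HTML standing in for ./api_mirror.
mkdir -p /tmp/param_demo
cat > /tmp/param_demo/page.html <<'EOF'
<a href="/search?q=test&page=2">next</a>
<a href="/search?q=other&sort=asc">sorted</a>
EOF

# Pull parameter names out of query strings and rank by frequency.
grep -rhoP '[?&]\K[A-Za-z0-9_]+(?==)' /tmp/param_demo | sort | uniq -c | sort -rn
```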
# Mirror entire site
httrack http://example.com -r10 -O ./full_mirror

# Search for config files
find ./full_mirror -name "*.conf" -o -name "*.config" -o -name "*.json" -o -name "*.xml"

# Look for hardcoded credentials (case-insensitive)
grep -ri "password\|apikey\|token\|secret" ./full_mirror

# Extract JavaScript for analysis
find ./full_mirror -name "*.js" -type f | head -20
# Mirror simple website with default settings
httrack http://example.com -O ./mirror_$(date +%Y%m%d)

# Navigate to results
cd mirror_$(date +%Y%m%d)/
ls -la

# Open in browser
firefox index.html
# Mirror with aggressive settings for app discovery
httrack \
  -r10 \
  -m100000 \
  -c16 \
  -T60 \
  http://target.local:8080 \
  -O ./deep_analysis

# Search for interesting files
find ./deep_analysis -type f \( \
  -name "*.js" -o \
  -name "*.json" -o \
  -name "*.xml" -o \
  -name "*.config" \
\) | wc -l
# Mirror API with specific URL patterns (scan rules quoted)
httrack \
  -r6 \
  https://api.example.com \
  "+api.example.com/v1/*" \
  "+api.example.com/v2/*" \
  -O ./api_analysis

# Extract endpoints
grep -roP '/(api|v[0-9]+)/[a-zA-Z0-9/_-]*' ./api_analysis | sort -u > endpoints.txt

# Count discovered endpoints
wc -l endpoints.txt
# Mirror only JavaScript and HTML files
httrack \
  -r8 \
  "-*" \
  "+*.html" \
  "+*.htm" \
  "+*.js" \
  http://example.com \
  -O ./js_analysis

# Total size of the mirrored files
du -sh ./js_analysis
find ./js_analysis -name "*.js" | wc -l
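With the JavaScript collected, quoted absolute paths inside it are good endpoint candidates. A hedged sketch over a sample file (point it at `./js_analysis` in practice; only double-quoted strings are handled here):

```shell
# Sample mirrored script standing in for ./js_analysis.
mkdir -p /tmp/js_demo
cat > /tmp/js_demo/app.js <<'EOF'
fetch("/api/v1/users");
axios.get("/api/v1/orders?id=1");
var legacy = "/cgi-bin/report";
EOF

# Extract double-quoted absolute paths from every .js file.
grep -rhoP '"(/[A-Za-z0-9/_.?=&-]+)"' /tmp/js_demo --include="*.js" \
  | tr -d '"' | sort -u
```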
mirror_example.com/
├── index.html               # Entry page linking into the mirror
├── hts-cache/               # HTTrack cache and transfer metadata
│   ├── new.txt              # Log of fetched URLs
│   ├── new.zip              # Cached page data
│   └── doit.log             # Options used for the mirror
├── hts-log.txt              # Mirror log (errors and warnings)
├── backblue.gif             # HTTrack template image
├── cookies.txt              # Saved cookies
└── example.com/             # Mirrored site content
    ├── index.html
    ├── about/
    ├── contact/
    └── ...
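The `hts-cache/new.txt` transfer log records every URL HTTrack fetched, which makes a quick crawl inventory. A sketch with a fabricated two-line log (real logs carry more fields, so the grep pulls only the URLs):

```shell
# Fabricated transfer log standing in for a real hts-cache/new.txt.
mkdir -p /tmp/mirror_demo/hts-cache
cat > /tmp/mirror_demo/hts-cache/new.txt <<'EOF'
12:00:01	200	4096	http://example.com/index.html
12:00:02	200	1024	http://example.com/app/login.html
EOF

# Pull out every URL the crawler recorded.
grep -oP 'https?://\S+' /tmp/mirror_demo/hts-cache/new.txt | sort -u
```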
# Count total files downloaded
find ./mirror -type f | wc -l

# Find all JavaScript files
find ./mirror -name "*.js" | wc -l

# List largest files
find ./mirror -type f -exec du -h {} + | sort -rh | head -20

# Extract all URLs from HTML
grep -rhoP '(href|src|action)="[^"]*"' ./mirror | cut -d'"' -f2 | sort -u

# Find commented-out code
grep -r "<!--" ./mirror | head -20

# Search for API endpoints in JS
grep -rh "fetch\|XMLHttpRequest\|axios\|jQuery.ajax" ./mirror --include="*.js" | head -20
# Launch graphical interface
webhttrack

# Or via command line
webhttrack http://example.com
  • Point-and-click URL configuration
  • Visual progress monitoring
  • Pause/resume capability
  • Browser-based interface
  • Project management and history
# Start the web interface (serves a local configuration UI)
webhttrack

# Your browser opens automatically; configure URLs, set options, and monitor progress there
# Faster download with more connections
httrack -c16 http://example.com -O ./fast_mirror

# For very large sites with many files
httrack -c32 -r10 -m200000 http://large-site.com -O ./large_mirror
# Limit bandwidth to ~1 MB/s (-A takes bytes per second)
httrack -A1000000 http://example.com

# Shorter timeout for unresponsive servers
httrack -T30 http://slow-server.com
| Issue | Solution |
| --- | --- |
| Connection refused | Check the URL, firewall, or proxy settings |
| Incomplete mirror | Increase recursion depth with `-rN` |
| Large downloads | Limit file size with `-mN` (bytes) or use scan-rule filters |
| Authentication failed | Provide credentials in the URL: `http://user:pass@host` |
| Links inside JavaScript missed | Enable extended parsing with `-%P` |
| Timeout errors | Increase the timeout: `-T120` |
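For stubborn timeouts, the mirror command can also be wrapped in a retry loop. A generic sketch; the `flaky` demo function (which fails twice, then succeeds) stands in for your real `httrack … -T120 …` invocation:

```shell
# Run a command up to N times, pausing between attempts.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $i failed, retrying..." >&2
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# Demo command standing in for httrack: fails twice, then succeeds.
rm -f /tmp/retry_a /tmp/retry_b
flaky() {
  [ -f /tmp/retry_b ] && return 0
  [ -f /tmp/retry_a ] && { touch /tmp/retry_b; return 1; }
  touch /tmp/retry_a
  return 1
}

retry 5 flaky && echo "mirror completed"
```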
# Comprehensive mirror for security analysis
httrack \
  -r10 \
  -m100000 \
  -c16 \
  -T60 \
  -F "Mozilla/5.0 (Windows)" \
  https://target.com \
  -O ./security_assessment

# Archive the mirror
tar -czf target_mirror_$(date +%Y%m%d_%H%M%S).tar.gz ./security_assessment

# Create inventory
find ./security_assessment -type f > inventory.txt
wc -l inventory.txt
# Mirror current version
httrack https://target.com -O ./version_current

# Later, compare with previous
diff -r ./version_previous ./version_current > changes.diff

# Or use find to identify new files
find ./version_current -newer ./version_previous -type f
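`diff -r` works on full trees, but checksum manifests are easier to archive and compare across machines. A sketch with tiny stand-in snapshot directories (replace them with `./version_previous` and `./version_current`):

```shell
# Stand-in snapshot directories for two mirror runs.
mkdir -p /tmp/v_prev /tmp/v_curr
echo "home v1" > /tmp/v_prev/index.html
echo "home v2" > /tmp/v_curr/index.html   # changed between runs
echo "admin"   > /tmp/v_curr/admin.html   # new in the current run

# One sorted "checksum  path" manifest per snapshot.
snapshot() { (cd "$1" && find . -type f -exec sha256sum {} + | sort); }
snapshot /tmp/v_prev > /tmp/prev.sums
snapshot /tmp/v_curr > /tmp/curr.sums

# Manifest lines unique to the current snapshot = added or modified files.
comm -13 /tmp/prev.sums /tmp/curr.sums | awk '{print $2}'
```

Because a modified file gets a new checksum, this flags additions and edits in one pass; deletions show up via `comm -23` instead.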
  • Authorization: Only mirror websites you own or have explicit written permission to test
  • Robots.txt Compliance: HTTrack respects robots.txt by default; override with care
  • Rate Limiting: Use appropriate concurrency settings to avoid DoS-like behavior
  • Copyright: Respect copyright laws; use mirrors for authorized security testing only
  • Confidentiality: Protect downloaded content containing sensitive information
  • wget: Command-line download utility
  • curl: HTTP client for single-file downloads and detailed request analysis
  • Burp Suite: Professional web application security testing
  • OWASP ZAP: Free automated web security scanning
  • grep/find: Content analysis and file discovery