HTTrack

Overview

HTTrack Website Copier is a free, portable utility that downloads entire websites to your computer, creating a complete offline mirror. It’s invaluable during security assessments for analyzing web applications, discovering hidden directories and files, identifying server configurations, and understanding application architecture. HTTrack is available in Kali Linux and supports multiple platforms.

The tool handles cookies and HTTP authentication, parses pages for links (including links embedded in JavaScript, which it does not execute), and can filter content by MIME type, URL pattern, and file extension. It is particularly useful for analyzing large web applications and spotting security misconfigurations.

Installation

# Kali Linux (pre-installed)
httrack --version

# Debian/Ubuntu
sudo apt-get install httrack webhttrack

# macOS
brew install httrack

# From source
git clone --recurse-submodules https://github.com/xroche/httrack
cd httrack
./configure && make
sudo make install

Basic Usage

Command Syntax

httrack <options> <url> [<url2> ...] [-O <folder>]

Simple Website Mirroring

| Command | Description |
| --- | --- |
| httrack http://example.com | Mirror entire website locally |
| httrack https://example.com -O ./mirror | Save to custom directory |
| httrack http://example.com/path | Mirror specific path only |
| httrack --help | Display help information |
| httrack --version | Show version information |

Basic Examples

# Mirror a simple website
httrack http://example.com

# Mirror with custom output directory
httrack http://example.com -O ./website_mirror

# Mirror HTTPS website
httrack https://example.com -O ./secure_mirror

# Mirror multiple URLs
httrack http://example.com http://subdomain.example.com -O ./multi_mirror

Common Options

Mirror Scope and Depth

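| Option | Description | Example |
| --- | --- | --- |
| -rN | Set mirror (recursion) depth to N | httrack -r5 http://example.com |
| -mN | Maximum size of a non-HTML file, in bytes | httrack -m50000000 http://example.com |
| -cN | Number of simultaneous connections | httrack -c8 http://example.com |
| -e | Go everywhere on the web (no same-site restriction; pair with a depth limit). HTTrack has no JavaScript execution level | httrack -e -r2 http://example.com |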
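
HTTrack's built-in help writes numeric options with the value attached to the option letter (-rN, -cN, -mN). As a minimal sketch, the hypothetical helper below (httrack_scope_args is not part of HTTrack) assembles such options and prints the resulting command as a dry run, without contacting any server. The depth, connection, and size values are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print HTTrack scope options with each value
# attached to its option letter (-r5 rather than "-r 5").
httrack_scope_args() {
  depth=$1        # mirror depth
  sockets=$2      # simultaneous connections
  max_bytes=$3    # size limit for non-HTML files, in bytes
  printf '%s %s %s\n' "-r${depth}" "-c${sockets}" "-m${max_bytes}"
}

# Dry run: show the command instead of running it.
echo "httrack http://example.com $(httrack_scope_args 5 4 50000000) -O ./mirror"
```

Printing the command first makes it easy to review the scope settings before pointing the tool at a target.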

File and Content Filtering

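| Option / filter | Description | Example |
| --- | --- | --- |
| +mime:TYPE | Accept a MIME type (scan rule; -A is the transfer-rate limit, not a MIME filter) | httrack http://example.com "+mime:text/html" "+mime:text/css" |
| -PATTERN | Reject URLs matching a pattern (scan rule; -R sets the retry count) | httrack http://example.com "-*.exe" "-*.zip" "-*.iso" |
| -%F | Footer string written into saved HTML pages | httrack http://example.com -%F "Mirrored [from host %s [file %s [at %s]]]" |
| --spider | Spider mode (test links, nothing saved) | httrack --spider http://example.com |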

Behavior and Performance

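| Option | Description | Example |
| --- | --- | --- |
| -NN | Local structure type (-N0, the default, keeps the site's own layout) | httrack -N0 http://example.com |
| -n | Also fetch non-HTML files "near" a page, such as images hosted elsewhere | httrack -n http://example.com |
| -TN | Connection timeout (seconds) | httrack -T60 http://example.com |
| -F | User-Agent string sent with requests | httrack -F "Mozilla/5.0" http://example.com |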

Advanced Options

Authentication and Cookies

# Provide authentication
httrack http://user:password@example.com -O ./authenticated

# Cookie handling (-b1 accepts cookies, which is the default; -b0 refuses them)
httrack http://example.com -b1 -O ./with_cookies

# Custom User-Agent
httrack http://example.com -F "Mozilla/5.0" -O ./custom_ua

URL Filtering

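# Include only specific paths (scan rules: reject everything, then allow; quoted so the shell does not expand them)
httrack http://example.com/ "-*" "+example.com/app/*" "+example.com/api/*" -O ./filtered

# Keep only pages and images (pages stay allowed so the crawl can follow links)
httrack http://example.com/ "-*" "+example.com/*.html" "+example.com/*.jpg" "+example.com/*.png" -O ./images_only

# Follow external links one level deep
httrack http://example.com -%e1 -O ./external

# Mirror subdomains
httrack http://example.com "+*.example.com/*" -O ./subdomains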
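
HTTrack scan rules are plain arguments that start with + (accept) or - (reject). As a rough sketch, the hypothetical helper below (scan_rules is an illustrative name) prints a reject-all rule followed by one accept rule per file extension, each limited to a single host so that matching files on other sites are not pulled in.

```shell
#!/bin/sh
# Hypothetical helper: emit one scan rule per line for a host and a list
# of file extensions. "-*" rejects everything first; each "+" rule then
# re-allows one extension, limited to the given host.
scan_rules() {
  host=$1
  shift
  printf '%s\n' '-*'
  for ext in "$@"; do
    printf '+%s/*.%s\n' "$host" "$ext"
  done
}

# Print the rules for HTML, CSS, and JavaScript on example.com.
scan_rules example.com html css js
```

Each printed line is meant to be passed to httrack as its own quoted argument, so the shell does not try to expand the wildcards.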

Advanced Mirroring Options

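# Deep recursive mirror (depth 10)
httrack -r10 http://example.com -O ./deep_mirror

# Large file limit (-m takes bytes; about 500 MB here)
httrack -m500000000 http://example.com -O ./large_files

# More connections (faster; HTTrack's built-in security limits may cap the count)
httrack -c16 http://example.com -O ./fast

# Skip applets, Flash, and other binaries by allowing only the listed file types
httrack http://example.com/ "-*" "+example.com/*.html" "+example.com/*.htm" "+example.com/*.css" "+example.com/*.js" "+example.com/*.jpg" "+example.com/*.gif" "+example.com/*.png"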

Reconnaissance Workflows

Web Application Architecture Discovery

# Mirror target application
httrack https://target-app.com -r8 -O ./target_mirror

# Analyze directory structure
find ./target_mirror -type d | head -20

# Identify file types
find ./target_mirror -type f | sed 's/.*\.//' | sort | uniq -c

# Extract all URLs (HTTrack stores pages under a folder named after the host)
grep -rhoP 'href="[^"]*"' ./target_mirror/target-app.com | cut -d'"' -f2 | sort -u
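
Extracted link values are easier to triage once grouped by host, since third-party hosts often reveal CDNs, APIs, or staging systems. The following is a minimal sketch that uses only grep -E, sed, tr, and sort; linked_hosts is a hypothetical helper name, and the demo runs against a throwaway directory standing in for a real mirror.

```shell
#!/bin/sh
# List the distinct hosts referenced by absolute http(s) links under a
# directory. Host names are lowercased so duplicates collapse.
linked_hosts() {
  grep -rhoE 'https?://[A-Za-z0-9.-]+' "$1" \
    | sed -E 's#^https?://##' \
    | tr 'A-Z' 'a-z' \
    | sort -u
}

# Self-contained demo on a temporary directory standing in for a mirror.
demo=$(mktemp -d)
cat > "$demo/index.html" <<'EOF'
<a href="https://cdn.example.net/app.js">script</a>
<a href="http://Example.com/about">about</a>
<img src="https://cdn.example.net/logo.png">
EOF
linked_hosts "$demo"
rm -rf "$demo"
```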

API Endpoint Discovery

# Mirror API documentation
httrack https://api.example.com -r6 -O ./api_mirror

# Extract API endpoints (-h drops file names so sort -u can deduplicate)
grep -rhoP '/(api|v[0-9]+)/?[a-zA-Z0-9/_-]*' ./api_mirror/api.example.com | sort -u

# Find parameter patterns
grep -rhoP '\?[a-zA-Z0-9_&=]*' ./api_mirror/api.example.com | sort -u

Configuration and Secrets Discovery

# Mirror entire site
httrack http://example.com -r10 -O ./full_mirror

# Search for config files
find ./full_mirror -name "*.conf" -o -name "*.config" -o -name "*.json" -o -name "*.xml"

# Look for hardcoded credentials
grep -r "password\|apikey\|token\|secret" ./full_mirror/example.com

# Extract JavaScript for analysis
find ./full_mirror -name "*.js" -type f | head -20
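
A keyword search over mirrored JavaScript can return very long lines when files are minified. As a rough sketch, the hypothetical helper below (find_secret_hints is an illustrative name, and the keyword list is an assumption to adapt per engagement) prints file, line number, and a short snippet for each hit, matching case-insensitively.

```shell
#!/bin/sh
# Print file:line:snippet for likely secrets, case-insensitively.
# Each snippet is the keyword plus at most 40 following characters,
# which keeps minified one-line JavaScript files readable.
find_secret_hints() {
  grep -rnoiE '(password|api[_-]?key|token|secret).{0,40}' "$1"
}

# Self-contained demo on a temporary directory standing in for a mirror.
demo=$(mktemp -d)
printf 'var cfg = {apiKey: "abc123", debug: true};\n' > "$demo/app.js"
find_secret_hints "$demo"
rm -rf "$demo"
```

Hits are only leads: each one still needs manual review, since words like token also appear in harmless code.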

Practical Examples

Example 1: Basic Website Mirror

# Mirror simple website with default settings into a dated folder
OUT="./mirror_$(date +%Y%m%d)"
httrack http://example.com -O "$OUT"

# Navigate to results
cd "$OUT"
ls -la

# Open in browser
firefox index.html

Example 2: Deep Application Analysis

# Mirror with aggressive settings for app discovery
# (-m is in bytes; the connection count may be capped by HTTrack's security limits)
httrack \
  -r10 \
  -m100000000 \
  -c16 \
  -T60 \
  http://target.local:8080 \
  -O ./deep_analysis

# Search for interesting files
find ./deep_analysis -type f \( \
  -name "*.js" -o \
  -name "*.json" -o \
  -name "*.xml" -o \
  -name "*.config" \
\) | wc -l
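
A single total says little about what was collected. As a minimal sketch, the hypothetical helper below (ext_inventory is an illustrative name) counts files per extension, most common first, and groups files without an extension separately. It assumes file names contain no newlines.

```shell
#!/bin/sh
# Count files per extension under a directory, most common first.
# Extensions are lowercased; names without a dot are grouped as "(none)".
ext_inventory() {
  find "$1" -type f | awk -F/ '
    {
      name = $NF                      # file name without its directory
      n = split(name, parts, ".")
      ext = (n > 1 && parts[n] != "") ? tolower(parts[n]) : "(none)"
      count[ext]++
    }
    END { for (e in count) print count[e], e }
  ' | LC_ALL=C sort -k1,1nr -k2,2
}

# Self-contained demo on a temporary directory standing in for a mirror.
demo=$(mktemp -d)
: > "$demo/app.js"; : > "$demo/vendor.js"; : > "$demo/index.html"; : > "$demo/README"
ext_inventory "$demo"
rm -rf "$demo"
```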

Example 3: API-Focused Mirror

# Mirror API with specific patterns (scan rules are quoted so the shell does not expand them)
httrack \
  -r6 \
  https://api.example.com/ \
  "+api.example.com/v1/*" \
  "+api.example.com/v2/*" \
  -O ./api_analysis

# Extract endpoints (-h drops file names so sort -u can deduplicate)
grep -rhoP '/(api|v[0-9]+)/[a-zA-Z0-9/_-]*' ./api_analysis/api.example.com | sort -u > endpoints.txt

# Count discovered endpoints
wc -l endpoints.txt

Example 4: Selective Content Mirror

# Mirror only JavaScript and HTML files
httrack \
  -r8 \
  http://example.com/ \
  "-*" \
  "+example.com/*.html" \
  "+example.com/*.htm" \
  "+example.com/*.js" \
  -O ./js_analysis

# Check total mirror size and count JavaScript files
du -sh ./js_analysis
find ./js_analysis -name "*.js" | wc -l

Output and Analysis

Directory Structure

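mirror_example.com/
├── index.html                # Top-level index linking to each mirrored site
├── backblue.gif              # Images used by the index page
├── fade.gif
├── cookies.txt               # Saved cookies (present when the site set any)
├── hts-log.txt               # Crawl log: errors, warnings, summary
├── hts-cache/                # HTTrack cache, used when updating or continuing a mirror
│   ├── new.zip               # Cached responses
│   ├── new.txt               # Per-URL transfer log
│   └── doit.log              # Command line used for the mirror
└── example.com/              # One folder per mirrored host
    ├── index.html
    ├── about/
    ├── contact/
    └── ...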

Useful Analysis Commands

# Count total files downloaded
find ./mirror -type f | wc -l

# Find all JavaScript files
find ./mirror -name "*.js" | wc -l

# List largest files
find ./mirror -type f -exec du -h {} + | sort -rh | head -20

# Extract all URLs from HTML
grep -roP '(href|src|action)="[^"]*"' ./mirror -h | cut -d'"' -f2 | sort -u

# Find HTML comments (these may contain commented-out code)
grep -r --exclude-dir=hts-cache "<!--" ./mirror | head -20

# Search for API endpoints in JS
grep -r "fetch\|XMLHttpRequest\|axios\|jQuery.ajax" ./mirror -h | head -20

HTTrack GUI (WebHTTrack)

Graphical Interface

# Launch graphical interface
webhttrack

# Or via command line
webhttrack http://example.com

GUI Features

  • Point-and-click URL configuration
  • Visual progress monitoring
  • Pause/resume capability
  • Browser-based interface
  • Project management and history

GUI Usage

# Start web interface (port 8080)
webhttrack

# Access at http://localhost:8080
# Configure URLs, options, and monitor progress

Performance Optimization

Multi-Connection Mirroring

# Faster download with more connections (the count may be capped by HTTrack's security limits)
httrack -c16 http://example.com -O ./fast_mirror

# For very large sites with large files (-m is in bytes; about 200 MB here)
httrack -c32 -r10 -m200000000 http://large-site.com -O ./large_mirror

Bandwidth Control

# Limit bandwidth to about 1 MB/s (the rate is given in bytes per second)
httrack --max-rate=1000000 http://example.com

# Shorter timeout for unresponsive servers
httrack -T30 http://slow-server.com
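
HTTrack's help documents the transfer rate in bytes per second and the file-size limit in bytes, so it is easy to be off by a factor of 1000. As a small sketch, the hypothetical helpers below (kb_to_bytes and mb_to_bytes are illustrative names, using decimal units) convert friendlier units and print a dry-run command.

```shell
#!/bin/sh
# Hypothetical helpers: convert human-friendly units to the byte values
# that HTTrack's size and rate options are documented to take.
kb_to_bytes() { echo $(( $1 * 1000 )); }
mb_to_bytes() { echo $(( $1 * 1000 * 1000 )); }

# Dry run: print a command limited to 1 MB/s with a 50 MB per-file limit.
echo "httrack http://example.com --max-rate=$(mb_to_bytes 1) -m$(mb_to_bytes 50) -O ./throttled"
```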

Troubleshooting

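| Issue | Solution |
| --- | --- |
| Connection refused | Check URL, firewall, or proxy settings |
| Incomplete mirror | Increase recursion depth with -rN |
| Large downloads | Set a size limit with -mN (bytes) or use file filters |
| Authentication failed | Provide credentials in URL: http://user:pass@host |
| JavaScript-generated content missing | HTTrack does not execute JavaScript; try extended link parsing with -%P, or capture rendered pages with a browser-based tool |
| Timeout errors | Increase timeout: -T120 |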

Advanced Reconnaissance

Full Application Security Testing

# Comprehensive mirror for security analysis
# (-m is in bytes; -F sets the User-Agent)
httrack \
  -r10 \
  -m100000000 \
  -c16 \
  -T60 \
  -F "Mozilla/5.0 (Windows)" \
  https://target.com \
  -O ./security_assessment

# Archive the mirror
tar -czf target_mirror_$(date +%Y%m%d_%H%M%S).tar.gz ./security_assessment

# Create inventory
find ./security_assessment -type f > inventory.txt
wc -l inventory.txt

Comparing Two Website Versions

# Mirror current version
httrack https://target.com -O ./version_current

# Later, compare with previous
diff -r ./version_previous ./version_current > changes.diff

# Or use find to identify new files
find ./version_current -newer ./version_previous -type f
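
A recursive diff also descends into HTTrack's hts-cache folder, which changes on every run, and a timestamp comparison depends on when each mirror was written. As a rough sketch, the hypothetical helper below (new_files is an illustrative name) compares relative file paths instead and skips the cache folder. It uses temporary files so it runs under plain POSIX sh.

```shell
#!/bin/sh
# List files present in the new mirror but not in the old one, by
# relative path, ignoring HTTrack's hts-cache folder.
new_files() {
  old_dir=$1
  new_dir=$2
  old_list=$(mktemp)
  new_list=$(mktemp)
  (cd "$old_dir" && find . -path ./hts-cache -prune -o -type f -print | LC_ALL=C sort) > "$old_list"
  (cd "$new_dir" && find . -path ./hts-cache -prune -o -type f -print | LC_ALL=C sort) > "$new_list"
  # comm -13 keeps only the lines unique to the second (new) list.
  LC_ALL=C comm -13 "$old_list" "$new_list"
  rm -f "$old_list" "$new_list"
}

# Self-contained demo with two temporary directories standing in for mirrors.
old=$(mktemp -d); new=$(mktemp -d)
: > "$old/index.html"
: > "$new/index.html"; : > "$new/admin.html"
new_files "$old" "$new"    # prints ./admin.html
rm -rf "$old" "$new"
```
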

Legal and Ethical Considerations

  • Authorization: Only mirror websites you own or have explicit written permission to test
  • Robots.txt Compliance: HTTrack respects robots.txt by default; override with care
  • Rate Limiting: Use appropriate concurrency settings to avoid DoS-like behavior
  • Copyright: Respect copyright laws; use mirrors for authorized security testing only
  • Confidentiality: Protect downloaded content containing sensitive information

Related Tools

  • wget: Command-line download utility
  • curl: HTTP client for single-file downloads and detailed analysis
  • Burp Suite: Professional web application security testing
  • OWASP ZAP: Free automated web security scanning
  • grep/find: Content analysis and file discovery