Katana Web Crawler Cheat Sheet
Overview
Katana is a fast, customizable web crawling framework developed by Project Discovery. It is designed to crawl websites efficiently, gather information, and discover endpoints, and it stands out from other web crawlers for its speed, flexibility, and focus on security testing use cases.
What makes Katana unique is its ability to intelligently crawl modern web applications, including single-page applications (SPAs) that rely heavily on JavaScript. It can handle complex web technologies and extract valuable information such as URLs, JavaScript files, API endpoints, and other web assets. Katana is built with security professionals in mind, making it an excellent tool for reconnaissance during security assessments and bug bounty hunting.
Katana supports various crawling strategies, including standard crawling, JavaScript parsing, and sitemap-based crawling. It can be customized to focus on specific types of resources or follow particular patterns, making it adaptable to different security testing scenarios. The tool is designed to be easily integrated into security testing workflows and can be combined with other Project Discovery tools for comprehensive reconnaissance.
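A typical workflow chains Katana with other Project Discovery tools. The pipeline below is an illustrative sketch: it assumes subfinder, httpx, and nuclei are installed, and example.com is a placeholder for a target you are authorized to test.
bash
# Illustrative recon pipeline: enumerate subdomains, probe for live
# hosts, crawl them, and scan the discovered URLs for exposures
subfinder -d example.com -silent | httpx -silent | katana -silent | nuclei -t exposures/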
Installation
Using Go
bash
# Install using Go (requires Go 1.20 or later)
go install -v github.com/projectdiscovery/katana/cmd/katana@latest
# Verify installation
katana -version
Using Docker
bash
# Pull the latest Docker image
docker pull projectdiscovery/katana:latest
# Run Katana using Docker
docker run -it projectdiscovery/katana:latest -h
Using Homebrew (macOS)
bash
# Install using Homebrew
brew install katana
# Verify installation
katana -version
Using PDTM (Project Discovery Tools Manager)
bash
# Install PDTM first if not already installed
go install -v github.com/projectdiscovery/pdtm/cmd/pdtm@latest
# Install Katana using PDTM
pdtm -i katana
# Verify installation
katana -version
On Kali Linux
bash
# Install using apt
sudo apt install katana
# Verify installation
katana -version
Basic Usage
Crawling a Single URL
bash
# Crawl a single URL
katana -u https://example.com
# Crawl with increased verbosity
katana -u https://example.com -v
# Crawl with debug information
katana -u https://example.com -debug
Crawling Multiple URLs
bash
# Crawl multiple URLs
katana -u https://example.com,https://test.com
# Crawl from a list of URLs
katana -list urls.txt
# Crawl from STDIN
cat urls.txt | katana
Output Options
bash
# Save results to a file
katana -u https://example.com -o results.txt
# Output in JSON format
katana -u https://example.com -json -o results.json
# Silent mode (only URLs)
katana -u https://example.com -silent
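With -json, each result is emitted as one JSON object per line, which makes post-processing with jq straightforward. The snippet below is a sketch that assumes the crawled endpoint is exposed under .request.endpoint; the exact schema may vary between Katana versions.
bash
# Extract endpoints from JSON output (field path assumed)
katana -u https://example.com -json -silent | jq -r '.request.endpoint'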
Crawling Options
Crawling Depth and Scope
bash
# Set crawling depth (default: 2)
katana -u https://example.com -depth 3
# Include subdomains in the crawl (off by default)
katana -u https://example.com -crawl-scope subs
# Allow crawling out-of-scope URLs (off by default)
katana -u https://example.com -crawl-scope out-of-scope
# Restrict crawling to the exact domain provided
katana -u https://example.com -crawl-scope strict
Crawling Strategies
bash
# Use standard crawler
katana -u https://example.com -crawler standard
# Use JavaScript parser
katana -u https://example.com -crawler js
# Use sitemap-based crawler
katana -u https://example.com -crawler sitemap
# Use robots.txt-based crawler
katana -u https://example.com -crawler robots
# Use all crawlers
katana -u https://example.com -crawler standard,js,sitemap,robots
Field Selection
bash
# Display specific fields
katana -u https://example.com -field url,path,method
# Available fields: url, path, method, host, fqdn, scheme, port, query, fragment, endpoint
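Field output combines naturally with standard shell tools. For example, a quick way to list every unique path discovered during a crawl:
bash
# De-duplicate discovered paths
katana -u https://example.com -silent -field path | sort -u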
Advanced Usage
URL Filtering
bash
# Match URLs by regex
katana -u https://example.com -match-regex "admin|login|dashboard"
# Filter URLs by regex
katana -u https://example.com -filter-regex "logout|static|images"
# Match URLs by condition
katana -u https://example.com -field url -match-condition "contains(url, 'admin')"
Resource Filtering
bash
# Include specific file extensions
katana -u https://example.com -extension js,php,aspx
# Exclude specific file extensions
katana -u https://example.com -exclude-extension png,jpg,gif
# Include specific MIME types
katana -u https://example.com -mime-type application/json,text/html
Form Filling
bash
# Enable automatic form filling
katana -u https://example.com -form-fill
# Use custom form values
katana -u https://example.com -form-fill -field-name "username=admin&password=admin"
JavaScript Parsing
bash
# Enable JavaScript parsing
katana -u https://example.com -js-crawl
# Set headless browser timeout
katana -u https://example.com -js-crawl -headless-timeout 20
# Set browser path
katana -u https://example.com -js-crawl -chrome-path /path/to/chrome
Performance Optimization
Concurrency and Rate Limiting
bash
# Set concurrency (default: 10)
katana -u https://example.com -concurrency 20
# Set delay between requests (milliseconds)
katana -u https://example.com -delay 100
# Set rate limit (requests per second)
katana -u https://example.com -rate-limit 50
Timeout Options
bash
# Set timeout for HTTP requests (seconds)
katana -u https://example.com -timeout 10
# Set timeout for headless browser (seconds)
katana -u https://example.com -js-crawl -headless-timeout 30
Optimization for Large Scans
bash
# Disable automatic form filling for faster crawling
katana -u https://example.com -no-form-fill
# Disable JavaScript parsing for faster crawling
katana -u https://example.com -no-js-crawl
# Limit maximum URLs to crawl
katana -u https://example.com -max-urls 1000
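These options can be combined into a single profile for large targets. The values below are illustrative starting points using the flags introduced above, not tuned recommendations:
bash
# Example large-scan profile: shallow depth, capped URL count,
# moderate concurrency, and a rate limit to avoid hammering the target
katana -u https://example.com -depth 2 -max-urls 1000 -concurrency 15 -rate-limit 50 -no-js-crawl -o large-scan.txt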
Integration with Other Tools
Pipeline with Subfinder
bash
# Find subdomains and crawl them
subfinder -d example.com -silent | katana -silent
# Find subdomains, crawl them, and extract JavaScript files
subfinder -d example.com -silent | katana -silent -extension js
Pipeline with HTTPX
bash
# Probe URLs and crawl active ones
httpx -l urls.txt -silent | katana -silent
# Crawl and then probe discovered endpoints
katana -u https://example.com -silent | httpx -silent
Pipeline with Nuclei
bash
# Crawl and scan for vulnerabilities
katana -u https://example.com -silent | nuclei -t cves/
# Crawl, extract JavaScript files, and scan for vulnerabilities
katana -u https://example.com -silent -extension js | nuclei -t exposures/
Output Customization
Custom Output Format
bash
# Output only URLs
katana -u https://example.com -silent
# Output URLs with specific fields
katana -u https://example.com -field url,path,method -o results.txt
# Count discovered URLs
katana -u https://example.com -silent | wc -l
# Sort output alphabetically
katana -u https://example.com -silent | sort
Filtering Output
bash
# Filter by file extension
katana -u https://example.com -silent | grep "\.js$"
# Filter by endpoint pattern
katana -u https://example.com -silent | grep "/api/"
# Find unique domains
katana -u https://example.com -silent | awk -F/ '{print $3}' | sort -u
Advanced Filtering
URL Pattern Matching
bash
# Match specific URL patterns
katana -u https://example.com -match-regex "^https://example.com/admin"
# Filter out specific URL patterns
katana -u https://example.com -filter-regex "^https://example.com/static"
# Match URLs containing specific query parameters
katana -u https://example.com -match-regex "id=[0-9]+"
Content Filtering
bash
# Match responses containing specific content
katana -u https://example.com -match-condition "contains(body, 'admin')"
# Filter responses by status code
katana -u https://example.com -match-condition "status == 200"
# Match responses by content type
katana -u https://example.com -match-condition "contains(content_type, 'application/json')"
Proxy and Network Options
bash
# Use HTTP proxy
katana -u https://example.com -proxy http://127.0.0.1:8080
# Use SOCKS5 proxy
katana -u https://example.com -proxy socks5://127.0.0.1:1080
# Set custom headers
katana -u https://example.com -header "User-Agent: Mozilla/5.0" -header "Cookie: session=123456"
# Set custom cookies
katana -u https://example.com -cookie "session=123456; user=admin"
Miscellaneous Features
Crawling Specific Paths
bash
# Crawl specific paths
katana -u https://example.com -paths /admin,/login,/dashboard
# Crawl from a file containing paths
katana -u https://example.com -paths-file paths.txt
Storing Responses
bash
# Store all responses
katana -u https://example.com -store-response
# Specify response storage directory
katana -u https://example.com -store-response -store-response-dir responses/
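Stored responses can then be searched offline. A minimal sketch, assuming responses are saved as plain text under responses/:
bash
# Grep stored responses for potentially sensitive strings
grep -rniE 'api[_-]?key|secret|token' responses/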
Troubleshooting
Common Issues
JavaScript Parsing Issues
bash
# Increase headless browser timeout
katana -u https://example.com -js-crawl -headless-timeout 30
# Specify Chrome path manually
katana -u https://example.com -js-crawl -chrome-path /usr/bin/google-chrome
Rate Limiting by Target
bash
# Reduce concurrency
katana -u https://example.com -concurrency 5
# Add delay between requests
katana -u https://example.com -delay 500
Memory Issues
bash
# Limit maximum URLs to crawl
katana -u https://example.com -max-urls 500
# Disable JavaScript parsing
katana -u https://example.com -no-js-crawl
Crawling Scope Issues
bash
# Restrict crawling to the exact domain provided
katana -u https://example.com -crawl-scope strict
# Allow crawling subdomains
katana -u https://example.com -crawl-scope subs
Debugging
bash
# Enable verbose mode
katana -u https://example.com -v
# Show debug information
katana -u https://example.com -debug
# Show request and response details
katana -u https://example.com -debug -show-request -show-response
Configuration
Configuration File
Katana uses a configuration file located at $HOME/.config/katana/config.yaml. You can customize various settings in this file:
yaml
# Example configuration file
concurrency: 10
delay: 100
timeout: 10
max-depth: 3
crawl-scope: strict
crawl-duration: 0
field: url,path,method
extensions: js,php,aspx
Environment Variables
bash
# Set Katana configuration via environment variables
export KATANA_CONCURRENCY=10
export KATANA_DELAY=100
export KATANA_TIMEOUT=10
export KATANA_MAX_DEPTH=3
Reference
Command Line Options
Flag | Description |
---|---|
-u, -url | Target URL to crawl |
-list, -l | File containing list of URLs to crawl |
-o, -output | File to write output to |
-json | Write output in JSON format |
-silent | Show only URLs in output |
-v, -verbose | Show verbose output |
-depth | Maximum depth to crawl (default: 2) |
-crawl-scope | Crawling scope (strict, subs, out-of-scope) |
-crawler | Crawler types to use (standard, js, sitemap, robots) |
-field | Fields to display in output |
-extension | File extensions to include |
-exclude-extension | File extensions to exclude |
-match-regex | Regex pattern to match URLs |
-filter-regex | Regex pattern to filter URLs |
-match-condition | Condition to match URLs |
-form-fill | Enable automatic form filling |
-js-crawl | Enable JavaScript parsing |
-headless-timeout | Timeout for headless browser (seconds) |
-chrome-path | Path to Chrome browser |
-concurrency | Number of concurrent requests |
-delay | Delay between requests (milliseconds) |
-rate-limit | Maximum number of requests per second |
-timeout | Timeout for HTTP requests (seconds) |
-max-urls | Maximum number of URLs to crawl |
-proxy | HTTP/SOCKS5 proxy to use |
-header | Custom header to add to all requests |
-cookie | Custom cookies to add to all requests |
-paths | Specific paths to crawl |
-paths-file | File containing paths to crawl |
-store-response | Store all responses |
-store-response-dir | Directory to store responses |
-version | Show Katana version |
Crawling Scopes
Scope | Description |
---|---|
strict | Crawl only the exact domain provided |
subs | Crawl the domain and its subdomains |
out-of-scope | Crawl any domain, regardless of the initial domain |
Crawler Types
Type | Description |
---|---|
standard | Standard HTTP crawler |
js | JavaScript parser using headless browser |
sitemap | Sitemap-based crawler |
robots | Robots.txt-based crawler |
Field Options
Field | Description |
---|---|
url | Full URL |
path | URL path |
method | HTTP method |
host | Host part of URL |
fqdn | Fully qualified domain name |
scheme | URL scheme (http/https) |
port | URL port |
query | Query parameters |
fragment | URL fragment |
endpoint | URL endpoint |
Resources
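Official repository: https://github.com/projectdiscovery/katana
Official documentation: https://docs.projectdiscovery.io/tools/katana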
This cheat sheet provides a comprehensive reference for using Katana, from basic crawling to advanced filtering and integration with other tools. For the most up-to-date information, always refer to the official documentation.