Katana Web Crawler Cheat Sheet

Overview

Katana is a fast and customizable web crawling framework developed by Project Discovery. It's designed to crawl websites efficiently to gather information and discover endpoints. Katana stands out from other web crawlers due to its speed, flexibility, and focus on security testing use cases.

What makes Katana unique is its ability to intelligently crawl modern web applications, including single-page applications (SPAs) that rely heavily on JavaScript. It can handle complex web technologies and extract valuable information such as URLs, JavaScript files, API endpoints, and other web assets. Katana is built with security professionals in mind, making it an excellent tool for reconnaissance during security assessments and bug bounty hunting.

Katana supports various crawling strategies, including standard crawling, JavaScript parsing, and sitemap-based crawling. It can be customized to focus on specific types of resources or follow particular patterns, making it adaptable to different security testing scenarios. The tool is designed to be easily integrated into security testing workflows and can be combined with other Project Discovery tools for comprehensive reconnaissance.
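
As a quick orientation, the sketch below strings together a few of the options covered later in this sheet (depth, JavaScript parsing, and file output); treat the exact flag combination as illustrative rather than a recommended default.

# Illustrative quick start: crawl two levels deep, parse JavaScript, save results
katana -u https://example.com -depth 2 -js-crawl -o results.txt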

Installation

Using Go

# Install using Go (requires Go 1.20 or later)
go install -v github.com/projectdiscovery/katana/cmd/katana@latest

# Verify installation
katana -version

Using Docker

# Pull the latest Docker image
docker pull projectdiscovery/katana:latest

# Run Katana using Docker
docker run -it projectdiscovery/katana:latest -h

Using Homebrew (macOS)

# Install using Homebrew
brew install katana

# Verify installation
katana -version

Using PDTM (Project Discovery Tools Manager)

# Install PDTM first if not already installed
go install -v github.com/projectdiscovery/pdtm/cmd/pdtm@latest

# Install Katana using PDTM
pdtm -i katana

# Verify installation
katana -version

On Kali Linux

# Install using apt
sudo apt install katana

# Verify installation
katana -version

Basic Usage

Crawling a Single URL

# Crawl a single URL
katana -u https://example.com

# Crawl with increased verbosity
katana -u https://example.com -v

# Crawl with debug information
katana -u https://example.com -debug

Crawling Multiple URLs

# Crawl multiple URLs
katana -u https://example.com,https://test.com

# Crawl from a list of URLs
katana -list urls.txt

# Crawl from STDIN
cat urls.txt | katana

Output Options

# Save results to a file
katana -u https://example.com -o results.txt

# Output in JSON format
katana -u https://example.com -json -o results.json

# Silent mode (only URLs)
katana -u https://example.com -silent
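
If you need to post-process the JSON output, a generic jq pass is a safe starting point; the exact field names inside each record vary between Katana versions, so inspect a record before building on it.

# Crawl, save JSON output (assumed to be one record per line), then inspect it with jq
katana -u https://example.com -json -o results.json
jq . results.json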

Crawling Options

Crawling Depth and Scope

# Set crawling depth (default: 3)
katana -u https://example.com -depth 3

# Crawl subdomains (default: false)
katana -u https://example.com -crawl-scope subs

# Crawl out of scope (default: false)
katana -u https://example.com -crawl-scope out-of-scope

# Crawl only in scope
katana -u https://example.com -crawl-scope strict

Crawling Strategies

# Use standard crawler
katana -u https://example.com -crawler standard

# Use JavaScript parser
katana -u https://example.com -crawler js

# Use sitemap-based crawler
katana -u https://example.com -crawler sitemap

# Use robots.txt-based crawler
katana -u https://example.com -crawler robots

# Use all crawlers
katana -u https://example.com -crawler standard,js,sitemap,robots

Field Selection

# Display specific fields
katana -u https://example.com -field url,path,method

# Available fields: url, path, method, host, fqdn, scheme, port, query, fragment, endpoint
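
A common follow-on is to pull a single field and deduplicate it with standard shell tools, for example to get a quick list of distinct paths:

# List distinct paths discovered during the crawl
katana -u https://example.com -field path -silent | sort -u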

Advanced Usage

URL Filtering

# Match URLs by regex
katana -u https://example.com -match-regex "admin|login|dashboard"

# Filter URLs by regex
katana -u https://example.com -filter-regex "logout|static|images"

# Match URLs by condition
katana -u https://example.com -field url -match-condition "contains('admin')"

Resource Filtering

# Include specific file extensions
katana -u https://example.com -extension js,php,aspx

# Exclude specific file extensions
katana -u https://example.com -exclude-extension png,jpg,gif

# Include specific MIME types
katana -u https://example.com -mime-type application/json,text/html
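
These filters combine naturally with the output options shown earlier; for example, collecting only JavaScript files for later review:

# Save only discovered JavaScript files
katana -u https://example.com -extension js -silent -o js-files.txt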

Form Filling

# Enable automatic form filling
katana -u https://example.com -form-fill

# Use custom form values
katana -u https://example.com -form-fill -field-name "username=admin&password=admin"

JavaScript Parsing

# Enable JavaScript parsing
katana -u https://example.com -js-crawl

# Set headless browser timeout
katana -u https://example.com -js-crawl -headless-timeout 20

# Set browser path
katana -u https://example.com -js-crawl -chrome-path /path/to/chrome
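
JavaScript parsing is slower than standard crawling, so it is often paired with a modest depth and saved output, roughly like this:

# JavaScript-aware crawl at depth 2, with results saved for later analysis
katana -u https://example.com -js-crawl -depth 2 -o js-crawl-results.txt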

Performance Optimization

Concurrency and Rate Limiting

# Set concurrency (default: 10)
katana -u https://example.com -concurrency 20

# Set delay between requests (milliseconds)
katana -u https://example.com -delay 100

# Set rate limit (requests per second)
katana -u https://example.com -rate-limit 50

Timeout Options

# Set timeout for HTTP requests (seconds)
katana -u https://example.com -timeout 10

# Set timeout for headless browser (seconds)
katana -u https://example.com -js-crawl -headless-timeout 30

Optimization for Large Scans

# Disable automatic form filling for faster crawling
katana -u https://example.com -no-form-fill

# Disable JavaScript parsing for faster crawling
katana -u https://example.com -no-js-crawl

# Limit maximum URLs to crawl
katana -u https://example.com -max-urls 1000
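
For large target lists these options are typically combined; the values below are illustrative starting points rather than recommendations.

# Illustrative tuning for a large list of targets
katana -list urls.txt -concurrency 20 -rate-limit 100 -max-urls 1000 -no-js-crawl -silent -o large-scan.txt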

Integration with Other Tools

Pipeline with Subfinder

# Find subdomains and crawl them
subfinder -d example.com -silent | katana -silent

# Find subdomains, crawl them, and extract JavaScript files
subfinder -d example.com -silent | katana -silent -extension js

Pipeline with HTTPX

# Probe URLs and crawl active ones
httpx -l urls.txt -silent | katana -silent

# Crawl and then probe discovered endpoints
katana -u https://example.com -silent | httpx -silent

Pipeline with Nuclei

# Crawl and scan for vulnerabilities
katana -u https://example.com -silent | nuclei -t cves/

# Crawl, extract JavaScript files, and scan for vulnerabilities
katana -u https://example.com -silent -extension js | nuclei -t exposures/
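
The pipelines above can also be chained end to end; a typical (illustrative) reconnaissance flow looks like this:

# Enumerate subdomains, probe for live hosts, crawl them, and scan the results
subfinder -d example.com -silent | httpx -silent | katana -silent | nuclei -t exposures/ -silent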

Output Customization

Custom Output Format

# Output only URLs
katana -u https://example.com -silent

# Output URLs with specific fields
katana -u https://example.com -field url,path,method -o results.txt

# Count discovered URLs
katana -u https://example.com -silent | wc -l

# Sort output alphabetically
katana -u https://example.com -silent | sort

Filtering Output

# Filter by file extension
katana -u https://example.com -silent | grep "\.js$"

# Filter by endpoint pattern
katana -u https://example.com -silent | grep "/api/"

# Find unique domains
katana -u https://example.com -silent | awk -F/ '{print $3}' | sort -u
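
Another handy filter is keeping only parameterized URLs, which are often the most interesting inputs for further testing:

# Keep only URLs that carry query parameters
katana -u https://example.com -silent | grep "?" | sort -u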

Advanced Filtering

URL Pattern Matching

# Match specific URL patterns
katana -u https://example.com -match-regex "^https://example.com/admin"

# Filter out specific URL patterns
katana -u https://example.com -filter-regex "^https://example.com/static"

# Match URLs containing specific query parameters
katana -u https://example.com -match-regex "id=[0-9]+"

Content Filtering

# Match responses containing specific content
katana -u https://example.com -match-condition "contains(body, 'admin')"

# Filter responses by status code
katana -u https://example.com -match-condition "status == 200"

# Match responses by content type
katana -u https://example.com -match-condition "contains(content_type, 'application/json')"

Proxy and Network Options

# Use HTTP proxy
katana -u https://example.com -proxy http://127.0.0.1:8080

# Use SOCKS5 proxy
katana -u https://example.com -proxy socks5://127.0.0.1:1080

# Set custom headers
katana -u https://example.com -header "User-Agent: Mozilla/5.0" -header "Cookie: session=123456"

# Set custom cookies
katana -u https://example.com -cookie "session=123456; user=admin"
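
These options are often combined, for example to crawl an authenticated session while recording traffic in a local intercepting proxy such as Burp (assumed here to be listening on 127.0.0.1:8080):

# Crawl through a local intercepting proxy with a session cookie attached
katana -u https://example.com -proxy http://127.0.0.1:8080 -header "Cookie: session=123456"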

Miscellaneous Features

Automatic Form Filling

# Enable automatic form filling
katana -u https://example.com -form-fill

# Set custom form values
katana -u https://example.com -form-fill -field-name "username=admin&password=admin"

Crawling Specific Paths

# Crawl specific paths
katana -u https://example.com -paths /admin,/login,/dashboard

# Crawl from a file containing paths
katana -u https://example.com -paths-file paths.txt

Storing Responses

# Store all responses
katana -u https://example.com -store-response

# Specify response storage directory
katana -u https://example.com -store-response -store-response-dir responses/
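
Stored responses can then be searched offline with standard tools; for example, a quick (and intentionally noisy) sweep for potentially sensitive keywords:

# Search saved responses for potentially sensitive keywords
grep -rilE "api_key|password|secret" responses/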

Troubleshooting

Common Issues

  1. JavaScript Parsing Issues

# Increase headless browser timeout
katana -u https://example.com -js-crawl -headless-timeout 30

# Specify Chrome path manually
katana -u https://example.com -js-crawl -chrome-path /usr/bin/google-chrome

  2. Rate Limiting by Target

# Reduce concurrency
katana -u https://example.com -concurrency 5

# Add delay between requests
katana -u https://example.com -delay 500

  3. Memory Issues

# Limit maximum URLs to crawl
katana -u https://example.com -max-urls 500

# Disable JavaScript parsing
katana -u https://example.com -no-js-crawl

  4. Crawling Scope Issues

# Restrict crawling to specific domain
katana -u https://example.com -crawl-scope strict

# Allow crawling subdomains
katana -u https://example.com -crawl-scope subs

Debugging

# Enable verbose mode
katana -u https://example.com -v

# Show debug information
katana -u https://example.com -debug

# Show request and response details
katana -u https://example.com -debug -show-request -show-response

Configuration

Configuration File

Katana uses a configuration file located at $HOME/.config/katana/config.yaml. You can customize various settings in this file:

# Example configuration file
concurrency: 10
delay: 100
timeout: 10
max-depth: 3
crawl-scope: strict
crawl-duration: 0
field: url,path,method
extensions: js,php,aspx

Environment Variables

# Set Katana configuration via environment variables
export KATANA_CONCURRENCY=10
export KATANA_DELAY=100
export KATANA_TIMEOUT=10
export KATANA_MAX_DEPTH=3

Reference

Command Line Options

Flag Description
-u, -url Target URL to crawl
-list, -l File containing list of URLs to crawl
-o, -output File to write output to
-json Write output in JSON format
-silent Show only URLs in output
-v, -verbose Show verbose output
-depth Maximum depth to crawl (default: 3)
-crawl-scope Crawling scope (strict, subs, out-of-scope)
-crawler Crawler types to use (standard, js, sitemap, robots)
-field Fields to display in output
-extension File extensions to include
-exclude-extension File extensions to exclude
-match-regex Regex pattern to match URLs
-filter-regex Regex pattern to filter URLs
-match-condition Condition to match URLs
-form-fill Enable automatic form filling
-js-crawl Enable JavaScript parsing
-headless-timeout Timeout for headless browser (seconds)
-chrome-path Path to Chrome browser
-concurrency Number of concurrent requests
-delay Delay between requests (milliseconds)
-rate-limit Maximum number of requests per second
-timeout Timeout for HTTP requests (seconds)
-max-urls Maximum number of URLs to crawl
-proxy HTTP/SOCKS5 proxy to use
-header Custom header to add to all requests
-cookie Custom cookies to add to all requests
-paths Specific paths to crawl
-paths-file File containing paths to crawl
-store-response Store all responses
-store-response-dir Directory to store responses
-version Show Katana version

Crawling Scopes

Scope Description
strict Crawl only the exact domain provided
subs Crawl the domain and its subdomains
out-of-scope Crawl any domain, regardless of the initial domain

Crawler Types

Type Description
standard Standard HTTP crawler
js JavaScript parser using headless browser
sitemap Sitemap-based crawler
robots Robots.txt-based crawler

Field Options

Field Description
url Full URL
path URL path
method HTTP method
host Host part of URL
fqdn Fully qualified domain name
scheme URL scheme (http/https)
port URL port
query Query parameters
fragment URL fragment
endpoint URL endpoint

Resources

Katana GitHub repository: https://github.com/projectdiscovery/katana
Project Discovery documentation: https://docs.projectdiscovery.io

This cheat sheet provides a comprehensive reference for using Katana, from basic crawling to advanced filtering and integration with other tools. For the most up-to-date information, always refer to the official documentation.