
Katana Web Crawler Cheat Sheet

Overview

Katana is a fast and customizable web crawling framework developed by Project Discovery. It's designed to crawl websites efficiently to gather information and discover endpoints. Katana stands out from other web crawlers due to its speed, flexibility, and focus on security testing use cases.

What makes Katana unique is its ability to intelligently crawl modern web applications, including single-page applications (SPAs) that rely heavily on JavaScript. It can handle complex web technologies and extract valuable information such as URLs, JavaScript files, API endpoints, and other web assets. Katana is built with security professionals in mind, making it an excellent tool for reconnaissance during security assessments and bug bounty hunting.

Katana supports various crawling strategies, including standard crawling, JavaScript parsing, and sitemap-based crawling. It can be customized to focus on specific types of resources or follow particular patterns, making it adaptable to different security testing scenarios. The tool is designed to be easily integrated into security testing workflows and can be combined with other Project Discovery tools for comprehensive reconnaissance.
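
As a quick illustration of that workflow, here is a minimal reconnaissance pipeline chaining Katana with other Project Discovery tools (assuming subfinder, httpx, and nuclei are installed; each stage is covered in detail later in this sheet):

bash
# Enumerate subdomains, probe for live hosts, crawl them, and scan
# the discovered endpoints
subfinder -d example.com -silent \
  | httpx -silent \
  | katana -silent \
  | nuclei -t exposures/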

Installation

Using Go

bash
# Install using Go (requires Go 1.20 or later)
go install -v github.com/projectdiscovery/katana/cmd/katana@latest

# Verify installation
katana -version

Using Docker

bash
# Pull the latest Docker image
docker pull projectdiscovery/katana:latest

# Run Katana using Docker
docker run -it projectdiscovery/katana:latest -h

Using Homebrew (macOS)

bash
# Install using Homebrew
brew install katana

# Verify installation
katana -version

Using PDTM (Project Discovery Tools Manager)

bash
# Install PDTM first if not already installed
go install -v github.com/projectdiscovery/pdtm/cmd/pdtm@latest

# Install Katana using PDTM
pdtm -i katana

# Verify installation
katana -version

On Kali Linux

bash
# Install using apt
sudo apt install katana

# Verify installation
katana -version

Basic Usage

Crawling a Single URL

bash
# Crawl a single URL
katana -u https://example.com

# Crawl with increased verbosity
katana -u https://example.com -v

# Crawl with debug information
katana -u https://example.com -debug

Crawling Multiple URLs

bash
# Crawl multiple URLs
katana -u https://example.com,https://test.com

# Crawl from a list of URLs
katana -list urls.txt

# Crawl from STDIN
cat urls.txt | katana

Output Options

bash
# Save results to a file
katana -u https://example.com -o results.txt

# Output in JSONL format
katana -u https://example.com -jsonl -o results.jsonl

# Silent mode (only URLs)
katana -u https://example.com -silent
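
JSONL output pairs well with jq for post-processing. A minimal sketch, assuming the default JSONL schema in which each line carries a request object with an endpoint field:

bash
# Extract only the discovered endpoints from JSONL output
katana -u https://example.com -jsonl -silent | jq -r '.request.endpoint'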

Crawling Options

Crawling Depth and Scope

bash
# Set crawling depth (default: 3)
katana -u https://example.com -depth 3

# Scope crawling to the root domain and its subdomains (default)
katana -u https://example.com -field-scope rdn

# Scope crawling to the exact FQDN only
katana -u https://example.com -field-scope fqdn

# Control scope with custom regexes
katana -u https://example.com -crawl-scope "admin" -crawl-out-scope "logout"

Crawling Strategies

bash
# Standard crawling (raw HTTP requests) is the default mode
katana -u https://example.com

# Parse endpoints inside JavaScript files
katana -u https://example.com -js-crawl

# Crawl known files such as robots.txt and sitemap.xml
katana -u https://example.com -known-files all

# Use headless hybrid crawling for JavaScript-heavy applications
katana -u https://example.com -headless

Field Selection

bash
# Display specific fields
katana -u https://example.com -field url,path,fqdn

# Available fields: url, path, fqdn, rdn, rurl, qurl, qpath, file, ufile, key, value, kv, dir, udir

Advanced Usage

URL Filtering

bash
# Match URLs by regex
katana -u https://example.com -match-regex "admin|login|dashboard"

# Filter URLs by regex
katana -u https://example.com -filter-regex "logout|static|images"

# Match responses with a DSL-based condition
katana -u https://example.com -match-condition 'status_code == 200'

Resource Filtering

bash
# Match output for specific file extensions
katana -u https://example.com -extension-match js,php,aspx

# Filter out specific file extensions
katana -u https://example.com -extension-filter png,jpg,gif

# Note: there is no MIME-type flag; filter on extensions or post-process JSONL output instead

Form Filling

bash
# Enable automatic form filling (experimental)
katana -u https://example.com -automatic-form-fill

# Use custom form values via a form configuration file
katana -u https://example.com -automatic-form-fill -form-config form-config.yaml

JavaScript Parsing

bash
# Enable endpoint parsing in JavaScript files
katana -u https://example.com -js-crawl

# Crawl with a headless browser for full JavaScript rendering
katana -u https://example.com -headless

# Point headless crawling at a specific Chrome binary
katana -u https://example.com -headless -system-chrome-path /path/to/chrome

Performance Optimization

Concurrency and Rate Limiting

bash
# Set the number of concurrent fetchers (default: 10)
katana -u https://example.com -concurrency 20

# Set delay between requests (seconds)
katana -u https://example.com -delay 1

# Set rate limit (requests per second, default: 150)
katana -u https://example.com -rate-limit 50

Timeout Options

bash
# Set timeout for HTTP requests (seconds, default: 10)
katana -u https://example.com -timeout 10

# Increase the timeout for slow, JavaScript-heavy targets
katana -u https://example.com -headless -timeout 30

Optimization for Large Scans

bash
# Form filling and JavaScript parsing are disabled by default;
# leaving them off keeps crawls fast

# Cap the crawl duration per target (seconds by default)
katana -u https://example.com -crawl-duration 600

# Skip recrawling the same path with different query-param values
katana -u https://example.com -ignore-query-params

Integration with Other Tools

Pipeline with Subfinder

bash
# Find subdomains and crawl them
subfinder -d example.com -silent | katana -silent

# Find subdomains, crawl them, and extract JavaScript file URLs
subfinder -d example.com -silent | katana -silent -extension-match js

Pipeline with HTTPX

bash
# Probe URLs and crawl active ones
httpx -l urls.txt -silent | katana -silent

# Crawl and then probe discovered endpoints
katana -u https://example.com -silent | httpx -silent

Pipeline with Nuclei

bash
# Crawl and scan for vulnerabilities
katana -u https://example.com -silent | nuclei -t cves/

# Crawl, extract JavaScript file URLs, and scan them for exposures
katana -u https://example.com -silent -extension-match js | nuclei -t exposures/

Output Customization

Custom Output Format

bash
# Output only URLs
katana -u https://example.com -silent

# Output URLs with specific fields
katana -u https://example.com -field url,path -o results.txt

# Count discovered URLs
katana -u https://example.com -silent | wc -l

# Sort output alphabetically
katana -u https://example.com -silent | sort

Filtering Output

bash
# Filter by file extension
katana -u https://example.com -silent | grep "\.js$"

# Filter by endpoint pattern
katana -u https://example.com -silent | grep "/api/"

# Find unique domains
katana -u https://example.com -silent | awk -F/ '{print $3}' | sort -u

Advanced Filtering

URL Pattern Matching

bash
# Match specific URL patterns
katana -u https://example.com -match-regex "^https://example.com/admin"

# Filter out specific URL patterns
katana -u https://example.com -filter-regex "^https://example.com/static"

# Match URLs containing specific query parameters
katana -u https://example.com -match-regex "id=[0-9]+"

Content Filtering

bash
# Match responses by status code (DSL-based condition)
katana -u https://example.com -match-condition 'status_code == 200'

# Filter out responses by status code
katana -u https://example.com -filter-condition 'status_code == 404'

# DSL conditions can be combined with && and ||

Proxy and Network Options

bash
# Use HTTP proxy
katana -u https://example.com -proxy http://127.0.0.1:8080

# Use SOCKS5 proxy
katana -u https://example.com -proxy socks5://127.0.0.1:1080

# Set custom headers
katana -u https://example.com -H "User-Agent: Mozilla/5.0" -H "Authorization: Bearer token123"

# Set cookies via a header
katana -u https://example.com -H "Cookie: session=123456; user=admin"

Miscellaneous Features

Crawling Specific Paths

bash
# Seed the crawl with specific paths by passing full URLs
katana -u https://example.com/admin,https://example.com/login

# Seed the crawl from a file of full URLs
katana -list urls.txt

Storing Responses

bash
# Store all responses
katana -u https://example.com -store-response

# Specify response storage directory
katana -u https://example.com -store-response -store-response-dir responses/
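
Stored responses can then be searched offline. A quick sketch (Katana creates the layout under responses/ per crawled host):

bash
# Grep saved responses for potentially sensitive strings
grep -rEi "api[_-]?key|secret|token" responses/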

Troubleshooting

Common Issues

  1. JavaScript Parsing Issues

    bash
    # Increase the request timeout for slow targets
    katana -u https://example.com -headless -timeout 30
    
    # Specify a Chrome binary manually
    katana -u https://example.com -headless -system-chrome-path /usr/bin/google-chrome
  2. Rate Limiting by Target

    bash
    # Reduce concurrency
    katana -u https://example.com -concurrency 5
    
    # Add a delay between requests (seconds)
    katana -u https://example.com -delay 2
  3. Memory Issues

    bash
    # Cap the crawl duration (seconds by default)
    katana -u https://example.com -crawl-duration 300
    
    # Limit the response size read into memory (bytes)
    katana -u https://example.com -max-response-size 1000000
  4. Crawling Scope Issues

    bash
    # Restrict crawling to the exact FQDN
    katana -u https://example.com -field-scope fqdn
    
    # Allow crawling the root domain and its subdomains (default)
    katana -u https://example.com -field-scope rdn

Debugging

bash
# Enable verbose mode
katana -u https://example.com -v

# Show debug information
katana -u https://example.com -debug

# Log request errors to a file
katana -u https://example.com -error-log errors.txt

Configuration

Configuration File

Katana reads its configuration from $HOME/.config/katana/config.yaml by default; an alternative file can be supplied with the -config flag. You can customize various settings in this file:

yaml
# Example configuration file (keys mirror the CLI flag names)
concurrency: 10
delay: 1
timeout: 10
depth: 3
field-scope: rdn
crawl-duration: 0
field: url,path
extension-match: js,php,aspx

Custom Configuration File

bash
# Point Katana at a custom configuration file
katana -u https://example.com -config /path/to/config.yaml

Reference

Command Line Options

Flag                        Description
-u, -url                    Target URL to crawl
-list, -l                   File containing a list of URLs to crawl
-o, -output                 File to write output to
-j, -jsonl                  Write output in JSONL format
-silent                     Show only URLs in output
-v, -verbose                Show verbose output
-d, -depth                  Maximum depth to crawl (default: 3)
-fs, -field-scope           Pre-defined crawl scope (dn, rdn, fqdn)
-cs, -crawl-scope           In-scope URL regex to follow
-cos, -crawl-out-scope      Out-of-scope URL regex to exclude
-jc, -js-crawl              Parse endpoints in JavaScript files
-hl, -headless              Enable headless hybrid crawling
-kf, -known-files           Crawl known files (all, robotstxt, sitemapxml)
-f, -field                  Fields to display in output
-em, -extension-match       File extensions to match in output
-ef, -extension-filter      File extensions to filter from output
-mr, -match-regex           Regex to match on output URLs
-fr, -filter-regex          Regex to filter output URLs
-mdc, -match-condition      DSL condition to match responses
-fdc, -filter-condition     DSL condition to filter responses
-aff, -automatic-form-fill  Enable automatic form filling (experimental)
-fc, -form-config           Path to a custom form configuration file
-scp, -system-chrome-path   Chrome binary to use for headless crawling
-c, -concurrency            Number of concurrent fetchers (default: 10)
-rd, -delay                 Delay between requests in seconds
-rl, -rate-limit            Maximum requests per second (default: 150)
-timeout                    Timeout for HTTP requests in seconds (default: 10)
-ct, -crawl-duration        Maximum duration to crawl each target
-proxy                      HTTP/SOCKS5 proxy to use
-H, -headers                Custom headers/cookies to add to all requests
-sr, -store-response        Store HTTP requests and responses
-srd, -store-response-dir   Directory to store responses in
-config                     Path to the Katana configuration file
-version                    Show Katana version

Crawling Scopes

Scope   Description
fqdn    Crawl only the exact fully qualified domain name
rdn     Crawl the root domain name and all of its subdomains (default)
dn      Crawl any host matching the domain name keyword

Crawling Modes

Mode          Description
standard      Default crawling with raw HTTP requests
-headless     Hybrid crawling driving a headless browser for JavaScript-heavy sites
-js-crawl     Endpoint parsing inside JavaScript files
-known-files  Crawling of robots.txt and sitemap.xml

Field Options

Field   Description
url     Full URL
path    URL path
fqdn    Fully qualified domain name
rdn     Root domain name
rurl    Root URL
qurl    URL including query parameters
qpath   Path including query parameters
file    Filename in the URL
ufile   URL with the filename
key     Query parameter keys
value   Query parameter values
kv      Query parameter keys and values
dir     URL directory
udir    URL with the directory
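
As an example of field selection in practice, the qurl field collects URLs that carry query parameters, a convenient seed list for later parameter testing:

bash
# Collect unique parameterized URLs
katana -u https://example.com -silent -field qurl | sort -u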

Resources
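
Official repository: https://github.com/projectdiscovery/katana
Project Discovery documentation: https://docs.projectdiscovery.io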


This cheat sheet provides a comprehensive reference for using Katana, from basic crawling to advanced filtering and integration with other tools. For the most up-to-date information, always refer to the official documentation.