
Katana Web Crawler Cheat Sheet

Overview

Katana is a fast and customizable web crawling framework developed by Project Discovery. It's designed to crawl websites efficiently to gather information and discover endpoints. Katana stands out from other web crawlers due to its speed, flexibility, and focus on security testing use cases.

What makes Katana unique is its ability to intelligently crawl modern web applications, including single-page applications (SPAs) that rely heavily on JavaScript. It can handle complex web technologies and extract valuable information such as URLs, JavaScript files, API endpoints, and other web assets. Katana is built with security professionals in mind, making it an excellent tool for reconnaissance during security assessments and bug bounty hunting.

Katana supports various crawling strategies, including standard crawling, JavaScript parsing, and sitemap-based crawling. It can be customized to focus on specific types of resources or follow particular patterns, making it adaptable to different security testing scenarios. The tool is designed to be easily integrated into security testing workflows and can be combined with other Project Discovery tools for comprehensive reconnaissance.
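
As a quick illustration of that workflow, here is a minimal reconnaissance pipeline chaining Katana with other Project Discovery tools (assuming subfinder, httpx, and nuclei are installed; each stage is covered in detail later in this sheet):

bash
# Enumerate subdomains, probe for live hosts, crawl them, and scan
# the discovered endpoints
subfinder -d example.com -silent \
  | httpx -silent \
  | katana -silent \
  | nuclei -t exposures/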

Installation

Using Go

bash
# Install using Go (requires Go 1.20 or later)
go install -v github.com/projectdiscovery/katana/cmd/katana@latest

# Verify installation
katana -version

Using Docker

bash
# Pull the latest Docker image
docker pull projectdiscovery/katana:latest

# Run Katana using Docker
docker run -it projectdiscovery/katana:latest -h

Using Homebrew (macOS)

bash
# Install using Homebrew
brew install katana

# Verify installation
katana -version

Using PDTM (Project Discovery Tools Manager)

bash
# Install PDTM first if not already installed
go install -v github.com/projectdiscovery/pdtm/cmd/pdtm@latest

# Install Katana using PDTM
pdtm -i katana

# Verify installation
katana -version

On Kali Linux

bash
# Install using apt
sudo apt install katana

# Verify installation
katana -version

Basic Usage

Crawling a Single URL

bash
# Crawl a single URL
katana -u https://example.com

# Crawl with increased verbosity
katana -u https://example.com -v

# Crawl with debug information
katana -u https://example.com -debug

Crawling Multiple URLs

bash
# Crawl multiple URLs
katana -u https://example.com,https://test.com

# Crawl from a list of URLs
katana -list urls.txt

# Crawl from STDIN
cat urls.txt | katana

Output Options

bash
# Save results to a file
katana -u https://example.com -o results.txt

# Output in JSONL format
katana -u https://example.com -jsonl -o results.jsonl

# Silent mode (only URLs)
katana -u https://example.com -silent
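
JSONL output pairs well with jq for post-processing. A minimal sketch, assuming the default JSONL schema in which each line carries a request object with an endpoint field:

bash
# Extract only the discovered endpoints from JSONL output
katana -u https://example.com -jsonl -silent | jq -r '.request.endpoint'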

Crawling Options

Crawling Depth and Scope

bash
# Set crawling depth (default: 3)
katana -u https://example.com -depth 3

# Scope crawling to the root domain and its subdomains (default)
katana -u https://example.com -field-scope rdn

# Scope crawling to the exact FQDN only
katana -u https://example.com -field-scope fqdn

# Control scope with custom regexes
katana -u https://example.com -crawl-scope "admin" -crawl-out-scope "logout"

Crawling Strategies

bash
# Standard crawling (raw HTTP requests) is the default mode
katana -u https://example.com

# Parse endpoints inside JavaScript files
katana -u https://example.com -js-crawl

# Crawl known files such as robots.txt and sitemap.xml
katana -u https://example.com -known-files all

# Use headless hybrid crawling for JavaScript-heavy applications
katana -u https://example.com -headless

Field Selection

bash
# Display specific fields
katana -u https://example.com -field url,path,fqdn

# Available fields: url, path, fqdn, rdn, rurl, qurl, qpath, file, ufile, key, value, kv, dir, udir

Advanced Usage

URL Filtering

bash
# Match URLs by regex
katana -u https://example.com -match-regex "admin|login|dashboard"

# Filter URLs by regex
katana -u https://example.com -filter-regex "logout|static|images"

# Match responses with a DSL-based condition
katana -u https://example.com -match-condition 'status_code == 200'

Resource Filtering

bash
# Match output for specific file extensions
katana -u https://example.com -extension-match js,php,aspx

# Filter out specific file extensions
katana -u https://example.com -extension-filter png,jpg,gif

# Note: there is no MIME-type flag; filter on extensions or post-process JSONL output instead

Form Filling

bash
# Enable automatic form filling (experimental)
katana -u https://example.com -automatic-form-fill

# Use custom form values via a form configuration file
katana -u https://example.com -automatic-form-fill -form-config form-config.yaml

JavaScript Parsing

bash
# Enable endpoint parsing in JavaScript files
katana -u https://example.com -js-crawl

# Crawl with a headless browser for full JavaScript rendering
katana -u https://example.com -headless

# Point headless crawling at a specific Chrome binary
katana -u https://example.com -headless -system-chrome-path /path/to/chrome

Performance Optimization

Concurrency and Rate Limiting

bash
# Set the number of concurrent fetchers (default: 10)
katana -u https://example.com -concurrency 20

# Set delay between requests (seconds)
katana -u https://example.com -delay 1

# Set rate limit (requests per second, default: 150)
katana -u https://example.com -rate-limit 50

Timeout Options

bash
# Set timeout for HTTP requests (seconds, default: 10)
katana -u https://example.com -timeout 10

# Increase the timeout for slow, JavaScript-heavy targets
katana -u https://example.com -headless -timeout 30

Optimization for Large Scans

bash
# Form filling and JavaScript parsing are disabled by default;
# leaving them off keeps crawls fast

# Cap the crawl duration per target (seconds by default)
katana -u https://example.com -crawl-duration 600

# Skip recrawling the same path with different query-param values
katana -u https://example.com -ignore-query-params

Integration with Other Tools

Pipeline with Subfinder

bash
# Find subdomains and crawl them
subfinder -d example.com -silent | katana -silent

# Find subdomains, crawl them, and extract JavaScript file URLs
subfinder -d example.com -silent | katana -silent -extension-match js

Pipeline with HTTPX

bash
# Probe URLs and crawl active ones
httpx -l urls.txt -silent | katana -silent

# Crawl and then probe discovered endpoints
katana -u https://example.com -silent | httpx -silent

Pipeline with Nuclei

bash
# Crawl and scan for vulnerabilities
katana -u https://example.com -silent | nuclei -t cves/

# Crawl, extract JavaScript file URLs, and scan them for exposures
katana -u https://example.com -silent -extension-match js | nuclei -t exposures/

Output Customization

Custom Output Format

bash
# Output only URLs
katana -u https://example.com -silent

# Output URLs with specific fields
katana -u https://example.com -field url,path -o results.txt

# Count discovered URLs
katana -u https://example.com -silent | wc -l

# Sort output alphabetically
katana -u https://example.com -silent | sort

Filtering Output

bash
# Filter by file extension
katana -u https://example.com -silent | grep "\.js$"

# Filter by endpoint pattern
katana -u https://example.com -silent | grep "/api/"

# Find unique domains
katana -u https://example.com -silent | awk -F/ '{print $3}' | sort -u

Advanced Filtering

URL Pattern Matching

bash
# Match specific URL patterns
katana -u https://example.com -match-regex "^https://example.com/admin"

# Filter out specific URL patterns
katana -u https://example.com -filter-regex "^https://example.com/static"

# Match URLs containing specific query parameters
katana -u https://example.com -match-regex "id=[0-9]+"

Content Filtering

bash
# Match responses by status code (DSL-based condition)
katana -u https://example.com -match-condition 'status_code == 200'

# Filter out responses by status code
katana -u https://example.com -filter-condition 'status_code == 404'

# DSL conditions can be combined with && and ||

Proxy and Network Options

bash
# Use HTTP proxy
katana -u https://example.com -proxy http://127.0.0.1:8080

# Use SOCKS5 proxy
katana -u https://example.com -proxy socks5://127.0.0.1:1080

# Set custom headers
katana -u https://example.com -H "User-Agent: Mozilla/5.0" -H "Authorization: Bearer token123"

# Set cookies via a header
katana -u https://example.com -H "Cookie: session=123456; user=admin"

Miscellaneous Features

Crawling Specific Paths

bash
# Seed the crawl with specific paths by passing full URLs
katana -u https://example.com/admin,https://example.com/login

# Seed the crawl from a file of full URLs
katana -list urls.txt

Storing Responses

bash
# Store all responses
katana -u https://example.com -store-response

# Specify response storage directory
katana -u https://example.com -store-response -store-response-dir responses/
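
Stored responses can then be searched offline. A quick sketch (Katana creates the layout under responses/ per crawled host):

bash
# Grep saved responses for potentially sensitive strings
grep -rEi "api[_-]?key|secret|token" responses/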

Troubleshooting

Common Issues

  1. JavaScript Parsing Issues

    bash
    # Increase the request timeout for slow targets
    katana -u https://example.com -headless -timeout 30
    
    # Specify a Chrome binary manually
    katana -u https://example.com -headless -system-chrome-path /usr/bin/google-chrome
  2. Rate Limiting by Target

    bash
    # Reduce concurrency
    katana -u https://example.com -concurrency 5
    
    # Add a delay between requests (seconds)
    katana -u https://example.com -delay 2
  3. Memory Issues

    bash
    # Cap the crawl duration (seconds by default)
    katana -u https://example.com -crawl-duration 300
    
    # Limit the response size read into memory (bytes)
    katana -u https://example.com -max-response-size 1000000
  4. Crawling Scope Issues

    bash
    # Restrict crawling to the exact FQDN
    katana -u https://example.com -field-scope fqdn
    
    # Allow crawling the root domain and its subdomains (default)
    katana -u https://example.com -field-scope rdn

Debugging

bash
# Enable verbose mode
katana -u https://example.com -v

# Show debug information
katana -u https://example.com -debug

# Log request errors to a file
katana -u https://example.com -error-log errors.txt

Configuration

Configuration File

Katana reads its configuration from $HOME/.config/katana/config.yaml by default; an alternative file can be supplied with the -config flag. You can customize various settings in this file:

yaml
# Example configuration file (keys mirror the CLI flag names)
concurrency: 10
delay: 1
timeout: 10
depth: 3
field-scope: rdn
crawl-duration: 0
field: url,path
extension-match: js,php,aspx

Custom Configuration File

bash
# Point Katana at a custom configuration file
katana -u https://example.com -config /path/to/config.yaml

Reference

Command Line Options

Flag                        Description
-u, -url                    Target URL to crawl
-list, -l                   File containing a list of URLs to crawl
-o, -output                 File to write output to
-j, -jsonl                  Write output in JSONL format
-silent                     Show only URLs in output
-v, -verbose                Show verbose output
-d, -depth                  Maximum depth to crawl (default: 3)
-fs, -field-scope           Pre-defined crawl scope (dn, rdn, fqdn)
-cs, -crawl-scope           In-scope URL regex to follow
-cos, -crawl-out-scope      Out-of-scope URL regex to exclude
-jc, -js-crawl              Parse endpoints in JavaScript files
-hl, -headless              Enable headless hybrid crawling
-kf, -known-files           Crawl known files (all, robotstxt, sitemapxml)
-f, -field                  Fields to display in output
-em, -extension-match       File extensions to match in output
-ef, -extension-filter      File extensions to filter from output
-mr, -match-regex           Regex to match on output URLs
-fr, -filter-regex          Regex to filter output URLs
-mdc, -match-condition      DSL condition to match responses
-fdc, -filter-condition     DSL condition to filter responses
-aff, -automatic-form-fill  Enable automatic form filling (experimental)
-fc, -form-config           Path to a custom form configuration file
-scp, -system-chrome-path   Chrome binary to use for headless crawling
-c, -concurrency            Number of concurrent fetchers (default: 10)
-rd, -delay                 Delay between requests in seconds
-rl, -rate-limit            Maximum requests per second (default: 150)
-timeout                    Timeout for HTTP requests in seconds (default: 10)
-ct, -crawl-duration        Maximum duration to crawl each target
-proxy                      HTTP/SOCKS5 proxy to use
-H, -headers                Custom headers/cookies to add to all requests
-sr, -store-response        Store HTTP requests and responses
-srd, -store-response-dir   Directory to store responses in
-config                     Path to the Katana configuration file
-version                    Show Katana version

Crawling Scopes

Scope   Description
fqdn    Crawl only the exact fully qualified domain name
rdn     Crawl the root domain name and all of its subdomains (default)
dn      Crawl any host matching the domain name keyword

Crawling Modes

Mode          Description
standard      Default crawling with raw HTTP requests
-headless     Hybrid crawling driving a headless browser for JavaScript-heavy sites
-js-crawl     Endpoint parsing inside JavaScript files
-known-files  Crawling of robots.txt and sitemap.xml

Field Options

Field   Description
url     Full URL
path    URL path
fqdn    Fully qualified domain name
rdn     Root domain name
rurl    Root URL
qurl    URL including query parameters
qpath   Path including query parameters
file    Filename in the URL
ufile   URL with the filename
key     Query parameter keys
value   Query parameter values
kv      Query parameter keys and values
dir     URL directory
udir    URL with the directory
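
As an example of field selection in practice, the qurl field collects URLs that carry query parameters, a convenient seed list for later parameter testing:

bash
# Collect unique parameterized URLs
katana -u https://example.com -silent -field qurl | sort -u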

Resources
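
Official repository: https://github.com/projectdiscovery/katana
Project Discovery documentation: https://docs.projectdiscovery.io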


This cheat sheet provides a comprehensive reference for using Katana, from basic crawling to advanced filtering and integration with other tools. For the most up-to-date information, always refer to the official documentation.