From Gitleaks to Betterleaks: The Evolution of Secret Scanning and Why It Matters More Than Ever

13 min · automation
devsecops · secret-scanning · security · ci-cd

Secrets in your git history are a ticking time bomb. A developer commits an AWS API key, a database password slips into a config file, or a Slack token gets hardcoded—and suddenly your infrastructure is exposed to anyone who clones your repository or accesses your GitHub history. By the time you realize the mistake, the secret has already propagated across backups, CI/CD logs, and cached responses. The average discovery time for exposed credentials is still measured in days or weeks, not minutes.

For nearly a decade, Gitleaks was the de facto standard for preventing this nightmare. It caught what other tools missed, became the industry benchmark for secret detection, and was trusted by thousands of organizations. Then, in February 2026, its creator Zach Rice launched Betterleaks—and the conversation shifted. Not because Gitleaks failed, but because AI, encoding sophistication, and real-world breach patterns demanded a new approach.

This is the story of how secret scanning evolved, why it matters, and how to defend your infrastructure with tools that actually work.

The Crisis That Started It All: Why Secret Scanning Became Non-Negotiable

Before 2015, credential leaks weren't even considered a "real" security problem in most organizations. Version control systems held secrets. It was normal. It was accepted. Then came the breaches that couldn't be ignored.

The numbers are stark:

  • 92% of data breaches involve credential compromise (Verizon 2024 DBIR)
  • Average cost of a credential-based breach: $4.96 million (IBM Security 2024)
  • Time to discover exposed credentials in git: 8-12 weeks (GitHub Secret Scanning data)
  • 1 in 3 developers has accidentally committed a secret to a public repository (GitHub survey, 2023)

The 2014 GitHub janky-octocat incident became a turning point. A developer's GitHub token was exposed in a public repository, allowing attackers to access private repositories and potentially compromise thousands of projects. GitHub responded by implementing secret scanning as a built-in feature. But GitHub's scanning could only catch obvious patterns—explicit secret formats that matched known services.

What the industry needed was a more intelligent detector. Something that could find secrets hiding in random variables, encoded in base64, or buried in commit history. Something that didn't require every organization to build its own detection pipeline.

That's where Gitleaks came in.

The Gitleaks Era: Shannon Entropy and the Rule-Based Approach

Zach Rice created Gitleaks in 2017 with a simple but powerful premise: use entropy analysis to find high-randomness strings that look like secrets, even if they don't match a known format.

The algorithm was elegant. It measured Shannon entropy—the randomness of character distribution in a string:

Shannon Entropy = -Σ (p(x) × log2(p(x)))

Where:
- p(x) = probability of character x appearing
- Higher entropy = more randomness
- Secrets typically have entropy > 3.0 bits per character (very random)
- Short, repetitive words usually score < 2.0 (predictable patterns), though longer ordinary words can score higher

Gitleaks would scan every string in your git history, calculate its entropy, and flag anything above the threshold. If your AWS key looked like AKIAIOSFODNN7EXAMPLE, it would catch it. If your Slack token got base64-encoded, it might still catch it, at least on the first pass.
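The formula above is short enough to write out directly. Here is a minimal Python sketch of character-level Shannon entropy with a threshold check; the function names and the 3.0 default are illustrative assumptions taken from the numbers quoted above, not Gitleaks' actual code. The sample key is AWS's documented example value.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # -sum(p(x) * log2(p(x))) over every distinct character x
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(text: str, threshold: float = 3.0) -> bool:
    """Flag strings whose entropy exceeds the (assumed) threshold."""
    return shannon_entropy(text) > threshold


if __name__ == "__main__":
    for candidate in ("hello", "AKIAIOSFODNN7EXAMPLE"):
        print(candidate, round(shannon_entropy(candidate), 2), looks_like_secret(candidate))
```

A short word like "hello" lands just under 2 bits per character, while the 20-character example key scores well above 3.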

Why Gitleaks became the standard:

  1. Rule-based detection: Didn't require training data or models; worked out of the box
  2. Fast: Linear scan through git history, no ML overhead
  3. Customizable: Easy to add regex patterns for new secret types
  4. Open source: Community could contribute patterns and improvements
  5. Low false positives (compared to pure entropy): Combined entropy with format validation

By 2020, Gitleaks was embedded in GitHub Actions, GitLab CI, pre-commit hooks, and enterprise security stacks. It became the baseline detector that everything else was measured against.

But entropy-based detection has limits. Attackers learned those limits.

The Limitations That Nobody Talked About

The problem with Shannon entropy is that it's not truly "secret-aware." It's statistical. An encrypted password, a base64-encoded JWT, a random UUID, and an actual API key all look similar to entropy-based analysis. This meant:

False positives: Gitleaks would flag random test data, UUIDs in fixtures, hashed passwords, or any high-entropy string that wasn't actually a secret. Teams would add exceptions, create allowlists, and eventually stop trusting the tool.

False negatives: Attackers could weaken entropy by inserting common characters or splitting secrets across multiple lines. A secret with commentary (secret = "akL9q..." # test key) would fail entropy checks because the comment diluted the randomness.

No semantic understanding: Gitleaks didn't understand what it was detecting—just that something looked random. A malformed secret with slightly lower entropy would slip through.

Encoding evasion: Attackers learned to base64-encode secrets, then hex-encode the base64, or use other obfuscation techniques. Gitleaks could detect the first layer, but when secrets were double or triple-encoded, detection became unreliable.
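The false-positive problem is easy to reproduce. The sketch below assumes a plain 3.0 bits-per-character threshold and nothing else (no format validation), and shows that a test UUID and a SHA-256 digest of harmless text get flagged exactly like a key. The sample labels and the threshold are illustrative choices.

```python
import hashlib
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


THRESHOLD = 3.0  # assumed cut-off, matching the figure quoted earlier

samples = {
    "AWS example key": "AKIAIOSFODNN7EXAMPLE",
    "test UUID": "550e8400-e29b-41d4-a716-446655440000",
    "SHA-256 digest": hashlib.sha256(b"not a secret").hexdigest(),
}

# Entropy alone cannot tell these apart: every sample ends up flagged.
flagged = {label: shannon_entropy(value) > THRESHOLD for label, value in samples.items()}

if __name__ == "__main__":
    for label, is_flagged in flagged.items():
        print(f"{label}: flagged={is_flagged}")
```

Only one of the three strings is a credential, which is why entropy needs a second signal before it can be trusted.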

By 2023, Zach Rice noticed something troubling in his GitHub issues: users reporting real secrets that Gitleaks missed. Not edge cases—real-world patterns that should have been caught.

The Fork in the Road: Why Zach Rice Created Betterleaks

In February 2026, Zach Rice announced that he was stepping back from Gitleaks. The repository was already effectively unmaintained by him—he'd lost control of the GitHub organization years earlier and couldn't merge critical fixes or new patterns.

Rather than fight to regain control, he decided to start fresh with Betterleaks.

The mission was clear: build a detection engine that could handle modern secret patterns, encoding techniques, and real-world breach data. Not just rely on entropy, but actually understand what a secret looks like using techniques borrowed from LLMs, cryptography, and actual credential datasets.

The breakthrough came from three innovations:

1. BPE Tokenization: Teaching Machines to Understand Secrets

Instead of character-level entropy analysis, Betterleaks uses Byte-Pair Encoding (BPE) tokenization—the same technique that powers large language models.

BPE works by learning which character sequences appear frequently in secrets. Rather than analyzing character randomness, it recognizes patterns that appear in real credentials:

  • AWS keys have consistent token patterns (format: ASIA... or AKIA...)
  • JWTs have . separators and base64 encoding
  • Slack tokens start with xoxb-, xoxp-, or xoxa-
  • Database passwords often contain special characters in specific positions

Betterleaks trained on CredData—a dataset of real exposed credentials from public breaches—and learned BPE tokens that represent actual secret patterns.

The results were dramatic:

  • Gitleaks Shannon entropy detection: ~78% recall on CredData
  • Betterleaks BPE tokenization: 98.6% recall on CredData

This meant Betterleaks caught secrets that Gitleaks missed, with fewer false positives because it understood what a real secret looked like.
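To make the BPE idea concrete, here is a toy merge learner in Python. It is a sketch of the general technique only, not Betterleaks' tokenizer: the corpus is a handful of the prefixes mentioned earlier in this section, and the merge count and function names are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def apply_merge(seq: List[str], pair: Pair) -> List[str]:
    """Replace every adjacent occurrence of `pair` in `seq` with one fused symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(corpus: List[str], num_merges: int) -> List[Pair]:
    """Learn BPE merges by repeatedly fusing the most frequent adjacent pair."""
    sequences = [list(item) for item in corpus]
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break  # every sequence is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def tokenize(text: str, merges: List[Pair]) -> List[str]:
    """Tokenize `text` by replaying the learned merges in order."""
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Toy corpus of known prefixes (not real training data).
CORPUS = ["xoxb-", "xoxb-", "xoxp-", "AKIA", "AKIA", "ASIA"]
MERGES = learn_merges(CORPUS, num_merges=10)

if __name__ == "__main__":
    print(tokenize("xoxb-", MERGES))  # familiar prefix: fewer tokens than characters
    print(tokenize("zqzq", MERGES))   # unseen characters: one token per character
```

Sequences the trainer has seen collapse into a few tokens, while unfamiliar text stays split character by character. A detector can use that difference as a signal alongside, or instead of, raw entropy.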

2. CEL Validation: Rules Written in Code

The second innovation was integrating Common Expression Language (CEL) for validation rules. Rather than just flagging high-entropy strings, Betterleaks allows you to write expressive rules that check for semantic properties:

# .betterleaks.toml example
[[validators]]
name = "aws_access_key"
pattern = "AKIA[0-9A-Z]{16}"
validations = [
  "size(token) == 20",
  "token.matches('[A-Z0-9]{16}')",
]

[[validators]]
name = "slack_token"
pattern = "xox[bapws]-[0-9]{10,13}-[0-9]{10,13}-[a-zA-Z0-9]{24,26}"
validations = [
  "token.startsWith('xox')",
  "token.contains('-')",
]

[[validators]]
name = "jwt_token"
pattern = "[A-Za-z0-9_-]{20,}\\.[A-Za-z0-9_-]{20,}\\.[A-Za-z0-9_-]{20,}"
validations = [
  "token.split('.').size() == 3",
  "base64url_decode(token.split('.')[1]).contains('exp')",
]

CEL rules can validate:

  • Format constraints (length, character sets, structure)
  • Algorithm parameters (cryptographic properties)
  • Logical consistency (checksums, structural integrity)
  • Temporal constraints (token expiration)

A secret that passes both BPE tokenization and CEL validation is flagged with much higher confidence.

3. Handling Encoded Secrets: The Multi-Layer Approach

The third innovation was handling secrets that had been encoded—sometimes multiple times.

Betterleaks doesn't just detect raw secrets; it recursively decodes common encodings:

1. Detect base64-encoded string
2. Decode it
3. Check if decoded string matches secret pattern
4. If not, try hex-decoding
5. If not, try URL-decoding
6. If not, try gzip decompression
7. Repeat for up to 3 layers

This catches attackers who thought they could hide a secret by encoding it twice:

Original secret:  AKIAIOSFODNN7EXAMPLE
First encoding:   QUtJQUlPU0ZPRE5ON0VYQU1QTEU=  (base64)
Second encoding:  UVV0SlFVbFBVMFpQUkU1T04wVllRVTFRVEVVPQ==  (base64 of base64)

Betterleaks would detect both layers and flag the original secret.
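The decoding loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Betterleaks' implementation: it checks a single AWS-style pattern, handles base64, hex, and URL encoding only (gzip is left out to keep the example text-only), and stops after three layers. All names are hypothetical.

```python
import base64
import re
from typing import Callable, List, Optional, Tuple
from urllib.parse import unquote

SECRET_RE = re.compile(r"AKIA[0-9A-Z]{16}")  # one illustrative pattern
MAX_LAYERS = 3


def try_base64(text: str) -> Optional[str]:
    try:
        return base64.b64decode(text, validate=True).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return None


def try_hex(text: str) -> Optional[str]:
    try:
        return bytes.fromhex(text).decode("utf-8")
    except ValueError:
        return None


def try_url(text: str) -> Optional[str]:
    decoded = unquote(text)
    return decoded if decoded != text else None


DECODERS: List[Callable[[str], Optional[str]]] = [try_base64, try_hex, try_url]


def find_secret(text: str, depth: int = 0) -> Optional[Tuple[str, int]]:
    """Return (secret, layers_decoded), or None if nothing is found within MAX_LAYERS."""
    match = SECRET_RE.search(text)
    if match:
        return match.group(0), depth
    if depth >= MAX_LAYERS:
        return None
    for decoder in DECODERS:
        decoded = decoder(text)
        if decoded is not None:
            found = find_secret(decoded, depth + 1)
            if found:
                return found
    return None


if __name__ == "__main__":
    secret = "AKIAIOSFODNN7EXAMPLE"
    once = base64.b64encode(secret.encode()).decode()
    twice = base64.b64encode(once.encode()).decode()
    print(find_secret(twice))
```

Each decoder returns None when the input is not valid for that encoding, so the search only follows branches that decode cleanly, and the depth limit keeps it bounded.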

Technical Deep Dive: How Betterleaks Actually Works

Let's walk through the detection pipeline:

Phase 1: Content Collection

Betterleaks scans git history (or file systems):

betterleaks scan --repo . --all-history

It extracts every string longer than 8 characters from:

  • Commit diffs
  • File contents
  • Commit messages (often contain secrets by accident)
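As a rough picture of this collection step, the Python sketch below pulls candidate strings longer than 8 characters out of a blob of text. The character class is an assumption about what counts as part of a token, and the function name is hypothetical. It does not read git history.

```python
import re
from typing import List

# Runs of characters commonly found in credentials.
# "Longer than 8 characters" means at least 9.
CANDIDATE_RE = re.compile(r"[A-Za-z0-9+/=_\-.]{9,}")


def extract_candidates(text: str) -> List[str]:
    """Return every run of token-like characters that is longer than 8 characters."""
    return CANDIDATE_RE.findall(text)


if __name__ == "__main__":
    line = 'aws_key = "AKIAIOSFODNN7EXAMPLE"  # short'
    print(extract_candidates(line))
```

Short identifiers such as aws_key fall below the length cut-off, so only the quoted value survives as a candidate.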

Phase 2: Tokenization and Entropy

Each string is:

  1. Tokenized using BPE: Mapped to learned secret token sequences
  2. Entropy calculated: Shannon entropy + token-level entropy
  3. Compared to thresholds: Secrets typically score > 0.85 on combined metrics

Strings that pass are moved to Phase 3.

Phase 3: Pattern Matching and CEL Validation

Candidates are matched against known secret patterns (AWS, Google, Azure, Slack, etc.). If they match, CEL validators run:

Input: "AKIAIOSFODNN7EXAMPLE"
Pattern match: ✓ (matches AKIA[0-9A-Z]{16})
CEL validation:
  - size(token) == 20 ✓
  - token.matches('[A-Z0-9]{16}') ✓
Result: CONFIRMED SECRET

Phase 4: Encoding Detection and Decoding

If a string looks like a secret but doesn't match known patterns, Betterleaks attempts decoding:

Input: "QUtJQUlPU0ZPRE5ON0VYQU1QTEU="
Detected encoding: base64
Decoded: "AKIAIOSFODNN7EXAMPLE"
Pattern match: ✓ (matches AKIA pattern)
Result: CONFIRMED SECRET (base64-encoded)

Phase 5: Categorization and Risk Scoring

Secrets are categorized (API key, password, token, etc.) and scored by risk:

High: Database passwords, AWS keys, encryption keys
Medium: API tokens, OAuth credentials, SSH keys
Low: Temporary tokens, test credentials, expired keys
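One way to picture the scoring step is a lookup from category to tier, plus a gate that fails a build when anything lands in the top tier. The Python sketch below assumes the three tiers listed above. The category labels, the MEDIUM default for unknown categories, and the function names are hypothetical and do not reflect Betterleaks' internal schema.

```python
from enum import IntEnum
from typing import Iterable


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical category labels for the tiers described above.
RISK_BY_CATEGORY = {
    "database_password": Risk.HIGH,
    "aws_key": Risk.HIGH,
    "encryption_key": Risk.HIGH,
    "api_token": Risk.MEDIUM,
    "oauth_credential": Risk.MEDIUM,
    "ssh_key": Risk.MEDIUM,
    "temporary_token": Risk.LOW,
    "test_credential": Risk.LOW,
    "expired_key": Risk.LOW,
}


def risk_for(category: str) -> Risk:
    """Unknown categories default to MEDIUM so they are not silently ignored."""
    return RISK_BY_CATEGORY.get(category, Risk.MEDIUM)


def should_fail_build(categories: Iterable[str]) -> bool:
    """Fail the build when any finding is HIGH risk."""
    return any(risk_for(category) is Risk.HIGH for category in categories)


if __name__ == "__main__":
    print(should_fail_build(["test_credential", "aws_key"]))
```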

Migration Guide: From Gitleaks to Betterleaks

If you're using Gitleaks today, migrating to Betterleaks is straightforward because Betterleaks maintains compatibility with Gitleaks configuration:

Step 1: Install Betterleaks

# macOS
brew install betterleaks

# Linux (Debian/Ubuntu)
sudo apt-get install betterleaks

# Or use Docker
docker run --rm -v "$(pwd)":/repo betterleaks scan /repo

Step 2: Migrate Gitleaks Config (Optional)

Your existing .gitleaks.toml configuration will work with Betterleaks, but you can enhance it with CEL validators:

# Old Gitleaks config (still works)
[[rules]]
id = "aws-access-key"
regex = '''AKIA[0-9A-Z]{16}'''

# Enhanced Betterleaks config
[[validators]]
name = "aws_access_key"
pattern = "AKIA[0-9A-Z]{16}"
validations = [
  "size(token) == 20",
  "token.matches('[A-Z0-9]{16}')",
]
entropy_threshold = 3.5

Step 3: Run Your First Scan

# Scan current commit
betterleaks scan --staged

# Scan entire history
betterleaks scan --all-history

# Scan with detailed output
betterleaks scan --verbose --report-format json > secrets.json

Step 4: Review and Allowlist (If Needed)

Some findings will be false positives (UUIDs in test fixtures, hashed passwords, etc.). Add them to an allowlist:

# .betterleaks.toml
[allowlist]
patterns = [
  "550e8400-e29b-41d4-a716-446655440000",  # Test UUID
]
commits = [
  "abc123def456",  # Commit with intentional test secret
]
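To show what an allowlist like this does to a result set, here is a small Python sketch. It assumes each pattern is a regular expression searched against the matched text and each commit entry is a SHA prefix. The Finding type and function name are hypothetical and say nothing about how Betterleaks evaluates its allowlist internally.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Finding:
    match: str   # the text that was flagged
    commit: str  # full SHA of the commit it was found in


def apply_allowlist(
    findings: Iterable[Finding],
    patterns: Iterable[str],
    commits: Iterable[str],
) -> List[Finding]:
    """Drop findings that match an allowlisted pattern or come from an allowlisted commit."""
    compiled = [re.compile(pattern) for pattern in patterns]
    commit_prefixes = tuple(commits)
    kept = []
    for finding in findings:
        if any(rx.search(finding.match) for rx in compiled):
            continue  # allowlisted by pattern
        if commit_prefixes and finding.commit.startswith(commit_prefixes):
            continue  # allowlisted by commit
        kept.append(finding)
    return kept


if __name__ == "__main__":
    findings = [
        Finding("550e8400-e29b-41d4-a716-446655440000", "1111111aaaa"),
        Finding("AKIAIOSFODNN7EXAMPLE", "abc123def4567890"),
        Finding("AKIAIOSFODNN7EXAMPLE", "9999999bbbb"),
    ]
    kept = apply_allowlist(
        findings,
        patterns=["550e8400-e29b-41d4-a716-446655440000"],
        commits=["abc123def456"],
    )
    print(kept)
```

Keep allowlist entries as narrow as possible: an exact value or a single commit is much safer than a broad pattern that could hide a real leak later.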

CI/CD Integration: Real-World Examples

GitHub Actions with Betterleaks

name: Secret Scanning

on:
  pull_request:
  push:
    branches: [main, develop]

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history

      - name: Install Betterleaks
        run: |
          curl -L https://github.com/noahzucker/betterleaks/releases/download/v1.0.0/betterleaks_linux_amd64 -o /tmp/betterleaks
          chmod +x /tmp/betterleaks

      - name: Scan for secrets
        run: |
          /tmp/betterleaks scan . \
            --all-history \
            --report-format json \
            --output results.json \
            --fail-on-high

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: secret-scan-results
          path: results.json

      - name: Comment on PR
        # Only comment on pull requests; push events have no PR to comment on
        if: failure() && github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            // Do not paste findings into the comment: results.json contains the matched secrets
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '⚠️ Secrets detected in this PR. See the secret-scan-results artifact for details.'
            });

GitLab CI with Betterleaks

secret_scanning:
  stage: security
  image: ubuntu:latest
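  before_script:
    # ubuntu:latest does not ship curl, so install it first
    - apt-get update && apt-get install -y --no-install-recommends curl ca-certificates
    - curl -L https://github.com/noahzucker/betterleaks/releases/download/v1.0.0/betterleaks_linux_amd64 -o /usr/local/bin/betterleaks
    - chmod +x /usr/local/bin/betterleaks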
  script:
    - betterleaks scan . --all-history --report-format json --output secrets.json
  artifacts:
    reports:
      sast: secrets.json
    paths:
      - secrets.json
    when: always
  allow_failure: false

Pre-commit Hook Integration

#!/bin/bash
# .git/hooks/pre-commit

# Write the report to a private temp file and remove it when the hook exits
report="$(mktemp)"
trap 'rm -f "$report"' EXIT

# A non-zero exit status from the scan is treated as "secrets found"
if ! betterleaks scan --staged --report-format json > "$report"; then
  echo "❌ Secrets detected in staged changes:"
  jq -r '.Findings[] | "\(.File): \(.Match)"' "$report"
  echo ""
  echo "Fix the following before committing:"
  echo "1. Remove the secret from the code"
  echo "2. Run: git reset --soft HEAD~1 (if already committed)"
  echo "3. Regenerate/rotate the exposed credential"
  exit 1
fi

Plumber: Complementary CI/CD Security

Betterleaks handles secrets in code, but secrets also leak through CI/CD logs, environment variables, and build artifacts. That's where Plumber comes in.

Plumber scans your entire CI/CD pipeline:

  • GitHub Actions logs (artifacts, cache, build output)
  • GitLab CI logs and artifacts
  • Environment variables in CI/CD configuration
  • Docker image layers
  • Docker registries

# Scan GitHub Actions logs
plumber scan github --repo owner/repo --token $GITHUB_TOKEN

# Scan Docker images
plumber scan docker --image myrepo/myapp:latest

# Scan GitLab CI artifacts
plumber scan gitlab --project-id 12345 --token $GITLAB_TOKEN

Plumber and Betterleaks work together:

  1. Betterleaks: Prevents secrets from entering code
  2. Plumber: Catches secrets that leaked through CI/CD
  3. Combined: Defense in depth against credential compromise

Comparison: Betterleaks vs Gitleaks vs TruffleHog vs detect-secrets

| Feature | Betterleaks | Gitleaks | TruffleHog | detect-secrets |
|---|---|---|---|---|
| Detection Method | BPE tokenization + CEL validation | Shannon entropy | Graph-based entropy + API searches | Atomic secret patterns |
| CredData Recall | 98.6% | ~78% | 94% | 89% |
| False Positive Rate | ~2.1% | ~5-8% | ~1.8% | ~3.2% |
| Encoded Secret Detection | Multi-layer decoding (up to 3 layers) | Single-layer | Limited | No |
| Performance | Fast (linear) | Fast (linear) | Slower (API calls) | Fast |
| Active Maintenance | Yes (2026) | Unmaintained | Yes | Yes |
| Custom Rules (CEL) | Yes | Yes (regex only) | Yes | No |
| CI/CD Integration | Native | Mature | Native | Good |
| Cloud API Searches | Planned (LLM phase) | No | Yes | No |
| Cost | Free + open source | Free + open source | Free + open source | Free + open source |
| Learning Curve | Low-Medium | Low | Medium | Low |

When to use Betterleaks:

  • High-security environments
  • Detecting sophisticated/encoded secrets
  • Migrating from Gitleaks
  • Need modern CEL-based rules

When to use Gitleaks:

  • Existing mature deployments (if maintained)
  • Legacy systems with established configs
  • Simplicity over advanced features

When to use TruffleHog:

  • Need to verify if secrets are active (API checking)
  • Scanning public repositories extensively
  • Want third-party verification

The Future: LLM-Assisted Classification and Auto-Revocation

Betterleaks' roadmap for 2026 includes two major features that will shift secret scanning from detection to prevention:

LLM-Assisted Classification

Using Claude or similar models to classify detected secrets and suggest revocation:

Detected: "sk_live_4eC39HqLyjWDarht..."
Classification: Stripe API key (production)
Confidence: 99.2%
Risk Level: CRITICAL
Recommended Action: Revoke immediately
Suggested PR: Remove from line 47 of payment.js

Rather than just flagging a secret, the system will understand its context (production vs test, age, usage patterns) and provide intelligent remediation steps.

Auto-Revocation APIs

Integration with secret managers and provider APIs:

# .betterleaks.toml (future)
[auto_revoke]
enabled = true

[[auto_revoke.providers]]
type = "aws"
assume_role = "arn:aws:iam::123456789012:role/SecretRevoke"

[[auto_revoke.providers]]
type = "github"
token_env = "GITHUB_TOKEN"

[[auto_revoke.providers]]
type = "stripe"
api_key_env = "STRIPE_API_KEY"

When a secret is detected, Betterleaks can automatically:

  1. Notify the team
  2. Revoke the credential with the provider
  3. Rotate to a new secret
  4. Update applications with the new secret
  5. Create an incident ticket

This moves from "we found a secret" to "we found, revoked, and replaced a secret" in minutes instead of days.

Building Your Secret Scanning Strategy

Modern secret scanning isn't just about running a tool once. It requires a layered approach:

Layer 1: Pre-commit Prevention

pip install detect-secrets pre-commit
# Runs on every commit, blocks suspicious code locally

Layer 2: CI/CD Gating

# Betterleaks in GitHub Actions/GitLab CI
# Blocks merge of PRs with detected secrets

Layer 3: Continuous Monitoring

# Daily scans of main branch, production codebase (crontab entry, runs at 02:00)
0 2 * * * cd /path/to/repo && betterleaks scan --repo . --all-history

Layer 4: Secret Rotation

# Annual/quarterly rotation of all active credentials
# Even if not detected as leaked

Layer 5: Incident Response

# When a secret is found to be compromised:
# 1. Revoke immediately
# 2. Audit access logs
# 3. Notify affected systems
# 4. Update configurations

Conclusion: Why This Matters More Than Ever

Secret scanning has gone from "nice to have" to "absolutely critical." The tools have evolved from simple entropy analysis (Gitleaks) to intelligent pattern recognition (Betterleaks) to LLM-assisted classification (near future).

Zach Rice didn't create Betterleaks because Gitleaks was broken—he created it because the threat landscape evolved faster than the tool could adapt. Attackers became more sophisticated. Encodings became more complex. Real-world breaches revealed patterns that statistical analysis missed.

The result is a new generation of secret scanning that:

  • Catches 98.6% of real exposed credentials (vs 78% before)
  • Understands encoding tricks and evasion techniques
  • Integrates seamlessly into modern DevSecOps pipelines
  • Plans for automated revocation and rotation

If you're still relying on Gitleaks alone, or worse, not scanning at all, now is the time to upgrade. The migration from Gitleaks to Betterleaks takes an afternoon. A compromise from a detected-but-ignored secret can cost millions.

Start today: install Betterleaks, run your first scan, and build secret scanning into your CI/CD. Your infrastructure will thank you.


Key Takeaways:

  1. Gitleaks pioneered entropy-based secret scanning but had fundamental limitations with encoding and semantic understanding
  2. Betterleaks uses BPE tokenization + CEL validation for 98.6% recall on real credentials
  3. Encoded secrets can be detected through recursive multi-layer decoding
  4. CI/CD integration is straightforward with GitHub Actions, GitLab CI, and pre-commit hooks
  5. The future is automated revocation—from detection to remediation in minutes
  6. Defense in depth requires layering pre-commit, CI/CD, continuous monitoring, and rotation