eBPF for SREs: A Production Profiling Guide

Introduction

Production profiling has always been the domain of the brave. For years, SREs faced an uncomfortable tradeoff: either attach heavyweight profiling tools that introduced measurable latency and risked destabilizing production workloads, or fly blind and rely on metrics dashboards that showed symptoms but never root causes. The emergence of eBPF as a mainstream Linux kernel technology has fundamentally changed this calculus. With eBPF, you can instrument nearly every layer of the kernel and user space with negligible overhead, generating the kind of deep observability data that was previously only available in staging environments.

eBPF, which stands for extended Berkeley Packet Filter, evolved from its origins as a network packet filtering mechanism into a general-purpose in-kernel virtual machine. Programs written for eBPF are verified by the kernel before execution, ensuring they cannot crash the system, enter infinite loops, or access unauthorized memory. This safety guarantee is what makes eBPF uniquely suited for production profiling. You can attach probes to kernel functions, tracepoints, user-space function entries, and hardware performance counters without recompiling the kernel or restarting services.

This guide covers the practical workflows that SRE teams use daily to diagnose performance issues in production Linux systems. We will walk through CPU profiling, latency analysis, memory leak detection, network tracing, and I/O profiling using the standard eBPF toolchain. Every command and script shown here has been used in production environments running kernel 5.15 and newer.

Why eBPF Changes Production Profiling

Traditional profiling tools impose costs that make them impractical in production. Running strace on a busy service can slow it down by a factor of 100x or more because strace uses ptrace to intercept every syscall, context-switching between the traced process and the tracing process for each invocation. Even perf, which uses hardware performance counters and is far lighter than strace, still requires writing sample data to disk and can generate substantial I/O pressure under high sample rates.

eBPF changes the equation in three important ways. First, eBPF programs run inside the kernel, eliminating the context-switch overhead of ptrace-based tools. When you attach a kprobe or tracepoint, the eBPF program executes inline with the kernel code path, typically adding only tens of nanoseconds per invocation. Second, eBPF programs can aggregate data in-kernel using maps, which means you can compute histograms, counts, and summaries without copying every event to user space. A latency histogram that would generate millions of events per second with strace produces a single map read per interval with eBPF. Third, the eBPF verifier guarantees safety: your program cannot dereference null pointers, access out-of-bounds memory, or loop indefinitely.

The practical impact is dramatic. Tools like biolatency can trace every block I/O request on a system handling hundreds of thousands of IOPS and still produce a latency histogram with typically less than 1% CPU overhead. You can run funclatency against a function in your application server while serving peak traffic, provided its call rate is moderate enough that the per-probe cost stays small. This was simply not possible with previous generations of tracing tools.

Setting Up the eBPF Toolchain

The eBPF ecosystem has consolidated around three primary toolsets: bcc-tools, bpftrace, and libbpf-based CO-RE programs. Each serves a different use case, and a well-equipped SRE workstation should have all three available.

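On Ubuntu and Debian systems, install the full toolchain. On Ubuntu releases that do not ship a standalone bpftool package, bpftool comes from linux-tools-$(uname -r) instead: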

sudo apt-get update
sudo apt-get install -y bpfcc-tools bpftrace linux-headers-$(uname -r)
sudo apt-get install -y libbpf-dev bpftool

On RHEL and Fedora systems:

sudo dnf install -y bcc-tools bpftrace kernel-devel-$(uname -r)
sudo dnf install -y libbpf-devel bpftool

Verify your kernel supports BTF, which is required for modern bpftrace features and CO-RE programs:

ls /sys/kernel/btf/vmlinux
bpftool btf dump file /sys/kernel/btf/vmlinux format raw | head -c 100

If BTF is not available, you will need to ensure your kernel was compiled with CONFIG_DEBUG_INFO_BTF=y. Most distribution kernels from 2022 onward include BTF support.
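
The two checks above can be wrapped in a small preflight function. This is an illustrative sketch: the function name check_btf is hypothetical, and the /boot/config-$(uname -r) location for the kernel build configuration is an assumption that holds on many, but not all, distributions.

```shell
# check_btf [BTF_PATH]
# Hypothetical helper: succeeds if the kernel's BTF blob is readable.
# BTF_PATH defaults to the location used earlier in this guide.
check_btf() {
    btf_path="${1:-/sys/kernel/btf/vmlinux}"
    if [ -r "$btf_path" ]; then
        echo "BTF available: $btf_path"
        return 0
    fi
    echo "BTF missing: $btf_path"
    # Assumption: the kernel build config is installed under /boot.
    config="/boot/config-$(uname -r)"
    if [ -r "$config" ]; then
        grep '^CONFIG_DEBUG_INFO_BTF=' "$config" ||
            echo "CONFIG_DEBUG_INFO_BTF is not set in $config"
    fi
    return 1
}
```

Calling check_btf with no arguments tests the default path; a non-zero exit status is a signal to fall back to header-based tools or to a kernel built with BTF.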

The bcc-tools package provides dozens of ready-made tools that cover the most common profiling scenarios. These are typically installed as executables with a -bpfcc suffix on Debian-based systems or directly under /usr/share/bcc/tools/ on RHEL. The bpftrace tool provides a high-level scripting language for writing custom one-liners and short scripts. CO-RE (Compile Once, Run Everywhere) programs built on libbpf are used for production-grade, portable eBPF tools that ship as self-contained binaries.
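
Because the tool names differ between distributions, runbook scripts often resolve the path once instead of hard-coding it. The sketch below is one way to do that; bcc_tool is a hypothetical helper name, and it only knows about the two install layouts described above.

```shell
# bcc_tool NAME
# Hypothetical helper: print the path of a bcc tool, trying the
# Debian-style "-bpfcc" suffix first and the RHEL-style directory second.
bcc_tool() {
    name="$1"
    if command -v "${name}-bpfcc" >/dev/null 2>&1; then
        command -v "${name}-bpfcc"
    elif [ -x "/usr/share/bcc/tools/${name}" ]; then
        echo "/usr/share/bcc/tools/${name}"
    else
        echo "bcc tool not found: ${name}" >&2
        return 1
    fi
}
```

For example, a script can run "$(bcc_tool biolatency)" 10 1 and work unchanged on both families of distributions.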

CPU Profiling Workflows

CPU profiling is the most common starting point when investigating performance issues. The goal is to identify which functions consume the most CPU time, whether in kernel space or user space, and generate flame graphs that make hotspots visually obvious.

The simplest approach uses the bcc profile tool to sample stack traces at a fixed frequency:

sudo profile-bpfcc -F 99 -f --stack-storage-size 16384 30 > /tmp/cpu-stacks.txt

This samples all CPUs at 99 Hz for 30 seconds. The 99 Hz frequency avoids sampling in lockstep with timer-based activity that often runs at 100 Hz. The -f flag emits folded stacks, one line per unique stack followed by its sample count, which Brendan Gregg's flamegraph.pl consumes directly without a separate stackcollapse step:

git clone https://github.com/brendangregg/FlameGraph.git
FlameGraph/flamegraph.pl /tmp/cpu-stacks.txt > cpu-flame.svg
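
When a flame graph is not convenient, for example over an SSH session during an incident, a folded-stack file can also be summarized as text. The sketch below assumes the folded format that flamegraph.pl consumes (semicolon-separated frames followed by a space and a sample count); top_leaf_frames is a hypothetical helper name.

```shell
# top_leaf_frames FILE [N]
# Hypothetical helper: sum sample counts by leaf (innermost) frame in a
# folded-stack file and print the N busiest frames (default 10).
top_leaf_frames() {
    file="$1"
    n="${2:-10}"
    awk '
    NF == 0 { next }                      # ignore blank lines
    {
        count = $NF                       # last field is the sample count
        stack = $0
        sub(/ [0-9]+$/, "", stack)        # drop the trailing count
        depth = split(stack, frames, ";")
        leaf[frames[depth]] += count      # innermost frame is last
    }
    END { for (f in leaf) printf "%d %s\n", leaf[f], f }
    ' "$file" | sort -k1,1nr -k2,2 | head -n "$n"
}
```

Running top_leaf_frames /tmp/cpu-stacks.txt 5 lists the five functions in which the most samples landed, which is roughly what the widest plateaus at the top of a flame graph show.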

For more targeted profiling, bpftrace lets you restrict sampling to a specific process with a pid filter. The one-liner runs until you press Ctrl-C and then prints the aggregated stacks:

sudo bpftrace -e 'profile:hz:99 /pid == 12345/ { @[ustack(perf), comm] = count(); }' > stacks.bt

When you need to understand CPU scheduling behavior rather than execution time, the cpudist tool shows how long threads run on-CPU before being descheduled:

sudo cpudist-bpfcc -p 12345 10 1

This prints a histogram of on-CPU durations for process 12345 over a 10-second interval. Short on-CPU times combined with a high context-switch rate suggest that threads block frequently, for example on lock contention or I/O. Long on-CPU times with low throughput suggest computational bottlenecks.

For investigating CPU migration issues in NUMA systems, you can trace scheduler events:

sudo bpftrace -e 'tracepoint:sched:sched_migrate_task {
    printf("pid=%d comm=%s from_cpu=%d to_cpu=%d\n",
        args->pid, args->comm, args->orig_cpu, args->dest_cpu);
}'

Latency Analysis

Latency analysis is where eBPF truly shines because it can measure the time between arbitrary events with very little perturbation of the measured code path. The bcc-tools collection includes several purpose-built latency tools.

Block I/O latency is measured with biolatency, which traces the time from block I/O request to completion:

sudo biolatency-bpfcc -D 10 1

The -D flag breaks down latency by disk device, making it easy to identify which drives are slow. Output is a power-of-two histogram showing the distribution of latencies in microseconds.
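
To make the power-of-two bucketing concrete, the following user-space sketch groups a list of microsecond values into the same bucket boundaries that bcc histograms print (0 to 1, 2 to 3, 4 to 7, and so on). It is only an illustration of the bucketing arithmetic, not how biolatency is implemented, and pow2_hist is a hypothetical name.

```shell
# pow2_hist: read one non-negative integer per line on stdin and print
# counts per power-of-two bucket as "low -> high : count".
pow2_hist() {
    awk '
    NF == 0 { next }
    {
        v = int($1); lo = 0; hi = 1
        # Grow the bucket until it contains v: 0-1, 2-3, 4-7, 8-15, ...
        while (v > hi) { lo = hi + 1; hi = hi * 2 + 1 }
        n[lo] += 1; top[lo] = hi
        if (lo > max) max = lo
    }
    END {
        # Walk bucket lower bounds in order: 0, 2, 4, 8, ...
        for (lo = 0; lo <= max; lo = (lo == 0 ? 2 : lo * 2))
            if (lo in n) printf "%d -> %d : %d\n", lo, top[lo], n[lo]
    }'
}
```

Each bucket covers twice the range of the previous one, so a single histogram can show both 10-microsecond cache hits and 100-millisecond outliers without losing either end.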

Run queue latency, which measures how long threads wait in the scheduler queue before getting CPU time, is measured with runqlat:

sudo runqlat-bpfcc -p 12345 10 1

High run queue latency means your processes are waiting for CPU, which indicates CPU saturation. If you see latencies above 10 milliseconds during normal operation, you either need more CPU capacity or need to investigate what is consuming CPU.

Function latency measures the execution time of a specific kernel or user-space function:

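sudo funclatency-bpfcc -p 12345 -d 10 'c:malloc'

This traces malloc calls in the libc of process 12345 for 10 seconds and prints a latency histogram, in nanoseconds by default (use -u or -m for microseconds or milliseconds). Unlike the interval-and-count tools above, funclatency takes its duration through the -d flag. Because malloc can be called at very high rates, expect noticeable overhead on allocation-heavy services. You can trace any function that has a symbol in the binary or shared library. For kernel functions:

sudo funclatency-bpfcc -d 10 vfs_read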

For application-level latency tracing with bpftrace, you can measure the time between two probe points:

sudo bpftrace -e '
uprobe:/usr/bin/myapp:process_request { @start[tid] = nsecs; }
uretprobe:/usr/bin/myapp:process_request /@start[tid]/ {
    @latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}
END { clear(@start); }
'

Memory Analysis

Memory issues in production range from gradual leaks that cause OOM kills over days to cache inefficiencies that degrade performance. eBPF provides several tools for each category.

The memleak tool traces memory allocation and free calls, tracking outstanding allocations to identify leaks:

sudo memleak-bpfcc -p 12345 --combined-only 30 1

This attaches to process 12345 and, after one 30-second interval, prints stack traces of allocations that were not freed, along with their total outstanding bytes. Without the trailing count, memleak keeps printing a report every interval until it is interrupted. This is invaluable for catching leaks in long-running services without restarting them with specialized allocator debugging enabled.

For kernel memory allocation tracking:

sudo memleak-bpfcc --combined-only 30 1

Without a pid flag, memleak traces kernel allocations via kmalloc and kfree, which can identify kernel memory leaks caused by drivers or kernel modules.

Cache behavior has an enormous impact on application performance. The cachestat tool prints page cache statistics at a fixed interval:

sudo cachestat-bpfcc 5 6

This prints one summary line every 5 seconds, six times. Output includes hits, misses, pages dirtied, the hit ratio, and the sizes of the buffer and page caches. A persistently high miss ratio suggests that your working set exceeds available memory, and you should consider whether your application needs more RAM or better access patterns.

OOM kills can be traced in real time to understand which processes are being killed and why:

sudo bpftrace -e 'kprobe:oom_kill_process {
    // arg0 is the struct oom_control for this OOM event (type resolved via BTF)
    $oc = (struct oom_control *)arg0;
    printf("OOM kill: pid=%d comm=%s pages=%d\n",
        $oc->chosen->pid, $oc->chosen->comm, $oc->totalpages);
}'

For a simpler approach, the bcc oomkill tool captures this automatically:

sudo oomkill-bpfcc

This prints a line each time the OOM killer is invoked, including details of the triggering process and the killed process.

Network Tracing

Network performance issues are notoriously difficult to diagnose because they involve interactions between the application, the kernel TCP stack, and the network infrastructure. eBPF provides surgical tools for each layer.

The tcplife tool traces TCP session lifetimes, showing when connections are established and closed along with bytes transferred:

sudo tcplife-bpfcc -T

The -T flag adds a time-of-day column to each line (use -t for seconds since the first event). Each line shows the PID, process name, local and remote addresses, ports, bytes sent/received, and duration. This is essential for identifying connection churn, unexpected short-lived connections, or services that hold connections open longer than expected.

TCP retransmissions are a critical indicator of network health:

sudo tcpretrans-bpfcc -l

The -l flag includes tail loss probe attempts. Each event is printed with the time, PID, local and remote addresses and ports, the TCP state, and a type marker that distinguishes retransmits (R) from tail loss probes (L). Clusters of retransmissions to a specific destination indicate network path issues.
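
Spotting clusters by eye is hard on a busy host, so it can help to aggregate saved tcpretrans output by remote address. The sketch below assumes the usual bcc column layout, in which data rows carry the IP version (4 or 6) in the third column and the remote ADDR:PORT in the sixth; check this against the header your version prints. retrans_by_dest is a hypothetical helper name.

```shell
# retrans_by_dest: read tcpretrans output on stdin and print
# "count remote_address" sorted by descending count.
retrans_by_dest() {
    awk '
    # Skip blank lines, banner text, and the header: only data rows
    # have an IP version (4 or 6) in the third column.
    $3 != 4 && $3 != 6 { next }
    {
        dest = $6
        sub(/:[0-9]+$/, "", dest)   # strip the remote port
        hits[dest] += 1
    }
    END { for (d in hits) printf "%d %s\n", hits[d], d }
    ' | sort -k1,1nr -k2,2
}
```

A destination that dominates this list while others stay near zero points at a problem on the path to that host rather than on the local machine.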

Dropped packets at the TCP layer can be traced with tcpdrop:

sudo tcpdrop-bpfcc

Each dropped packet is printed with its kernel stack trace, which tells you exactly why the kernel dropped the packet. Common causes include socket buffer overflow, connection resets, and memory pressure.

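For detailed TCP connection state analysis, you can write a bpftrace script that traces state transitions through the sock:inet_sock_set_state tracepoint, filtered to TCP. The pid reported is whichever task was on-CPU at the transition, which for packet-driven transitions may not be the socket owner:

sudo bpftrace -e '
// protocol 6 is IPPROTO_TCP
tracepoint:sock:inet_sock_set_state /args->protocol == 6/ {
    printf("pid=%d sport=%d dport=%d oldstate=%d newstate=%d\n",
        pid, args->sport, args->dport, args->oldstate, args->newstate);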
}
'

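DNS resolution latency is often an overlooked source of application latency. You can trace DNS lookups at the resolver level by probing getaddrinfo in libc. The library path below is the Debian/Ubuntu x86_64 location, so adjust it for your distribution: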

sudo bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo { @start[tid] = nsecs; }
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo /@start[tid]/ {
    @dns_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}
'

I/O Profiling

Storage I/O is one of the most common sources of production latency, and eBPF provides tools that trace I/O at multiple levels of the storage stack.

The biosnoop tool traces individual block I/O operations with timestamps, latencies, and process attribution:

sudo timeout -s INT 10 biosnoop-bpfcc -d sda

This traces all block I/O to the sda device for 10 seconds, showing each operation's timestamp, process name, PID, disk, operation type, sector, bytes, and latency. The bcc version of biosnoop takes no duration argument, so timeout bounds the run. This is the go-to tool when you need to understand exactly what is generating I/O on a specific disk.

Filesystem-level tracing provides higher-level context. The ext4slower tool traces ext4 operations that exceed a latency threshold:

sudo ext4slower-bpfcc 10

This prints all ext4 operations slower than 10 milliseconds, including reads, writes, opens, and syncs. This immediately identifies slow filesystem operations without requiring you to correlate block-level traces with filesystem metadata.

For general filesystem tracing across all filesystem types, fileslower serves the same purpose:

sudo fileslower-bpfcc 10

Write patterns are critical for understanding how applications interact with storage. As a lightweight complement to the bcc filetop tool, the following bpftrace script counts write syscalls per process and prints the totals every 5 seconds:

sudo bpftrace -e '
tracepoint:syscalls:sys_enter_write {
    @writes[comm, pid] = count();
}
interval:s:5 { print(@writes); clear(@writes); }
'

For fsync-heavy workloads like databases, tracing sync operations can reveal unexpected flush patterns:

sudo bpftrace -e '
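kprobe:vfs_fsync_range { @start[tid] = nsecs; }
kretprobe:vfs_fsync_range /@start[tid]/ {
    // histogram of fsync duration in microseconds, per process name
    @fsync_latency_us[comm] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}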
'

Building Custom bpftrace One-Liners

The real power of eBPF comes from the ability to write custom tracing programs tailored to your specific stack. The bpftrace scripting language makes this accessible to anyone who can write a shell script.

The basic structure of a bpftrace one-liner is a probe specification followed by an action block. Probes can attach to kprobes (kernel functions), uprobes (user-space functions), tracepoints (stable kernel events), and software events.

Count syscalls by process:

sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

Histogram of read sizes by process:

sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
    @read_bytes[comm] = hist(args->ret);
}'

Trace new process creation with full command lines:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
    printf("exec: pid=%d ppid=%d ", pid, ((struct task_struct *)curtask)->real_parent->pid);
    join(args->argv);
}'

Trace file opens, printing each path as it was passed to openat (relative paths are not resolved):

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("open: pid=%d comm=%s file=%s\n", pid, comm, str(args->filename));
}'

Measure the time your application spends in a specific function, aggregated by call stack:

sudo bpftrace -e '
uprobe:/opt/myapp/bin/server:DatabaseQuery { @start[tid] = nsecs; }
uretprobe:/opt/myapp/bin/server:DatabaseQuery /@start[tid]/ {
    @query_ns[ustack(5)] = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}
'

For high-frequency events, use maps to aggregate in-kernel and avoid overwhelming the output buffer:

sudo bpftrace -e '
tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
    @total_bytes[comm] = sum(args->ret);
    @total_calls[comm] = count();
}
interval:s:10 {
    printf("\n--- Top readers (10s window) ---\n");
    print(@total_bytes, 10);
    print(@total_calls, 10);
    clear(@total_bytes);
    clear(@total_calls);
}
'

Continuous Profiling Integration

While ad-hoc profiling with bpftrace and bcc-tools is essential for incident response, continuous profiling provides always-on performance visibility. Two open-source projects lead this space: Parca and Pyroscope.

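Parca uses eBPF to continuously sample stack traces across all processes on a host with minimal overhead. The Parca agent runs as a DaemonSet in Kubernetes or a systemd service on bare metal. Flag names have changed between agent releases, so confirm the ones below against parca-agent --help for your version: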

sudo parca-agent --remote-store-address=parca-server:7070 \
  --node=production-host-01 \
  --sampling-ratio=1.0 \
  --http-address=:7071

The agent automatically discovers all running processes, resolves symbols from debug information and BTF data, and streams profiling data to the Parca server. The server stores profiles in a columnar format optimized for time-series queries and provides a UI for exploring flame graphs, comparing profiles across time windows, and identifying regressions.

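Pyroscope takes a similar approach for eBPF-based CPU profiling and, through its language-specific SDKs, supports additional profile types, including memory allocation, lock contention, and goroutine profiles for Go applications. The agent configuration below is illustrative, and key names differ between Pyroscope versions: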

# pyroscope-agent.yaml
server-address: http://pyroscope-server:4040
log-level: info
targets:
  - service-name: api-server
    spy-name: ebpfspy
    application-name: api-server
  - service-name: worker
    spy-name: ebpfspy
    application-name: worker-pool

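Both tools integrate with Grafana for dashboard creation and alerting. For example, Parca can be registered as a Grafana data source through the HTTP API (the request also needs credentials, such as a service account token, which are omitted here):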

# Grafana data source configuration for Parca
curl -X POST http://grafana:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Parca",
    "type": "parca",
    "url": "http://parca-server:7070",
    "access": "proxy"
  }'

The real value of continuous profiling emerges over time. When a performance regression is deployed, you can compare the current profile against a baseline from before the deployment and immediately see which functions are consuming more CPU. This transforms performance debugging from a reactive investigation into a proactive detection system.

Best Practices and Safety in Production

eBPF's safety guarantees do not mean you can run any eBPF program on any production system without thought. While the verifier prevents kernel crashes, poorly written eBPF programs can still consume excessive CPU, generate overwhelming output, or interfere with system performance.

Always set time limits on bpftrace and bcc-tools. Every profiling session should have an explicit duration:

# Good: explicit 30-second duration
sudo biolatency-bpfcc 30 1

# Dangerous: runs until manually stopped
sudo biolatency-bpfcc
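
Not every tool accepts a duration argument, and bpftrace one-liners run until interrupted, so a generic guard is useful. The following sketch wraps any command in the coreutils timeout utility; run_bounded is a hypothetical helper name, and the 5-second grace period before SIGKILL is an arbitrary choice.

```shell
# run_bounded SECONDS COMMAND [ARGS...]
# Hypothetical helper: run a tracing command with a hard time limit.
run_bounded() {
    if [ "$#" -lt 2 ]; then
        echo "usage: run_bounded SECONDS COMMAND [ARGS...]" >&2
        return 2
    fi
    secs="$1"
    shift
    case "$secs" in
        ''|*[!0-9]*)
            echo "run_bounded: SECONDS must be a whole number" >&2
            return 2
            ;;
    esac
    # SIGINT lets bcc tools and bpftrace print their final summary;
    # -k 5 escalates to SIGKILL if the tool is still running 5 s later.
    timeout -s INT -k 5 "$secs" "$@"
}
```

Because timeout must be able to signal the tool, call the function from a shell that is already running as root (for example, inside a script invoked with sudo) rather than wrapping a sudo command. An exit status of 124 means the limit was reached.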

Be cautious with high-frequency probes. Attaching a kprobe to a function that fires millions of times per second will add measurable overhead even if the eBPF program itself is trivial. Before attaching a probe in production, estimate the frequency:

sudo bpftrace -e 'kprobe:vfs_read { @count = count(); } interval:s:1 { print(@count); clear(@count); }'

If the function fires more than a few hundred thousand times per second, consider whether you can use a tracepoint instead of a kprobe, add a filter to reduce the number of events processed, or sample rather than trace every event.
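
A rough back-of-the-envelope estimate helps with this decision: multiply the measured event rate by an assumed per-event probe cost to get the fraction of one CPU core the probe will consume. The sketch below does that arithmetic; probe_overhead_pct is a hypothetical helper, and the per-event cost you pass in is your own assumption, not a measured value.

```shell
# probe_overhead_pct EVENTS_PER_SEC NS_PER_EVENT
# Hypothetical helper: estimated probe cost as a percentage of one core.
#   percent = events/s * ns/event / 1e9 ns per second * 100
probe_overhead_pct() {
    events_per_sec="$1"
    ns_per_event="$2"
    awk -v r="$events_per_sec" -v c="$ns_per_event" \
        'BEGIN { printf "%.2f\n", r * c / 1e9 * 100 }'
}
```

For instance, probe_overhead_pct 1000000 100 reports 10.00: one million events per second at an assumed 100 ns each costs about a tenth of a core, which is why filtering or sampling matters at that rate.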

Use the bpftrace --dry-run flag, available in recent releases, to check that a script parses, loads, and attaches without leaving it running:

sudo bpftrace --dry-run -e 'kprobe:vfs_read { @[comm] = count(); }'

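Limit the number of keys per map to prevent unbounded memory growth. bpftrace reads this limit from an environment variable rather than a command-line flag; the variable is BPFTRACE_MAP_KEYS_MAX in older releases and BPFTRACE_MAX_MAP_KEYS in newer ones, so check the documentation for your version:

sudo env BPFTRACE_MAP_KEYS_MAX=10000 bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'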

For production systems, establish a runbook of pre-approved eBPF tools and scripts. The bcc-tools collection is widely used and well-tested, which makes it a reasonable default for production, though each tool's overhead still depends on the event rate it traces. Custom bpftrace scripts should be reviewed by the team and tested in staging before production use.

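Consider running eBPF tools from a container so the toolchain does not need to be installed on the host. Note that --privileged, used below for simplicity, is equivalent to full root access; on kernels 5.8 and newer you can experiment with narrower grants such as CAP_BPF and CAP_PERFMON, although many tools still need additional capabilities: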

docker run --rm -it --privileged \
  -v /sys/kernel/debug:/sys/kernel/debug:ro \
  -v /sys/kernel/btf:/sys/kernel/btf:ro \
  -v /proc:/proc:ro \
  quay.io/iovisor/bpftrace:latest \
  bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

Finally, document your profiling workflows. When an incident occurs at 3 AM, you do not want to be writing bpftrace scripts from scratch. Maintain a repository of tested profiling scripts organized by symptom: high CPU, high latency, memory growth, network errors, and I/O saturation. Each script should include comments explaining what it measures, expected overhead, and how to interpret the output. eBPF gives you superpowers in production, but only if you have practiced using them before the emergency arrives.