Perf (Linux)
| perf | |
|---|---|
| Repository | https://github.com/torvalds/linux/tree/master/tools/perf |
| Written in | C |
| Operating system | Linux kernel |
| Type | Performance monitor and testing |
| License | GNU GPL |
| Website | perf |
perf (sometimes called perf_events[1] or perf tools; originally Performance Counters for Linux, PCL)[2] is a performance analyzing tool in Linux, available since Linux kernel version 2.6.31, released in 2009.[3] The userspace controlling utility, named perf, is accessed from the command line and provides a number of subcommands; it is capable of statistical profiling of the entire system (both kernel and userland code).
It supports hardware performance counters, tracepoints, software performance counters (e.g. hrtimer), and dynamic probes (for example, kprobes or uprobes).[4] In 2012, two IBM engineers recognized perf (along with OProfile) as one of the two most commonly used performance counter profiling tools on Linux.[5]
Implementation
The interface between the perf utility and the kernel consists of only one syscall and is done via a file descriptor and a mapped memory region.[6] Unlike LTTng or older versions of OProfile, no service daemons are needed, as most functionality is integrated into the kernel. The perf utility dumps raw data from the mapped buffer to disk when the buffer fills up. According to R. Vitillo (LBNL), profiling with perf incurs very low overhead.[6]
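The one-syscall design can be exercised directly from user space. The following sketch (an illustration, not part of perf itself) uses Python's ctypes to invoke perf_event_open with a minimal perf_event_attr and read back an instruction count; the syscall numbers and 64-byte VER0 struct layout follow the perf_event_open(2) man page, and the function returns None whenever the architecture, kernel, or perf_event_paranoid setting refuses access.

```python
import ctypes, os, platform, struct

# Constants from <linux/perf_event.h>
PERF_TYPE_HARDWARE = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
ATTR_SIZE_VER0 = 64  # size of the original struct perf_event_attr

# perf_event_open has no glibc wrapper; the syscall number is per-architecture.
SYSCALL_NR = {"x86_64": 298, "aarch64": 241}

def count_instructions(workload):
    """Count retired user-space instructions across workload(); None on failure."""
    nr = SYSCALL_NR.get(platform.machine())
    if nr is None:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    # Flag bits in the attr bitfield: disabled=1<<0, exclude_kernel=1<<5,
    # exclude_hv=1<<6 (counter starts stopped and measures user space only).
    flags = (1 << 0) | (1 << 5) | (1 << 6)
    attr = struct.pack("=IIQQQQQIIQ",
                       PERF_TYPE_HARDWARE, ATTR_SIZE_VER0,
                       PERF_COUNT_HW_INSTRUCTIONS,
                       0, 0, 0, flags, 0, 0, 0)
    buf = ctypes.create_string_buffer(attr, len(attr))
    # perf_event_open(attr, pid=0 (self), cpu=-1 (any), group_fd=-1, flags=0)
    fd = libc.syscall(nr, buf, 0, -1, -1, 0)
    if fd < 0:
        return None  # e.g. EACCES under a strict perf_event_paranoid setting
    try:
        libc.ioctl(fd, 0x2400, 0)   # PERF_EVENT_IOC_ENABLE
        workload()
        libc.ioctl(fd, 0x2401, 0)   # PERF_EVENT_IOC_DISABLE
        return int.from_bytes(os.read(fd, 8), "little")
    finally:
        os.close(fd)

if __name__ == "__main__":
    print("instructions:", count_instructions(lambda: sum(range(100_000))))
```

The file descriptor returned by the syscall is the only kernel handle the tool needs; perf itself additionally mmap(2)s a ring buffer on that descriptor for sampled data.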
As of 2010, architectures that provide support for hardware counters include x86, PowerPC64, UltraSPARC (III and IV), ARM (v5, v6, v7, Cortex-A8 and -A9), Alpha EV56 and SuperH.[4] Usage of Last Branch Records,[7] a branch tracing implementation available in Intel CPUs since Pentium 4, is available as a patch.[6] Since version 3.14 of the Linux kernel mainline, released on 31 March 2014, perf also supports running average power limit (RAPL) for power consumption measurements, which is available as a feature of certain Intel CPUs.[8][9][10]
Perf is natively supported in many popular Linux distributions, including Red Hat Enterprise Linux (since its version 6 released in 2010)[11] and Debian in the linux-tools-common package (since Debian 6.0 (Squeeze) released in 2011).[12]
Subcommands
perf is used with several subcommands:
- stat: measure total event count for a single program or for the system over some time
- top: top-like dynamic view of hottest functions
- record: measure and save sampling data for a single program[13]
- report: analyze a file generated by perf record; can generate a flat or graph profile[13]
- annotate: annotate sources or assembly
- sched: tracing/measuring of scheduler actions and latencies[14]
- list: list available events
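Driving these subcommands from a script is straightforward. A minimal sketch (the wrapper function and default event list are illustrative choices, not part of perf):

```python
import shutil, subprocess

def perf_stat_argv(cmd, events=("cycles", "instructions", "cache-misses")):
    """Build an argv that counts the given events while `cmd` runs."""
    return ["perf", "stat", "-e", ",".join(events), "--", *cmd]

if __name__ == "__main__":
    argv = perf_stat_argv(["sleep", "0.1"])
    print(" ".join(argv))
    if shutil.which("perf"):          # perf may not be installed or permitted
        subprocess.run(argv, check=False)
```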
Criticism
The documentation of perf is not very detailed (as of 2014); for example, it does not document most events or explain their aliases (external tools are often used to obtain names and codes of events[15]).[16] Perf also cannot profile based on true wall-clock time,[16] something that has since been addressed by the addition of off-CPU profiling.
Security
The perf subsystem of Linux kernels from 2.6.37 up to 3.8.8, as well as the RHEL 6 kernel 2.6.32, contained a security vulnerability (CVE-2013-2094) that was exploited to gain root privileges by a local user.[17][18] The problem was due to an incorrect type (a 32-bit int instead of a 64-bit one) being used in the event_id verification code path.[19]
References
- ^ Vince Weaver, The Unofficial Linux Perf Events Web-Page
- ^ Linux perf event Features and Overhead // 2013 FastPath Workshop, Vince Weaver
- ^ Jake Edge, Perfcounters added to the mainline, LWN July 1, 2009, "perfcounters being included into the mainline during the recently completed 2.6.31 merge window"
- ^ a b Arnaldo Carvalho de Melo, The New Linux ’perf’ tools, presentation from Linux Kongress, September, 2010
- ^ A. Zanella, R. Arnold. Evaluate performance for Linux on POWER. Analyze performance using Linux tools, 12 Jun 2012 // IBM DeveloperWorks Technical library
- ^ a b c Roberto A. Vitillo (LBNL). Performance Tools Developments, 16 June 2011, presentation from "Future computing in particle physics" conference
- ^ Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide, Part 2. Intel. June 2009. p. 19-2 vol. 3.
- ^ Jake Edge (2014-04-09). "Lots of new perf features". LWN.net. Retrieved 2014-04-22.
- ^ Jacob Pan (2013-04-02). "RAPL (Running Average Power Limit) driver". LWN.net. Retrieved 2014-04-22.
- ^ "kernel/git/torvalds/linux.git - Linux kernel source tree". Git.kernel.org. 2014-01-20. Retrieved 2014-03-31.
- ^ 6.4. Performance Counters for Linux (PCL) Tools and perf // RHEL Developer Guide
- ^ "Debian - Details of package linux-tools-2.6.32 in squeeze". Packages.debian.org. Retrieved 2014-03-31.
- ^ a b Urs Fässler perf file format Archived 2012-12-14 at the Wayback Machine, CERN openlab, 2011
- ^ Ingo Molnar, 'perf sched': Utility to capture, measure and analyze scheduler latencies and behavior, 17 Sep 2009
- ^ How to monitor the full range of CPU performance events // Bojan Nikolic, 2012
- ^ a b Robert Haas (PostgreSQL), perf: the good, the bad, the ugly // 6 June 2012
- ^ Michael Larabel (2013-05-15). "New Linux Kernel Vulnerability Exploited". Phoronix.
- ^ corbet (2013-05-15). "Local root vulnerability in the kernel". LWN.
- ^ Joe Damato (2013-05-20). "A closer look at a recent privilege escalation bug in Linux (CVE-2013-2094)".
External links
- perf's wiki on kernel.org
- Arnaldo Carvalho de Melo, The New Linux ’perf’ tools, presentation from Linux Kongress, September, 2010
- Hardware PMU support charts – check perf_event column
- perf Examples by Brendan Gregg
The perf command serves as a modular interface, supporting subcommands such as perf stat for event counting, perf record for sampling profiles into data files, perf report for visualizing results, and perf top for real-time monitoring, all leveraging the kernel's perf_events API for low-overhead data collection.[1] Over time, perf has evolved to include advanced features such as scripting support in Python and Perl, integration with eBPF for custom probes, and architecture-specific extensions, making it a standard tool for kernel developers, system administrators, and performance engineers.[1] Enabled via the CONFIG_PERF_EVENTS kernel option, perf requires appropriate privileges (e.g., root or membership in the perf_users group) to access sensitive hardware counters and prevent information leaks.[3]
Overview
Definition and Purpose
Perf is a command-line performance analysis tool integrated into the Linux kernel, enabling the profiling of CPU, memory, I/O, and other system events through hardware performance monitoring units (PMUs), software counters, tracepoints, and dynamic tracing mechanisms such as kprobes and uprobes.[4][5] It provides a unified interface to the perf_events kernel subsystem, allowing users to collect and analyze data on hardware-level events like instruction executions, cache misses, and branch predictions, as well as software-level events for broader system observability.[6][5]
The primary purposes of perf include identifying performance bottlenecks in applications and the kernel, measuring overall system and workload efficiency, and facilitating kernel-level observability to debug and optimize resource utilization.[4][6] By supporting sampling-based profiling, it enables detailed insights into event frequencies and hotspots without requiring invasive instrumentation, thus aiding developers and system administrators in enhancing software performance across diverse workloads.[5][4]
Originating from the Performance Counters for Linux (PCL) project, perf has evolved from a basic framework for hardware counter access into a comprehensive suite that encompasses both sampling and tracing functionalities, with ongoing development within the Linux kernel source tree.[6][4] Key benefits of perf include its low-overhead sampling approach, which minimizes perturbation to the system being analyzed, support for multiple architectures such as x86, ARM, and PowerPC, and extensibility through dynamic probes and scripting capabilities that allow customization for specific analysis needs.[6][4][7][8]
Historical Development
The perf_events subsystem originated in 2009 as a unified interface for accessing hardware performance counters in the Linux kernel, developed primarily by Thomas Gleixner and Ingo Molnar in response to earlier proposals like perfmon2. This effort addressed the fragmentation in performance monitoring tools by providing a standardized kernel API that supported sampling, counting, and multiplexing of events across diverse architectures. The subsystem was integrated into the mainline Linux kernel with version 2.6.31, released in September 2009, marking the debut of the associated userspace tool known initially as "perf".[9][10] Prior to mainline inclusion, the project was known as Performance Counters for Linux (PCL), but it underwent a significant rename to perf_events to better reflect its expanded scope beyond mere counters to a broader performance events framework. This renaming occurred in September 2009, just before the 2.6.31 release, and facilitated its acceptance by emphasizing compatibility with existing kernel tracing infrastructure. Key early contributions came from core kernel developers, including patches for event handling and syscall integration, which laid the foundation for subsequent enhancements.[11][12] Major enhancements followed rapidly, with dynamic Performance Monitoring Unit (PMU) support added in December 2010 through kernel commit 2e80a82a, enabling runtime registration of PMU types for greater flexibility across hardware vendors. The integration of Berkeley Packet Filter (BPF) capabilities began in the early 2010s, with initial tracing-related features like uprobes merged in kernel 3.5 (2012), paving the way for programmable event processing. 
By 2015, extended BPF (eBPF) enhancements allowed custom tracing programs to output data directly to perf events, revolutionizing kernel observability by enabling safe, efficient user-defined probes and summaries without modifying kernel code.[13]
In the 2020s, development shifted toward scalability for cloud environments and multi-core systems, with optimizations for high-core-count processors and distributed tracing to handle the demands of modern datacenters. Brendan Gregg, a prominent kernel tracing expert, contributed extensively to these areas through tools like bpftrace and extensions integrating perf with eBPF for advanced observability. As of late 2024, updates in Linux kernel 6.12 included perf improvements for emerging hardware, such as enhanced PMU support for Intel Lunar Lake and Arrow Lake processors, enabling better profiling for compute-intensive workloads. Development has continued into 2025, with kernel 6.17 (released September 2025) incorporating further refinements to perf_events for ongoing hardware support and observability features.[14][15]
Architecture
Kernel Infrastructure
The perf_events subsystem forms the core kernel framework in Linux for accessing hardware performance monitoring units (PMUs) and software events, enabling the collection of performance data through a unified interface. Introduced in Linux kernel version 2.6.31, it abstracts the differences between various CPU architectures and PMU implementations, allowing tools to monitor events like CPU cycles, cache misses, and system calls without direct hardware-specific programming.[16] This subsystem manages event allocation, scheduling, and data delivery, ensuring compatibility across x86, ARM, and other architectures.[16]
The primary kernel interface is the perf_event_open(2) system call, which creates a file descriptor for an event session and supports parameters such as pid (process ID, e.g., -1 for system-wide monitoring), cpu (target CPU, e.g., -1 for any available CPU), and config (event-specific settings, e.g., PERF_COUNT_HW_CPU_CYCLES for hardware CPU cycle counting).[16] Additional parameters include type (e.g., PERF_TYPE_HARDWARE for PMU events or PERF_TYPE_SOFTWARE for kernel software events) and group_fd (for event grouping, e.g., -1 to create a new leader event).[16] This syscall handles event configuration via the perf_event_attr structure, which specifies sampling periods, frequency modes, and inheritance options, with capabilities evolving across kernel versions (e.g., dynamic PMU support added in 2.6.38).[16]
Support for hardware counters is provided through PMU drivers, which expose vendor-specific features such as Intel's Precise Event-Based Sampling (PEBS) for low-overhead, precise instruction-level profiling and AMD's Instruction-Based Sampling (IBS) for detailed fetch and op execution analysis.[16] Software events, in contrast, are kernel-generated counters like page faults (PERF_COUNT_SW_PAGE_FAULTS) and context switches (PERF_COUNT_SW_CONTEXT_SWITCHES), offering insights into system behavior without relying on hardware.[16] Both types integrate seamlessly, with hardware events leveraging PMU capabilities for high-precision timing and software events providing aggregated kernel statistics.[16]
Data capture occurs via a ring buffer mechanism, implemented using mmap(2) to map kernel pages into user space for low-latency transfer of sampled events, with metadata tracked in the perf_event_mmap_page structure including head and tail pointers for producer-consumer synchronization.[17] This design minimizes overhead by allowing asynchronous reads and overflow handling through signals or polling, while multiplexing enables rotation among events when hardware counters are limited (e.g., via time_enabled and time_running fields to normalize counts).[16][17]
Scalability is enhanced by per-CPU buffers, where events can be bound to specific CPUs (cpu >= 0) for system-wide collection on multi-core systems, reducing contention and enabling parallel data gathering across processors.[17] Group events further support this by allowing multiple related counters (e.g., cycles and instructions) to be scheduled atomically as a unit, ensuring correlated sampling and synchronized enabling/disabling via ioctls like PERF_EVENT_IOC_ENABLE, which is critical for accurate ratio computations in performance analysis.[16]
Userspace Components
The primary userspace component of perf is the perf binary, which serves as the command-line interface for performance analysis and monitoring. This executable is built from the tools/perf directory within the Linux kernel source tree, enabling users to interact with kernel performance events through syscalls like perf_event_open.[18] In most Linux distributions, the perf binary is distributed as part of the linux-tools package family; for instance, on Debian-based systems such as Ubuntu, it can be installed via apt install linux-tools-common linux-tools-generic linux-tools-$(uname -r), ensuring compatibility with the running kernel.[19] The version of perf is typically aligned with the corresponding kernel version, for example, perf 6.12 accompanies Linux kernel 6.12, to maintain feature parity and avoid ABI mismatches.[20]
perf's userspace functionality is supported by several key libraries that handle specialized tasks such as event management, trace processing, and symbol resolution. The libperf library, located in tools/lib/perf, provides a high-level C API for accessing the kernel's perf events subsystem, including functions for opening events (perf_evsel__open), reading samples (perf_evsel__read), and memory-mapping buffers (perf_evsel__mmap), abstracting low-level syscall details for developers building custom tools.[21] libtraceevent, a separate library maintained under the kernel's libtrace umbrella, is essential for parsing and processing kernel trace events, enabling perf to decode raw trace data into human-readable formats during analysis. Additionally, libdw from the elfutils package facilitates DWARF-based unwinding of call stacks, allowing perf to resolve symbols and generate accurate stack traces from sampled data without requiring frame pointers in binaries.
Building perf from source requires specific dependencies to compile its userspace components fully. Kernel headers must be installed to access performance event definitions and syscall interfaces, while the elfutils development package (providing libelf for ELF file handling and libdw for debugging support) is mandatory for features like symbol resolution and unwinding.[22] Optional dependencies include Python development headers for enabling scripting capabilities, such as custom event processing scripts. The build process involves navigating to the tools/perf directory in the kernel source and running make, which configures and compiles the binary along with embedded libraries like libperf.[20]
perf's design emphasizes extensibility in userspace, allowing customization beyond its core features. Plugins can be developed to support alternative output formats, such as integrating with visualization tools or exporting data in proprietary schemas, by leveraging the plugin API in the perf build system. Furthermore, the perf script subcommand facilitates scripting extensibility, enabling users to process recorded event streams with custom Python or Perl scripts for tailored analysis, such as filtering events or generating reports.[23] This modular approach, combined with the libraries' APIs, supports integration into larger profiling workflows while maintaining a lightweight footprint.
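A standalone post-processing script in the spirit of this extensibility might, for example, aggregate sampled periods per command. The sketch below parses textual perf script output rather than using the in-process Python scripting API, and the assumed line shape (comm, pid, [cpu], time, period, event) is an approximation that varies across perf versions and --fields options:

```python
import re
from collections import Counter

# One common shape of a `perf script` sample line (an assumption, not a spec):
#   comm  pid [cpu]  time:  period event-name: ...
SAMPLE_RE = re.compile(
    r"^\s*(?P<comm>\S+)\s+(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<time>[\d.]+):\s+(?P<period>\d+)\s+(?P<event>[\w:-]+):"
)

def events_by_comm(lines):
    """Sum sampled periods per command name from perf-script-style text."""
    totals = Counter()
    for line in lines:
        m = SAMPLE_RE.match(line)
        if m:
            totals[m.group("comm")] += int(m.group("period"))
    return totals

# Hand-written demo lines standing in for real `perf script` output:
demo = [
    "sleep  4242 [001]  100.000100:  250000 cycles: ffffffff8100 foo ([kernel])",
    "sleep  4242 [001]  100.000350:  250000 cycles: ffffffff8200 bar ([kernel])",
    "bash   4100 [000]  100.000400:  125000 cycles: 000055d0 main (/bin/bash)",
]
print(dict(events_by_comm(demo)))
```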
Core Functionality
Event Monitoring and Sampling
Perf supports a variety of event types for monitoring system and application behavior. Hardware events, accessed via the PERF_TYPE_HARDWARE type, capture low-level CPU metrics such as cycles executed, instructions retired, cache references and misses, branch instructions and misses, bus cycles, and stalled cycles.[24] These events leverage on-chip performance monitoring units (PMUs) to provide direct hardware counters without significant software intervention. Software events, defined under PERF_TYPE_SOFTWARE, track kernel-level occurrences including CPU clock ticks, task clock time, page faults (major and minor), context switches, and CPU migrations.[24] Tracepoint events, using PERF_TYPE_TRACEPOINT, interface with static kernel probes to observe specific kernel subsystems, such as system calls (e.g., entry and exit points for functions like execve), scheduling decisions, and block I/O operations, by referencing IDs from the debugfs tracing/events hierarchy.[24] Additionally, hardware breakpoints, enabled through PERF_TYPE_BREAKPOINT since Linux 2.6.33, allow monitoring of read/write accesses or instruction execution at specific addresses using CPU debug registers.[24]
Sampling in perf operates in modes that balance precision, overhead, and data granularity. Periodic sampling collects data at fixed intervals, either every N events (via sample_period) or at a target frequency (via sample_freq with the freq flag enabled), where the kernel adjusts dynamically to approximate the desired rate.[24] In contrast, precise sampling minimizes "skid", the displacement between the sampled event and the recorded instruction pointer, using hardware features like Intel's Precise Event-Based Sampling (PEBS).
PEBS, available on x86 when precise_ip is set to 1, 2, or 3, captures the exact instruction pointer and register state at the moment of event retirement, reducing uncertainty in attribution compared to standard interrupt-based sampling (precise_ip=0).[24] The precise_ip level specifies the requested precision: level 1 for skid-avoiding sampling, level 2 for maximally skid-free where possible, and level 3 mandating zero skid or disabling the event.[24]
When the performance counter overflows, reaching the configured sample_period or frequency threshold, perf triggers an interrupt to handle the sample. This interrupt-based mechanism notifies userspace via poll(2), select(2), or signals, writing the sample data (including timestamp, PID, CPU, and event value) to a mmap(2)-ed buffer.[24] For deeper analysis, call stack capture can be enabled with the PERF_SAMPLE_CALLCHAIN bit in sample_type, recording the user or kernel stack backtrace up to a configurable depth limited by /proc/sys/kernel/perf_event_max_stack (default 127 frames since Linux 4.8).[24] Stack unwinding during overflow processing incurs additional CPU cost, particularly for user-space frames requiring frame pointer or DWARF-based reconstruction.[25]
Aggregation in perf allows tailoring monitoring scope to specific contexts. CPU-wide aggregation (pid=-1, cpu>=0) captures events across all processes on a designated CPU, requiring elevated privileges like CAP_PERFMON or CAP_SYS_ADMIN.[24] Process-specific monitoring targets a single process (pid>0, cpu=-1 for any CPU or cpu>=0 for a specific one), while thread-level granularity follows individual threads by their task IDs.[24] This enables focused data collection without system-wide noise, though multiplexing may occur if hardware counters are exhausted.
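A back-of-the-envelope sketch of these sampling trade-offs, with all figures as illustrative assumptions rather than measurements:

```python
def expected_samples(total_events, sample_period):
    """Samples produced when the counter overflows every `sample_period` events."""
    return total_events // sample_period

def unwind_overhead_fraction(sample_freq_hz, stack_depth, cycles_per_frame,
                             cpu_hz=3_000_000_000):
    """Fraction of one core spent unwinding stacks, assuming a fixed per-frame cost."""
    return sample_freq_hz * stack_depth * cycles_per_frame / cpu_hz

# One second of a 3 GHz core sampled on cycles with period 250_000:
print(expected_samples(3_000_000_000, 250_000))   # → 12000 samples
# 4 kHz sampling, 64-deep stacks, an assumed ~100 cycles per unwound frame:
print(f"{unwind_overhead_fraction(4000, 64, 100):.2%}")
```

Halving the frequency or the captured stack depth halves the estimated unwind cost, which is why lower frequencies and shallower call chains are the usual first mitigation.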
The number of samples perf produces equals the total number of events observed divided by the sampling period, i.e. samples = total_events / sample_period.[16] For overhead estimation, a rough approximation multiplies the sampling frequency, the maximum call stack depth, and the per-frame unwind cost (typically measured in cycles for stack walking): overhead ≈ sampling_freq × stack_depth × unwind_cost. This highlights how higher precision and deeper traces amplify intrusion.[26]
Key Subcommands
perf provides a suite of subcommands for performance analysis, each tailored to specific aspects of event monitoring and data handling. These tools leverage the underlying perf_events kernel interface to capture and interpret hardware and software events efficiently.[5] perf stat counts performance events over the duration of a workload, delivering aggregate statistics such as instructions executed, cycles, and derived metrics like instructions per cycle (IPC), which measures CPU efficiency. It supports system-wide or process-specific collection with options for event selection and detailed breakdowns, enabling quick assessment of basic performance characteristics without generating large data files.[27] perf record samples performance data into a file for offline analysis, capturing events like CPU cycles or cache misses at specified frequencies. A key option, -g (long form --call-graph), enables recording of call graphs to trace function call stacks in both kernel and user space, facilitating deeper profiling of code paths. This subcommand is essential for workloads requiring post-execution examination.[28]
perf report serves as an interactive viewer for data recorded by perf record, presenting hierarchical profiles sorted by overhead and allowing navigation through call graphs. Users can filter by symbols or apply thresholds like --percent-limit to focus on entries exceeding a specified overhead percentage, aiding in identification of bottlenecks. It supports sorting by various criteria, such as CPU time or memory latency.[29]
perf list enumerates all available performance events and performance monitoring units (PMUs), including hardware, software, cache, and tracepoint types. It displays symbolic names, raw encodings, and PMU-specific details, helping users select appropriate events with modifiers like precise sampling levels. This subcommand is crucial for discovering configurable monitoring capabilities on a given system.[30]
perf top offers real-time monitoring akin to the top utility, continuously sampling and displaying profiles of hot functions ordered by overhead. It updates dynamically to show current system or process activity, with options for event selection and PID targeting, providing immediate insights into performance hotspots without file I/O.[31]
Among other essential subcommands, perf script exports raw trace data from perf records for custom processing or scripting, using options like --dump-raw-trace to output verbose event details in a format suitable for further analysis tools. perf mem specializes in memory access profiling, recording and reporting load/store operations with support for latency analysis on platforms like Intel and ARM, using options such as -t to specify trace types. These extend perf's versatility for targeted investigations.[23][32]
Advanced Usage
Profiling Techniques
Perf provides several profiling techniques to identify performance bottlenecks in applications and the system, leveraging its sampling and tracing capabilities for detailed analysis. One common approach is CPU profiling, which captures instruction-level hotspots by sampling hardware events such as CPU cycles. To perform this, users execute perf record -e cycles -g to record samples at cycle events with call-graph information enabled via the -g flag, producing a perf.data file that includes stack traces for hotspots.[33] Subsequent analysis with perf report visualizes the data hierarchically, often piped to tools for flame graphs that illustrate call stack depths and frequencies, highlighting functions consuming the most cycles.[34] This method is particularly effective for pinpointing compute-intensive code paths in user-space applications or kernel routines.
For memory analysis, perf employs specialized sampling to examine access patterns and latencies. The perf mem record command captures load and store events, recording details like memory addresses and latencies to identify cache misses and bandwidth issues.[35] Analysis via perf mem report aggregates this data, showing distributions of access types (e.g., L1/L2 cache hits, DRAM accesses) and functions responsible for high-latency operations, such as those causing frequent last-level cache misses.[33] This technique helps diagnose memory-bound workloads by quantifying stall cycles due to data movement, guiding optimizations like data locality improvements.
I/O tracing in perf focuses on block device interactions to uncover disk bottlenecks. By recording tracepoints such as block:* (e.g., perf record -e block:*), users capture events like request issues, completions, and latencies for read/write operations.[36] The resulting traces, viewed with perf script or perf report, reveal per-request details including process IDs, byte counts, and queue depths, enabling identification of I/O-intensive processes or suboptimal access patterns like random seeks over sequential reads.[28]
Distinguishing kernel and user-space contributions requires accurate stack unwinding, often achieved with the --call-graph dwarf option in perf record. This uses DWARF debugging information to reconstruct full call stacks across boundaries, capturing transitions like syscalls without relying on frame pointers, which may be absent in optimized builds.[33] It ensures profiles show complete paths, such as user-space functions invoking kernel I/O routines, providing context for mixed-mode performance issues.[28]
Best practices for effective profiling begin with perf stat for quick, non-intrusive metrics like total cycles or cache miss rates over short runs, establishing baselines before deeper sampling.[33] For comprehensive investigations, scale to perf record with targeted events, adjusting sampling periods (e.g., via -F for frequency) to balance overhead and resolution. To mitigate noisy profiles from system variability, conduct multiple runs and aggregate results, using statistical methods to compute averages and confidence intervals for stable hotspot identification.[33] Always profile under representative workloads to ensure relevance, starting broad and narrowing to specific events as insights emerge.
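The multiple-run aggregation advice can be sketched as follows; the run values are invented for illustration, and the normal-approximation confidence interval is one simple choice of statistical method:

```python
from statistics import mean, stdev

def summarize_runs(samples, z=1.96):
    """Mean and normal-approximation 95% CI half-width for repeated measurements."""
    m = mean(samples)
    half = z * stdev(samples) / len(samples) ** 0.5
    return m, half

# e.g. cache-miss rates (%) from five repeated `perf stat` runs of one workload:
runs = [3.1, 2.9, 3.3, 3.0, 3.2]
m, half = summarize_runs(runs)
print(f"{m:.2f}% ± {half:.2f}%")
```

If the interval is wide relative to the effect being chased, more runs (or a quieter machine) are needed before trusting any hotspot ranking derived from the data.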
Integration with Other Tools
Perf synergizes with eBPF through the BPF Compiler Collection (BCC), a toolkit that enables the creation of custom eBPF probes and scripts for kernel and user-space tracing.[37] The perf trace subcommand can capture system-wide events, which BCC tools extend by attaching eBPF programs to kernel tracepoints, kprobes, or uprobes for dynamic instrumentation without kernel recompilation.[38] For instance, BCC's Python-based tools, such as execsnoop or biolatency, leverage eBPF to script complex traces that build on perf's event sampling, allowing users to filter and aggregate data in-kernel for low-overhead observability.[38]
For visualization, perf exports sampled stack traces via perf script, which can be processed into formats compatible with external tools like Flame Graphs. The process involves collapsing stacks with scripts from the FlameGraph repository—e.g., perf script | ./stackcollapse-perf.pl > out.folded followed by ./flamegraph.pl out.folded > profile.svg—to generate interactive SVGs highlighting CPU hotspots.[39][40] Similarly, perf report --stdio outputs text-based profiles that, when piped through perf script with fields like timestamps and symbols, can be imported into speedscope, a web-based viewer for interactive flame graph analysis of perf data.[41][42]
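The stack-collapsing step performed by stackcollapse-perf.pl can be approximated in a few lines. This sketch assumes the common perf script call-graph layout (a header line per sample, indented innermost-first frame lines, blank-line separators) and keeps only symbol names; real perf output has more variation than the demo text shown here:

```python
from collections import Counter

def collapse_stacks(perf_script_text):
    """Fold perf-script call-graph output into FlameGraph's 'a;b;c count' lines."""
    folded = Counter()
    frames, comm = [], None

    def flush():
        if comm and frames:
            folded[";".join([comm] + frames[::-1])] += 1  # root frame first

    for line in perf_script_text.splitlines():
        if not line.strip():                     # blank line ends one sample
            flush()
            frames, comm = [], None
        elif line.startswith((" ", "\t")):       # indented frame line
            parts = line.split()
            if len(parts) >= 2:
                frames.append(parts[1])          # keep the symbol name only
        else:                                    # sample header line
            comm = line.split()[0]
    flush()                                      # last sample may lack a blank line
    return ["%s %d" % (stack, n) for stack, n in sorted(folded.items())]

# Hand-written demo standing in for `perf script` output of two samples:
demo = """\
myapp 1234 100.0001: 250000 cycles:
\t5a1 hot_loop (/usr/bin/myapp)
\t5b2 main (/usr/bin/myapp)

myapp 1234 100.0003: 250000 cycles:
\t5a1 hot_loop (/usr/bin/myapp)
\t5b2 main (/usr/bin/myapp)
"""
for line in collapse_stacks(demo):
    print(line)                                  # → myapp;main;hot_loop 2
```

Each output line is one complete root-to-leaf stack with its sample count, which is exactly the folded format flamegraph.pl turns into an SVG.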
In debugging scenarios, perf interfaces with GDB by generating symbolic traces via perf script --fields sym,cpu, which provide stack frames, program counters, and CPU details for post-mortem analysis in GDB sessions.[43] This allows correlating perf's sampled events with GDB's disassembly and variable inspection for deeper code-level insights. For kernel tracing, perf integrates with ftrace through the perf ftrace subcommand, a wrapper that reads from /sys/kernel/debug/tracing/trace_pipe to capture function graphs, latencies, and profiles while supporting eBPF filters for targeted events.[44]
Perf supports containerized environments like Docker by requiring elevated capabilities such as --cap-add SYS_ADMIN (or --cap-add PERFMON on newer kernels) to access performance monitoring units (PMUs) and tracepoints inside isolated namespaces.[45] This enables low-level profiling of container workloads without full --privileged mode, complementing higher-level monitoring tools like cAdvisor, which aggregates cgroup-based metrics (e.g., CPU and memory usage) for Prometheus export, allowing combined views of container performance from coarse-grained resource stats to fine-grained perf traces.[46][47]
As of 2025, perf has enhanced synergy with Rust-based eBPF loaders, such as those using the aya-rs framework, which compile Rust code to eBPF bytecode for loading via libbpf and integration with perf's tracing ecosystem for safer, memory-safe kernel probes.[48][49] This allows developers to author custom eBPF programs in Rust that attach to perf events, improving observability in high-performance scenarios.
Concerns and Limitations
Performance Overhead and Criticisms
The primary sources of performance overhead in perf stem from sampling interrupts, which can consume 1-5% of CPU cycles in typical configurations, escalating to 14% or higher with multiple event instances. Ring buffer copies further contribute by transferring sampled data from kernel to userspace, adding latency during high-volume event capture, while stack unwinding (particularly for deep call stacks using frame pointers or DWARF debugging information) can introduce additional costs. These overheads arise during event monitoring, where interrupt handling and data processing interrupt normal execution flow, as seen in event sampling mechanics.
Criticisms of perf often highlight its steep learning curve, attributed to the extensive array of options and subcommands that require familiarity with kernel internals and hardware specifics to use effectively. Cross-architecture support remains incomplete, with notably weaker performance monitoring capabilities on RISC-V platforms until improvements in 2024 via SBI PMU and Sscofpmf extensions. Additionally, sampling can yield misleading results for short workloads, as inconsistent sample sizes fail to capture representative event distributions despite fixed frequencies.
To mitigate these overheads, users are advised to prioritize hardware performance monitoring unit (PMU) events over software approximations, which reduce kernel intervention and associated costs. Limiting sampling frequency to 1-2 kHz, below the default 4 kHz, balances detail against overhead by decreasing interrupt rates.
Community discussions reveal debates over perceived bloat from perf's growing number of subcommands, which expand functionality but complicate the toolset for casual users. There have been calls for improved defaults, with kernel 6.10 introducing enhancements like better event subsystem features to streamline usage without custom tuning.
As of kernel 6.12 (2025), PMU support enhancements continue for architectures such as RISC-V.[50]

Security implications
The security model of perf in Linux is designed to mitigate risks associated with performance monitoring, which can potentially expose sensitive system information. Access to perf_events is primarily controlled through kernel capabilities and the perf_event_paranoid parameter, which governs unprivileged user access. Processes with the CAP_SYS_ADMIN capability can bypass all restrictions, enabling full system-wide monitoring, though this is considered overly permissive for security-conscious environments.[51] Alternatively, the CAP_PERFMON capability, introduced in Linux kernel 5.8, provides a more targeted privilege for performance monitoring without the broader scope of CAP_SYS_ADMIN.[51] The perf_event_paranoid sysctl tunable offers four levels (-1 to 2) to restrict access: -1 imposes no limits; 0 allows system-wide monitoring excluding raw tracepoints; 1 limits to per-process events including kernel space; and 2 (the default) restricts to per-process user-space events only.[51] These mechanisms ensure that unprivileged users cannot monitor arbitrary processes or access kernel internals without explicit authorization.
Key risks in using perf stem from its ability to observe hardware performance counters and tracepoints, which can enable side-channel attacks leaking sensitive data such as memory addresses, execution contexts, or process behaviors.[51] For instance, unauthorized monitoring of other processes could reveal timing information exploitable for inferring cryptographic keys or private data, while improper configuration might allow information leaks via performance-counter side effects.[51] To address unauthorized process monitoring, perf enforces ptrace-like scoping: access is limited to processes under the same user ID or those attachable under ptrace rules, preventing cross-user surveillance without elevated privileges.[16]
Since Linux kernel 5.8, group-based access has been facilitated through the perf_users group, allowing non-root users to perform monitoring by assigning CAP_PERFMON (and related capabilities like CAP_SYS_PTRACE for older kernels) to the perf binary via file capabilities.[51] Administrators can create this group with groupadd perf_users, set ownership with chgrp perf_users /usr/bin/perf, restrict permissions with chmod o-rwx /usr/bin/perf, and apply capabilities using setcap "cap_perfmon,cap_sys_ptrace=ep" /usr/bin/perf.[51] This setup enables scoped, non-root usage while maintaining isolation from full administrative access.
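Put together, the setup described above amounts to the following administrative session, a sketch requiring root in which the perf path and the user name alice are illustrative:

```shell
groupadd perf_users                       # create the monitoring group
chgrp perf_users /usr/bin/perf            # hand the binary to that group
chmod o-rwx /usr/bin/perf                 # lock out all other users
# CAP_PERFMON exists since kernel 5.8; cap_sys_ptrace covers older kernels
setcap "cap_perfmon,cap_sys_ptrace=ep" /usr/bin/perf
usermod -aG perf_users alice              # grant a user scoped access
getcap /usr/bin/perf                      # verify the file capabilities
```

Because the capabilities are attached to the binary rather than the user, members of perf_users gain monitoring privileges only when running that specific perf executable.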
In modern kernels as of 2025, security has been further bolstered by enhanced SELinux policies that include specific access controls for the perf_event class, such as watch permissions for monitoring events and attaching eBPF programs, preventing unauthorized syscall invocations.[52] Additionally, integration with eBPF leverages the kernel's verifier for sandboxing, which statically analyzes and limits eBPF programs attached to tracepoints, restricting them to safe operations and mitigating risks from malicious or erroneous tracing code.
Best practices for securing perf include setting kernel.perf_event_paranoid=2 in production environments to limit exposure, as this balances usability with protection against broad monitoring.[51] Enabling audit logging for perf_event_open syscalls via auditd rules (e.g., -a always,exit -F arch=b64 -S perf_event_open -k perf_access) allows tracking and alerting on monitoring attempts, facilitating forensic analysis and policy enforcement.
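These two practices translate into small configuration fragments, sketched here with illustrative file paths:

```
# /etc/sysctl.d/99-perf.conf — restrict unprivileged users to per-process,
# user-space-only events (applied with `sysctl --system`):
kernel.perf_event_paranoid = 2

# /etc/audit/rules.d/perf.rules — log every perf_event_open syscall
# (loaded with `augenrules --load`):
-a always,exit -F arch=b64 -S perf_event_open -k perf_access
```

The -k perf_access key lets the resulting audit records be retrieved later with `ausearch -k perf_access`.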