Epoll
epoll is a Linux-specific API for scalable I/O event notification, enabling efficient monitoring of multiple file descriptors for readiness to perform input/output operations.[1] Introduced in Linux kernel version 2.5.44 in October 2002, it serves as an alternative to earlier mechanisms like
select(2) and poll(2), offering improved performance for applications handling large numbers of connections, such as web servers.[1][2]
The core of epoll revolves around an in-kernel data structure called an epoll instance, which maintains two lists: an interest list of file descriptors to watch and a ready list of those ready for I/O.[1] This instance is created using epoll_create(2) or epoll_create1(2), with file descriptors added, modified, or removed via epoll_ctl(2).[1] Events are then retrieved efficiently using epoll_wait(2), which blocks until at least one descriptor becomes ready or a timeout expires.[1]
Unlike select(2) and poll(2), which require scanning all monitored descriptors on each call and thus scale poorly with increasing numbers (O(n) time complexity), epoll achieves better scalability by leveraging kernel-level event delivery, avoiding unnecessary user-kernel data copies and supporting up to thousands of descriptors with minimal overhead.[1] It operates in two modes: level-triggered (LT, the default), which notifies as long as data remains available, and edge-triggered (ET), which signals only on state changes and demands non-blocking file descriptors to prevent blocking.[1] While highly efficient, epoll has limitations, including a configurable per-user cap on total watches (default around 1/25th of available low memory) and Linux exclusivity, with no direct POSIX equivalent.[1][3]
Overview
Definition and Purpose
Epoll is a Linux kernel facility for I/O event notification, enabling a process to monitor multiple file descriptors for readiness events such as readability, writability, or errors.[1] It serves as an API for I/O multiplexing, allowing a single thread to efficiently manage I/O operations across numerous descriptors without the overhead of traditional polling mechanisms.[1] The primary purpose of epoll is to support scalable I/O handling in high-performance applications, particularly those dealing with a large number of concurrent connections, by providing constant-time complexity for event retrieval regardless of the total number of monitored descriptors.[4] This contrasts with earlier methods like select and poll, which exhibit linear O(n) complexity in scanning all descriptors, making epoll ideal for scenarios requiring low-latency event detection.[1]
In its basic workflow, an application first creates an epoll instance to serve as a container for monitored file descriptors, then adds or modifies descriptors along with the specific events of interest using control operations, and finally blocks to wait for notifications of ready events.[1] Epoll supports two notification modes—level-triggered, which signals continuously while conditions persist, and edge-triggered, which signals only on state changes—to suit different use cases.[1]
A key benefit of epoll is its ability to reduce CPU usage in environments with many idle connections, such as web servers, by avoiding unnecessary scans of inactive descriptors and minimizing kernel-to-user-space data copying.[4] This efficiency enables servers to handle thousands of concurrent clients with predictable performance and low overhead.[1]
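This basic workflow can be illustrated with a minimal sketch (error handling abbreviated; sock_fd stands for any already-opened pollable descriptor the application wants to monitor):

    #include <sys/epoll.h>
    #include <unistd.h>

    /* Minimal event loop: create an instance, register interest, wait for events. */
    int run_loop(int sock_fd)
    {
        int epfd = epoll_create1(0);                    /* create the epoll instance */
        if (epfd == -1)
            return -1;

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock_fd };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev) == -1)
            return -1;                                  /* register interest in readability */

        struct epoll_event ready[64];
        for (;;) {
            int n = epoll_wait(epfd, ready, 64, -1);    /* block until events arrive */
            if (n == -1)
                break;                                  /* error or signal; bail out */
            for (int i = 0; i < n; i++) {
                /* ready[i].data.fd is now readable; handle it here */
            }
        }
        close(epfd);
        return -1;
    }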
Historical Development
Epoll was introduced by Davide Libenzi as an experimental feature in Linux kernel version 2.5.44 in late 2002, aimed at resolving scalability limitations in network servers that relied on earlier mechanisms like select and poll, which struggled with high numbers of file descriptors due to their O(n) scanning overhead.[5][1] This initial implementation focused on providing an efficient event notification system for handling large-scale concurrent I/O operations, particularly in high-performance computing environments.[6] The interface was refined through subsequent development cycles and stabilized with the release of Linux kernel 2.6.0 in December 2003, marking epoll's transition from experimental status to a core component for production use in high-concurrency applications.[1] Early adoption followed in web servers such as Nginx, which leveraged epoll from its inception around 2004 for efficient non-blocking I/O handling, and later in Apache HTTP Server with the introduction of the event-based MPM in version 2.4 in 2012.[7] Although epoll was initially focused on scalable network I/O for sockets and pipes, it has supported monitoring of any pollable file descriptor since its inception; regular files and directories, which do not provide a poll interface, cannot be registered and cause epoll_ctl to fail with EPERM.[1] A notable enhancement came in kernel 2.6.22 in 2007 with the introduction of eventfd, a mechanism for user-space event signaling that integrates seamlessly with epoll for inter-thread communication and timer notifications.[8][9] Other enhancements include the EPOLLONESHOT flag in kernel 2.6.2 (2004) for one-time event notifications per file descriptor, and EPOLLEXCLUSIVE in kernel 4.5 (2016) to optimize shared epoll instances in multi-threaded applications by ensuring that only one waiter is woken.[1] As of 2025, epoll continues to receive stability improvements, such as fixes for file descriptor lifetime races in kernel 6.10 (2024), solidifying its role in scalable, event-driven architectures in Linux-based systems.[10]
Comparisons
With select and poll
The select system call, a traditional mechanism for I/O multiplexing in Unix-like systems, exhibits several limitations that hinder its scalability for high-concurrency applications. It requires scanning all monitored file descriptors on each invocation, resulting in O(n) time complexity where n is the number of file descriptors, as the kernel must check the entire interest set for readiness. Additionally, the glibc implementation fixes the fd_set size to 1024 descriptors by default via the FD_SETSIZE constant, imposing a practical limit on the number of file descriptors that can be monitored without recompiling the library or using workarounds. Each call to select also necessitates rebuilding the fd_set in user space by zeroing and resetting bits for interested events, incurring repeated data copying overhead between user and kernel space.[11][12]
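For illustration, a typical select(2) iteration must rebuild the fd_set from the application's own descriptor list before every call (conn_fds and num_conns are hypothetical application state):

    #include <sys/select.h>

    /* Illustrative only: the fd_set is zeroed and repopulated on every call,
       and the kernel scans it linearly to find ready descriptors. */
    void select_iteration(const int *conn_fds, int num_conns)
    {
        fd_set readfds;
        int maxfd = -1;

        FD_ZERO(&readfds);
        for (int i = 0; i < num_conns; i++) {        /* O(n) rebuild in user space */
            FD_SET(conn_fds[i], &readfds);
            if (conn_fds[i] > maxfd)
                maxfd = conn_fds[i];
        }

        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) > 0) {
            for (int i = 0; i < num_conns; i++)      /* O(n) scan for ready bits */
                if (FD_ISSET(conn_fds[i], &readfds)) {
                    /* handle conn_fds[i] */
                }
        }
    }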
The poll system call addresses some of select's shortcomings while retaining core inefficiencies. Unlike select, poll uses a dynamic array of pollfd structures, eliminating the fixed 1024-descriptor limit and allowing monitoring of arbitrarily large sets bounded only by system resources like memory. However, it still operates with O(n) time complexity, as both declaring interests and retrieving ready events require the kernel to iterate over the entire array on each call, leading to similar scanning overhead for large numbers of descriptors. Like select, poll demands reconstructing the interest array in user space for every invocation, perpetuating data copying costs.[12][13]
Epoll surpasses both select and poll by decoupling event registration from retrieval, enabling more efficient management of large descriptor sets in the Linux kernel. It employs a kernel-managed red-black tree to store the interest set, achieving O(log n) complexity for additions and removals via epoll_ctl, while a separate ready list facilitates O(1) delivery of pending events through epoll_wait. This design avoids user-space iteration over all descriptors, as the kernel maintains persistent interest registrations and only returns ready events, minimizing data copying and eliminating the need to rebuild sets repeatedly. Epoll's modes—level-triggered and edge-triggered—provide flexibility beyond basic polling semantics, with edge-triggered mode notifying only on state changes to further reduce wakeups.[14][12]
In practice, these differences yield substantial performance gains for scenarios involving many connections, such as web servers handling thousands of concurrent clients. For instance, with 10,000 idle connections, select and poll can experience up to 79% throughput degradation due to exhaustive scanning of inactive descriptors, whereas epoll maintains near-constant performance by delivering only relevant events, avoiding wasted cycles on non-ready file descriptors.[12]
With kqueue and IOCP
Kqueue provides a scalable event notification mechanism on BSD systems and macOS, serving as an efficient alternative to select and poll for monitoring diverse events across numerous descriptors.[15] It employs an event queue model akin to epoll, where the kqueue() system call establishes a notification channel, and kevent() handles both event registration via a changelist and retrieval of pending events with optional timeouts.[16] Kqueue supports level-triggered notifications by default, with edge-triggering enabled through flags like EV_CLEAR or EV_ONESHOT, and utilizes extensible filters to cover file descriptor I/O (EVFILT_READ and EVFILT_WRITE), signals (EVFILT_SIGNAL), filesystem modifications (EVFILT_VNODE), process events (EVFILT_PROC), and asynchronous I/O completions (EVFILT_AIO).[15] However, kqueue remains specific to BSD-derived operating systems, including FreeBSD, OpenBSD, NetBSD, and macOS, and is unavailable natively on Linux.[16]
I/O Completion Ports (IOCP) offer Windows' approach to managing asynchronous I/O on multiprocessor systems through a queue-based threading model that enhances scalability for high-volume operations.[17] IOCP relies on a thread pool to process completions, where CreateIoCompletionPort creates the port and associates file handles (such as sockets) for overlapped I/O, while GetQueuedCompletionStatus dequeues completion packets in FIFO order, including details on transferred bytes and errors.[17] This design excels in handling concurrent overlapped I/O requests but demands complex state management, as buffers must persist until completion and operations require explicit queuing, without a direct equivalent to polling-based readiness checks.[17]
Both epoll and kqueue achieve O(1) scalability for event notifications, avoiding the O(n) overhead of traditional mechanisms, but kqueue provides a more unified framework with broader event filters—including direct signal handling—that epoll omits, as epoll focuses exclusively on file descriptor I/O events like readability and writability.[15][1] Portability libraries like libevent bridge this gap by abstracting epoll and kqueue into a consistent API, enabling developers to register callbacks for events without platform-specific code.[18]
Epoll operates in a synchronous, non-blocking readiness model—alerting when a descriptor is ready for I/O, after which the application performs the operation—whereas IOCP delivers fully asynchronous completion notifications, queuing overlapped operations and signaling only upon their finish, which simplifies kernel-user transitions but complicates buffer and state handling. Linux's io_uring, introduced in kernel 5.1 and still evolving as of kernel 6.12, provides a completion-based model akin to IOCP that supports asynchronous I/O with fewer context switches.[17][19][20]
Epoll's confinement to Linux prompts cross-platform projects, such as Node.js via its libuv library, to employ wrappers that dynamically select epoll on Linux, kqueue on BSD and macOS, or IOCP on Windows for unified I/O multiplexing.[21]
Core API
System Calls
The epoll interface centers on three kinds of system calls—one for creating an epoll instance, one for controlling event interest on file descriptors, and one for waiting for I/O events—each with newer variants.[1]
epoll_create() creates an epoll instance and returns a file descriptor referring to that instance, which is used in subsequent epoll operations; the size parameter, specifying a hint for the number of file descriptors to track, has been ignored since Linux 2.6.8 but must still be greater than zero.[22] The function's synopsis is int epoll_create(int size);, and on success it returns a nonnegative file descriptor, or -1 on error with errno set.[22] Common errors include EINVAL if size is not positive, EMFILE if the per-process limit on open files is reached, ENFILE if the system-wide limit is exceeded, and ENOMEM if insufficient kernel memory is available.[22]
epoll_create1(), introduced in Linux 2.6.27, is a variant that omits the size parameter and accepts a flags argument to control the behavior of the returned file descriptor.[22] The synopsis is int epoll_create1(int flags);, where flags can be 0 or EPOLL_CLOEXEC to set the close-on-exec (FD_CLOEXEC) flag, ensuring the descriptor is automatically closed when the process executes another program via execve(2), for example in a child created by fork().[22] It returns a nonnegative file descriptor on success or -1 on error, with errors such as EINVAL for invalid flags or the memory-related ones noted above.[22]
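A minimal usage sketch:

    #include <stdio.h>
    #include <sys/epoll.h>

    /* Create an epoll instance whose descriptor is closed automatically on execve(2). */
    int make_epoll(void)
    {
        int epfd = epoll_create1(EPOLL_CLOEXEC);
        if (epfd == -1)
            perror("epoll_create1");   /* e.g. EINVAL, EMFILE, ENFILE, or ENOMEM */
        return epfd;
    }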
epoll_ctl() manages the interest set for an epoll instance by adding, modifying, or deleting entries for target file descriptors.[23] Its synopsis is int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);, where epfd is the epoll file descriptor, op specifies the operation (EPOLL_CTL_ADD to add fd with the events in event, EPOLL_CTL_MOD to modify an existing entry, or EPOLL_CTL_DEL to remove fd, in which case event may be NULL), fd is the target file descriptor, and event points to a structure defining the events of interest.[23] The struct epoll_event contains a bitmask events for I/O events such as EPOLLIN (data available to read), EPOLLOUT (ready to write), EPOLLPRI (high-priority data available), EPOLLERR (error condition, always enabled), and EPOLLHUP (hangup, always enabled), along with optional flags like EPOLLET for edge-triggered mode; it also includes a data field (the epoll_data_t union) for user-defined data associated with the file descriptor, which is returned unmodified by epoll_wait().[23] On success, epoll_ctl() returns 0; on failure, it returns -1 with errno set to values like EBADF (invalid epfd or fd), EEXIST (attempt to add an already registered fd), EINVAL (invalid operation or event flags, such as using EPOLLEXCLUSIVE with a non-socket fd), ENOENT (modify or delete on an unregistered fd), ENOMEM (insufficient memory), or ENOSPC (exceeded system limit on epoll watches).[23]
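The three operations can be sketched as follows (epfd and fd are assumed to be valid descriptors; error handling abbreviated):

    #include <string.h>
    #include <sys/epoll.h>

    int watch_fd(int epfd, int fd)
    {
        struct epoll_event ev;
        memset(&ev, 0, sizeof ev);
        ev.events  = EPOLLIN | EPOLLRDHUP;   /* readable data or peer shutdown */
        ev.data.fd = fd;                     /* returned untouched by epoll_wait() */

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
            return -1;                       /* e.g. EEXIST, EBADF, ENOMEM */

        ev.events = EPOLLIN | EPOLLOUT;      /* later: also watch for writability */
        if (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) == -1)
            return -1;                       /* ENOENT if fd was never added */

        return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);  /* event may be NULL here */
    }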
epoll_wait() suspends the calling process until at least one event of interest becomes ready on an epoll instance or a timeout expires, populating an array with ready events.[24] The synopsis is int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);, where events is a pointer to an array of struct epoll_event to receive up to maxevents ready events (must be greater than 0), and timeout is the number of milliseconds to wait (-1 for indefinite blocking, 0 for non-blocking poll).[24] It returns the number of ready events on success (which may be 0 on timeout), or -1 on error with errno set.[24] Errors include EBADF (invalid epfd), EFAULT (inaccessible events memory), EINTR (interrupted by signal), and EINVAL (invalid epfd or maxevents ≤ 0).[24]
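A sketch distinguishing the possible outcomes of a single wait (epfd assumed valid):

    #include <errno.h>
    #include <sys/epoll.h>

    int wait_once(int epfd)
    {
        struct epoll_event events[32];
        int n = epoll_wait(epfd, events, 32, 1000);   /* timeout in milliseconds */

        if (n == 0)
            return 0;                  /* timeout expired, nothing ready */
        if (n == -1) {
            if (errno == EINTR)
                return 0;              /* interrupted by a signal; retry later */
            return -1;                 /* EBADF, EFAULT, or EINVAL */
        }
        for (int i = 0; i < n; i++) {
            /* events[i].events and events[i].data describe each ready descriptor */
        }
        return n;
    }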
epoll_pwait(), available since Linux 2.6.19, extends epoll_wait() by accepting an optional signal mask to atomically enable specific signals during the wait, preventing unintended interruptions.[24] Its synopsis is int epoll_pwait(int epfd, struct epoll_event *events, int maxevents, int timeout, const sigset_t *sigmask);, with the same parameters as epoll_wait() plus sigmask (NULL to use the process's signal mask); this ensures the signal mask change and wait are performed atomically, equivalent to a protected sequence of pthread_sigmask() and epoll_wait().[24] Return values and errors mirror those of epoll_wait(), with the atomicity providing guarantees against race conditions in signal handling during the wait operation.[24]
epoll_pwait2(), introduced in Linux 5.11, further extends epoll_pwait() by accepting a timespec structure for the timeout parameter, enabling nanosecond precision instead of milliseconds.[24] Its synopsis is int epoll_pwait2(int epfd, struct epoll_event *events, int maxevents, const struct timespec *timeout, const sigset_t *sigmask);, where timeout is a pointer to a timespec (NULL for indefinite, non-NULL with tv_sec=0 and tv_nsec=0 for non-blocking). It shares the same return values and errors as epoll_pwait(), offering higher timeout resolution for applications requiring fine-grained control.[24]
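A sketch of a fine-grained wait, assuming a kernel of at least 5.11 and a C library that exposes the epoll_pwait2() wrapper; the signal mask blocks everything except SIGTERM for the duration of the call:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <sys/epoll.h>
    #include <time.h>

    int wait_fine_grained(int epfd)
    {
        struct epoll_event events[16];
        struct timespec timeout = { .tv_sec = 0, .tv_nsec = 250 * 1000 * 1000 };  /* 250 ms */

        sigset_t mask;
        sigfillset(&mask);             /* block every signal during the wait ...  */
        sigdelset(&mask, SIGTERM);     /* ... except SIGTERM, which may interrupt */

        return epoll_pwait2(epfd, events, 16, &timeout, &mask);
    }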
Data Structures
The primary user-space data structure for epoll operations is struct epoll_event, which is used to register interest in events for a file descriptor via epoll_ctl(2) and to retrieve notified events via epoll_wait(2).[23] This structure consists of two main fields: a uint32_t events member that holds a bitmask specifying the types of events to monitor or that have occurred, and an epoll_data_t data union for associating user-defined data with the event.[25]
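The declarations, as documented in the epoll_ctl(2) manual page (architecture-specific packing attributes omitted):

    #include <stdint.h>

    union epoll_data {
        void     *ptr;
        int       fd;
        uint32_t  u32;
        uint64_t  u64;
    };
    typedef union epoll_data epoll_data_t;

    struct epoll_event {
        uint32_t      events;   /* bitmask of epoll events (EPOLLIN, EPOLLOUT, ...) */
        epoll_data_t  data;     /* user data variable, opaque to the kernel */
    };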
The events field is a bitmask composed of flags that indicate input events (related to reading or exceptional conditions), output events (related to writing), priority data availability, and error states.[1] Key input event flags include EPOLLIN (data available for reading), EPOLLPRI (high-priority data available, analogous to POLLPRI in poll(2)), EPOLLRDHUP (remote peer has closed the connection or shut down its writing half, available since Linux 2.6.17), EPOLLHUP (hang-up or disconnection detected), and EPOLLERR (error condition on the file descriptor).[23] The primary output event flag is EPOLLOUT (file descriptor is writable, such as buffer space available for writing).[1] These flags can be combined using bitwise OR operations to monitor multiple conditions simultaneously, and control flags like EPOLLET (edge-triggered mode) or EPOLLONESHOT (one-shot behavior) may also be set, though they modify event delivery semantics rather than indicating event types.[23]
The epoll_data_t union provides flexibility for attaching arbitrary user data to an event, enabling efficient per-file-descriptor context without additional lookups during event handling.[25] It includes four members: void *ptr (a generic pointer, often used to store a reference to an application-specific object like a connection structure), int fd (the file descriptor itself), uint32_t u32 (a 32-bit unsigned integer), and uint64_t u64 (a 64-bit unsigned integer).[25] The union's size is determined by its largest member (u64, 8 bytes), allowing seamless storage and retrieval of context data returned by epoll_wait(2).[24]
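A common pattern (illustrative; struct connection is a hypothetical application type) stores a pointer to per-connection state in data.ptr so that the event handler needs no separate fd-to-state lookup:

    #include <stdlib.h>
    #include <sys/epoll.h>

    struct connection {
        int  fd;
        char buf[4096];
        /* parser state, timestamps, etc. */
    };

    int register_connection(int epfd, int fd)
    {
        struct connection *conn = calloc(1, sizeof *conn);
        if (conn == NULL)
            return -1;
        conn->fd = fd;

        struct epoll_event ev = { .events = EPOLLIN, .data.ptr = conn };
        return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    /* Later, in the event loop:
           struct connection *conn = events[i].data.ptr;
       recovers the context directly from the returned event.                  */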
In practice, struct epoll_event occupies 12 bytes on 32-bit systems and on x86-64, where it is declared with a packed attribute so that the 32-bit layout is preserved across the kernel ABI; on 64-bit architectures that do not pack the structure, alignment padding after the 4-byte events field brings it to 16 bytes. Arrays of this structure are passed to epoll_wait(2) to batch-retrieve multiple events in a single call, with the kernel populating the events and data fields for each ready file descriptor up to the specified maximum count.[24]
Event Notification Modes
Level-Triggered Mode
Level-triggered (LT) mode is the default notification mode in epoll, activated when the EPOLLET flag is not specified during event registration with epoll_ctl(2).[1] In this mode, epoll_wait(2) signals an event for a monitored file descriptor whenever the associated condition—such as data availability on a socket—holds true, regardless of whether the event was previously reported.[1] The behavior of LT mode ensures persistent notifications: repeated calls to epoll_wait(2) will continue to return the file descriptor as ready until the condition is fully resolved, for instance, by reading all pending data from a socket buffer.[1] This mirrors the semantics of traditional interfaces like select(2) and poll(2), where readiness is checked against the current state each time, but epoll achieves this more efficiently by separating event registration from retrieval, reducing data copying overhead.[1][12]
LT mode is particularly suitable for applications transitioning from select(2) or poll(2), offering simpler event handling for beginners while guaranteeing that no events are missed as long as the monitoring loop runs.[1] However, it requires careful data draining to avoid busy-waiting loops, where the application repeatedly processes the same ready descriptor without progress.[12]
For example, consider a non-blocking TCP socket with 2 KB of incoming data in its receive buffer; in LT mode, the first epoll_wait(2) call reports EPOLLIN, and if only 1 KB is read via recv(2), subsequent epoll_wait(2) calls will still report the socket as ready until the remaining data is consumed.[1] In contrast to edge-triggered mode, LT mode provides ongoing notifications for the duration of the condition rather than a single alert per state change.[1]
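An illustrative level-triggered read handler: it may safely consume only part of the pending data, because epoll_wait(2) keeps reporting EPOLLIN until the buffer is drained:

    #include <unistd.h>

    /* Level-triggered handling: reading a bounded slice per wakeup still makes
       progress, since leftover bytes cause the descriptor to be reported again. */
    void on_readable_lt(int fd)
    {
        char buf[1024];
        ssize_t n = read(fd, buf, sizeof buf);   /* consume at most 1 KB per event */
        if (n > 0) {
            /* process buf[0..n); remaining data triggers another EPOLLIN */
        }
    }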
Edge-Triggered Mode
Edge-triggered (ET) mode in epoll is activated by specifying the EPOLLET flag when adding or modifying a file descriptor using epoll_ctl(2).[1] In this mode, epoll_wait(2) reports an event only upon a state transition for the monitored file descriptor, such as when data becomes available for reading (from unread to readable), rather than repeatedly while the condition remains true.[1] Unlike the default level-triggered mode, which signals continuously as long as the condition persists, ET mode provides a single notification per change, promoting efficiency by minimizing unnecessary wakeups.[1]
The behavior of ET mode ensures that events are generated only for each distinct chunk of incoming data or state change; for instance, if 2 kB of data is written to a socket, consuming only 1 kB will not trigger another event until additional data arrives.[1] To fully process events without missing data, file descriptors must be set to non-blocking mode using fcntl(2), and operations like read(2) or write(2) should continue in a loop until they return EAGAIN or EWOULDBLOCK, indicating no more immediate work.[1] Failure to drain the file descriptor completely can lead to data starvation, where subsequent events are not notified until the next state change, potentially causing the application to hang on partially processed I/O.[1]
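A sketch of the corresponding edge-triggered pattern: the descriptor is switched to non-blocking mode, and each notification is drained until read(2) reports EAGAIN:

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        return flags == -1 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    void on_readable_et(int fd)
    {
        char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0) {
                /* process buf[0..n) and keep looping */
            } else if (n == 0) {
                break;                                   /* peer closed the connection */
            } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
                break;                                   /* drained; wait for next edge */
            } else {
                break;                                   /* real error, e.g. ECONNRESET */
            }
        }
    }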
ET mode is particularly suited for high-performance, event-driven applications requiring minimal system call overhead and reduced context switches, such as web servers handling massive concurrency.[1] It is commonly employed in Nginx, where all I/O events operate in ET mode to trigger notifications only on socket state changes, enhancing scalability by avoiding redundant polling.[7]
A representative example involves an incoming TCP connection on a listening socket monitored with EPOLLIN | EPOLLET: the initial acceptance triggers a single event, after which the application must call accept(2) and then loop on read(2) from the new socket until EAGAIN to consume all buffered data, ensuring no pending bytes are overlooked until the next edge transition.[1]
Implementation and Performance
Kernel Internals
The kernel implementation of epoll centers on the struct eventpoll, a core data structure that encapsulates an epoll instance and is stored as the private_data in the associated file structure. This structure incorporates several key components for efficient event management: a red-black tree rooted in struct rb_root_cached rbr to store monitored file descriptors, enabling O(log n) insertion, deletion, and lookup operations; a doubly-linked list via struct list_head rdllist to hold ready events for O(1) retrieval during event polling; a mutex mtx for synchronizing access; and spinlocks to protect the ready list during concurrent modifications.[14] Additionally, it includes wait queue heads such as wq for blocking processes and poll_wait for integration with file poll operations, ensuring thread-safe handling of event notifications.[1][14]
Event tracking occurs through struct epitem, allocated for each watched file descriptor and linked into the red-black tree by its file descriptor value for quick searches. Each epitem maintains the interested events in a struct epoll_event, tracks pending events via bitmasks, and includes a struct list_head rdllink for insertion into the ready list when events trigger. The kernel associates these items with the target file's wait queue using struct eppoll_entry, which queues the epoll instance onto the file's poll wait list during epoll_ctl additions, allowing direct propagation of readiness signals without repeated polling.[14][26]
The wakeup process is initiated kernel-side when an underlying file descriptor becomes ready, such as through a TCP socket callback or pipe write. In this scenario, the file's poll callback invokes ep_poll_callback, which checks for matching interested events in the epitem and, if present, moves the item to the tail of the ready list using list_add_tail without any user-space intervention. This triggers a wakeup via wake_up_locked on the epoll wait queue, potentially awakening blocked threads, while ep_poll_safewake guards against unbounded recursion when epoll instances monitor one another. The process integrates with the kernel scheduler through schedule_hrtimeout_range for timeout handling in blocking waits, ensuring efficient resumption of tasks.[14][1]
Epoll's multi-file-descriptor support extends to various kernel objects, including pipes, sockets, and eventfd, as long as they implement a poll method verifiable by file_can_poll. During addition via do_epoll_ctl, the kernel links the epitem to the file's wait queue, enabling event propagation across these types; for instance, an eventfd write directly signals the associated epoll instance. This design allows seamless integration with the scheduler for waking threads across diverse I/O sources, maintaining low overhead even with thousands of descriptors.[14][27]
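The eventfd integration can be observed from user space with a short sketch (error handling abbreviated): a write to the eventfd immediately wakes an epoll_wait on the instance that monitors it:

    #include <stdint.h>
    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int demo_eventfd_wakeup(void)
    {
        int epfd = epoll_create1(0);
        int efd  = eventfd(0, EFD_NONBLOCK);

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

        uint64_t one = 1;
        write(efd, &one, sizeof one);            /* typically done by another thread */

        struct epoll_event out;
        int n = epoll_wait(epfd, &out, 1, 1000); /* returns 1 with EPOLLIN set */

        uint64_t counter;
        read(efd, &counter, sizeof counter);     /* resets the eventfd counter */

        close(efd);
        close(epfd);
        return n;
    }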
Scalability Advantages
Epoll offers significant scalability advantages over traditional multiplexing mechanisms like select and poll, primarily through improved time complexities in its core operations. The epoll_ctl system call, used for adding, modifying, or deleting file descriptors from the interest set, operates with O(log n) time complexity, where n is the number of monitored descriptors, owing to the kernel's use of a self-balancing red-black tree for efficient insertions and deletions.[28] In comparison, epoll_wait achieves an amortized O(1) complexity for notifying ready events when few are active, scaling to O(m) where m is the number of ready events, in stark contrast to the O(n) linear scan required by select and poll on every invocation, which becomes prohibitive as n grows large.[12]
Resource-wise, epoll minimizes overhead by maintaining a persistent kernel-side data structure for the interest set, consuming roughly 160 bytes per registered file descriptor on 64-bit systems.[1] This efficiency enables applications to monitor millions of file descriptors without incurring memory bloat or corresponding CPU spikes from repeated user-kernel copies of descriptor lists, as occurs with select and poll.
Benchmarks highlight these gains in the context of the C10K problem, where handling 10,000 concurrent connections is key. In tests using an event-driven HTTP server with 10,000 idle connections alongside active ones, epoll sustained reply rates of around 25,000 requests per second with minimal degradation, while select and poll dropped by up to 79% to approximately 5,000 requests per second due to scanning overhead.[12] Real-world systems like Redis and memcached exploit epoll's scalability to manage thousands of concurrent client connections in production environments, achieving high throughput without proportional resource escalation.[29]
Despite these strengths, epoll's scalability has limits: epoll_wait processes events linearly in m, the count of ready descriptors, so it excels in sparse readiness scenarios where m << n but may approach poll-like costs when most descriptors become active simultaneously.[12]
Limitations and Issues
Known Bugs
In versions of the Linux kernel prior to 2.6.28, a race condition in epoll_wait could lead to double insertion of events into the ready list under high load conditions, particularly when memory faults (EFAULT) occurred during event copying, resulting in incorrect event counts returned to userspace. This issue was addressed in kernel 2.6.28 by a commit that avoided double-inserts in the event delivery path through improved error handling and locking around the event queue operations.[30]
The edge-triggered mode (EPOLLET) has a documented behavior since its introduction in kernel 2.6 where partial reads or writes on stream-oriented file descriptors (such as pipes or sockets) can lead to data starvation if not all available data is consumed before the next epoll_wait call, as no further events are generated until a new state change occurs.[1] Although not classified as a bug, this can manifest as missed data under certain configurations, and the epoll(7) documentation emphasizes the need for non-blocking I/O and exhaustive draining of file descriptors in EPOLLET mode to prevent indefinite stalls.[1]
In kernels from version 3.9 onward, a race condition existed in epoll_ctl when modifying interest sets concurrently with closing file descriptors, potentially leading to use-after-free vulnerabilities if the kernel accessed freed file structures during event delivery. This was mitigated in kernel 3.14 (released in 2014) through reference counting improvements, including the use of RCU (Read-Copy-Update) to optimize EPOLL_CTL_DEL operations and ensure safe traversal of the epoll file tree without immediate freeing of structures under concurrent close calls.[31]
In the Linux kernel prior to 6.10 (released in 2024), a race condition in the epoll subsystem could cause the file reference count (f_count) to reach zero prematurely, leading to a use-after-free vulnerability where a dead file pointer is accessed, potentially allowing local privilege escalation or denial of service (CVE-2024-38580). This was fixed by improving file lifetime management in epoll's vfs_poll callbacks to prevent invalid reference drops under concurrent operations.[10]
As of kernel 6.12 (released in November 2024), a race in eventpoll allowed improper decrement of the epoll instance reference count while holding the mutex, potentially leading to use-after-free during concurrent access (CVE-2025-38349). The fix ensures reference counting is performed outside the critical section to avoid races in multi-threaded environments.[32]
