Slab allocation
Slab allocation is a memory management mechanism intended for the efficient memory allocation of objects. In comparison with earlier mechanisms, it reduces fragmentation caused by allocations and deallocations. This technique is used for retaining allocated memory containing a data object of a certain type for reuse upon subsequent allocations of objects of the same type. It is analogous to an object pool, but only applies to memory, not other resources.
Slab allocation was first introduced in the Solaris 2.4 kernel by Jeff Bonwick.[1] Bonwick claims the name "Slab" comes from a Kellogg's cereal commercial catchphrase rhyme, "grab a slab".[2]
Slab allocation is now widely used by many Unix and Unix-like operating systems including FreeBSD[3] and Linux,[4] where it is used both in the original SLAB allocator and in its replacement, SLUB.[5]
Basis
Slab allocation significantly reduces the frequency of computationally costly initialization and destruction of kernel data-objects, which can outweigh the cost of allocating memory for them.[1] When the kernel creates and deletes objects often, overhead costs of initialization can result in significant performance drops. Object caching leads to less frequent invocation of functions which initialize object state: when a slab-allocated object is released after use, the slab allocation system typically keeps it cached (rather than doing the work of destroying it) ready for re-use next time an object of that type is needed (thus avoiding the work of constructing and initialising a new object).
With slab allocation, a cache for a certain type or size of data object has a number of pre-allocated "slabs" of memory; within each slab there are memory chunks of fixed size suitable for the objects.[6] The slab allocator keeps track of these chunks, so that when it receives a request to allocate memory for a data object of a certain type, usually it can satisfy the request with a free slot (chunk) from an existing slab. When the allocator is asked to free the object's memory, it just adds the slot to the containing slab's list of free (unused) slots. The next call to create an object of the same type (or allocate memory of the same size) will return that memory slot (or some other free slot) and remove it from the list of free slots. This process eliminates the need to search for suitable memory space and greatly alleviates memory fragmentation. In this context, a slab is one or more contiguous pages in the memory containing pre-allocated memory chunks.
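As an illustration of this mechanism, the following user-space C sketch shows one slab of fixed-size chunks whose free slots are tracked as a simple stack of indices; the names (slab_t, slab_alloc, and so on) are invented for the example, and malloc stands in for the kernel's page allocation:

#include <stdlib.h>

#define SLOT_SIZE 128
#define NSLOTS    32

/* One slab: NSLOTS pre-allocated fixed-size chunks, with the free
 * slots tracked as a stack of indices.  All names are invented. */
typedef struct slab {
    unsigned char data[NSLOTS][SLOT_SIZE];  /* the memory chunks        */
    int free_stack[NSLOTS];                 /* indices of free chunks   */
    int nr_free;                            /* how many chunks are free */
} slab_t;

static slab_t *slab_new(void)
{
    slab_t *s = malloc(sizeof(*s));         /* stand-in for page allocation */
    if (!s)
        return NULL;
    for (int i = 0; i < NSLOTS; i++)
        s->free_stack[i] = i;               /* initially every slot is free */
    s->nr_free = NSLOTS;
    return s;
}

/* Allocation: pop a free slot index; no searching is needed. */
static void *slab_alloc(slab_t *s)
{
    if (s->nr_free == 0)
        return NULL;                        /* caller would grow the cache */
    return s->data[s->free_stack[--s->nr_free]];
}

/* Deallocation: push the slot back on the free list; the memory
 * itself stays inside the slab for the next allocation. */
static void slab_free(slab_t *s, void *obj)
{
    int idx = (int)(((unsigned char *)obj - &s->data[0][0]) / SLOT_SIZE);
    s->free_stack[s->nr_free++] = idx;
}

Both slab_alloc and slab_free run in constant time, because the free-slot stack removes any need to search for suitable memory.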
Implementation
The slab allocation algorithm defines the following terms:
- Cache: a cache is a store of pre-allocated memory for a specific type of object, such as semaphores, process descriptors, file objects, etc.
- Slab: a slab represents a contiguous piece of memory, usually made of several virtually contiguous pages. The slab is the actual container of the data for objects of the kind handled by the containing cache.
When a program sets up a cache, it allocates a number of objects to the slabs associated with that cache. This number depends on the size of the associated slabs.
Slabs may exist in one of the following states:
- empty – all objects on a slab marked as free
- partial – slab consists of both used and free objects
- full – all objects on a slab marked as used
Initially, the system marks each slab as "empty". When the process calls for a new kernel object, the system tries to find a free location for that object on a partial slab in a cache for that type of object. If no such location exists, the system allocates a new slab from contiguous virtual pages and assigns it to a cache. The new object is allocated from this slab, and the slab becomes marked as "partial".
The allocation takes place quickly, because the system builds the objects in advance and readily allocates them from a slab.
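A minimal C sketch of this state classification, with invented names and a simple in-use counter standing in for real slab metadata, might look like:

/* Hedged sketch of the three slab states described above. */
enum slab_state { SLAB_EMPTY, SLAB_PARTIAL, SLAB_FULL };

struct slab_hdr {
    unsigned int objects;   /* total object slots in the slab */
    unsigned int in_use;    /* slots currently allocated      */
};

static enum slab_state slab_classify(const struct slab_hdr *s)
{
    if (s->in_use == 0)
        return SLAB_EMPTY;        /* all objects marked as free */
    if (s->in_use == s->objects)
        return SLAB_FULL;         /* all objects marked as used */
    return SLAB_PARTIAL;          /* mix of used and free objects */
}

A cache would then keep separate lists per state and prefer partial slabs when servicing requests, falling back to allocating a new slab only when necessary.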
Implementation techniques
Free lists
A slab represents one memory allocation made to the cache from the machine, whose size is customarily a multiple of the page size. The slab is divided into a number of entries, which are then handed out by the cache as the client code requests memory for new objects. It is then necessary to keep track of which parts of the slab are free to use and which are already occupied. This is generally done using "free lists": lists of free entries in the slab ready to store new objects.
The free list may be a separate data structure, such as an array of indices indicating which entries of the slab are free, or it may be embedded within the slab. The Linux SLUB allocator keeps the free list as a linked list of pointers, each of which is stored directly in the free memory area of the entry it represents.[7]
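The embedded variant can be sketched in a few lines of C; this is an illustration of the general idea rather than actual SLUB code, and the names are made up:

/* Embedded free list: each free chunk's first bytes hold a pointer
 * to the next free chunk, so no separate array is needed. */
struct slab_freelist {
    void *head;                        /* first free chunk, or NULL */
};

static void *fl_pop(struct slab_freelist *fl)
{
    void *obj = fl->head;
    if (obj)
        fl->head = *(void **)obj;      /* next pointer lives inside the chunk */
    return obj;
}

static void fl_push(struct slab_freelist *fl, void *obj)
{
    *(void **)obj = fl->head;          /* reuse the freed chunk's own memory */
    fl->head = obj;
}

Because the next-pointer is written into memory that is free anyway, the free list costs no space beyond the head pointer.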
Slab sizes
Operating systems may use different slab sizes and internal layouts depending on the size of the objects to be stored. Large slabs have a different layout from small slabs because it allows them to pack better into page-size units, which helps with fragmentation. For example, objects that are at least 1/8 of the page size for a given machine may benefit from a "large slab" layout with explicit free lists, while smaller objects may use a "small slab" setup with the free-list tracking embedded in the slab itself. Bonwick's original presentation of the slab allocator already made this distinction between layouts for large and small slabs.[1]
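As a rough sketch of that threshold (assuming 4 KB pages; the function name and constant are illustrative, not taken from any particular kernel):

#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Objects of at least 1/8 page (512 bytes here) get a "large slab"
 * layout with an explicit, external free list; smaller objects keep
 * the free-list links embedded in the slab. */
static bool use_external_freelist(size_t object_size)
{
    return object_size >= PAGE_SIZE / 8;
}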
Systems using slab allocation
- AmigaOS (introduced in AmigaOS 4)
- DragonFly BSD (introduced in release 1.0)
- FreeBSD (introduced in 5.0)
- GNU Mach[8]
- Haiku (introduced in alpha 2)
- Horizon (Nintendo Switch microkernel)[9]
- HP-UX (introduced in 11i)[10]
- Linux (introduced in 2.1.23)
- NetBSD (introduced in 4.0[citation needed])
- Solaris (introduced in 2.4)
- The Perl 5 compiler uses a slab allocator for internal memory management[11][12]
- Memcached uses slab allocation for memory management
- illumos
Notes
1. Jeff Bonwick, The Slab Allocator: An Object-Caching Kernel Memory Allocator (1994)
2. Bonwick, Jeff (14 June 2005). "The story behind the slab allocator". Oracle. Archived from the original on 4 March 2016. Retrieved 27 March 2025.
3. FreeBSD Kernel Developer's Manual
4. M. Tim Jones, Anatomy of the Linux slab allocator. Archived 2 October 2013 at the Wayback Machine.
5. Vlastimil Babka, remove the SLAB allocator
6. Abraham Silberschatz et al.: Operating System Concepts. Wiley: 2004. ISBN 0-471-69466-5
7. Lameter, Christoph. "Slab allocators in the Linux Kernel: SLAB, SLOB, SLUB" (PDF). LinuxCon/Düsseldorf 2014 (Revision Oct 3, 2014).
8. "Gnu Mach Allocator Documentation".
9. "Console Security – Switch (34c3)". media.ccc.de. Retrieved 28 December 2017.
10. Chris Cooper and Chris Moore, HP-UX 11i Internals, Upper Saddle River, New Jersey: Prentice Hall PTR, 2004, ISBN 0-13-032861-8, p. 334.
11. "Perl5-Porters Weekly: 2012 June 17". Retrieved 18 November 2012.
12. Bonwick, Jeff. "Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources". USENIX 2001. Retrieved 18 November 2012.
External links
- FreeBSD uma(9) manual page
- The SLUB allocator, commentary on the management of slabs in Linux by two different allocators: the SLUB allocator and the SLAB allocator
- Memory Compaction v7 (a Linux patch set from Mel Gorman dealing with SLAB fragmentation and compaction issues, 2 April 2010)
- Detecting kernel memory leaks Jonathan Corbet, Linux Weekly News, 2006; includes user comments on garbage collection
- Linux performance: is Linux becoming just too slow and bloated? On SLAB and SLUB. Free software magazine 2010.
The Linux kernel's slab interface provides kmem_cache_create() for initializing caches, kmem_cache_alloc() and kmem_cache_free() for object handling, and kmem_cache_destroy() for cleanup, all of which support flags for behaviors like high-memory allocation or debugging.[3] These developments enhance kernel performance by cutting allocation times—often to near-constant O(1) complexity—and addressing concerns such as memory hotplug and NUMA awareness on modern hardware.[2]
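A hedged sketch of how kernel code might use this interface follows; the foo structure, the cache name, and the surrounding functions are invented for the example, while the kmem_cache_* calls themselves are the interface named above:

#include <linux/slab.h>
#include <linux/list.h>
#include <linux/errno.h>

struct foo {                          /* hypothetical cached object type */
    int id;
    struct list_head link;
};

static struct kmem_cache *foo_cache;

static int foo_cache_init(void)
{
    /* One cache per object type; SLAB_HWCACHE_ALIGN aligns objects
     * to hardware cache lines. */
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                  0, SLAB_HWCACHE_ALIGN, NULL);
    if (!foo_cache)
        return -ENOMEM;
    return 0;
}

static void foo_use(void)
{
    struct foo *f = kmem_cache_alloc(foo_cache, GFP_KERNEL);

    if (f) {
        f->id = 1;
        kmem_cache_free(foo_cache, f);   /* returns the object to the cache */
    }
}

static void foo_cache_exit(void)
{
    kmem_cache_destroy(foo_cache);       /* all objects must be freed first */
}

The cache is created once, objects are then allocated and freed cheaply from it, and the cache is destroyed only after every object has been returned.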
Overview and History
Definition and Purpose
Slab allocation is a memory management technique used in operating system kernels to efficiently handle the allocation and deallocation of fixed-size kernel objects. It operates by pre-allocating contiguous blocks of memory known as slabs, which are subdivided into fixed-size chunks tailored to specific object types, thereby enabling the reuse of these objects without the overhead of repeated memory searches or reinitialization.[1] The primary purpose of slab allocation is to minimize the time and space costs associated with frequent object creation and destruction, particularly for complex kernel structures that require initialization of embedded components such as locks or reference counts. By implementing object caching, it retains the state of allocated objects between uses, allowing for rapid servicing of allocation requests from a pre-constructed pool rather than invoking costly dynamic memory operations each time. This approach significantly enhances performance; for instance, the allocation time for stream head objects in SunOS 5.4 was reduced from 33 µs to 5.7 µs on a SPARCstation-2.[1]
In its high-level workflow, slab allocation maintains dedicated caches for each object type. When a request for an object arrives, it is served directly from the corresponding cache if available; otherwise, a new slab is allocated from the kernel's memory pool, populated with the required objects, and added to the cache. Deallocation simply returns the object to the cache without destruction, preserving its initialized state for future reuse, while reclamation mechanisms destroy and free slabs only under memory pressure.[1] This method is particularly suited to kernel environments with recurring allocations of similar objects, such as inodes for file system management or task structures for process handling, where grouping allocations by type optimizes both spatial locality and initialization efficiency.[1]
Historical Development
Slab allocation was introduced by Jeff Bonwick at Sun Microsystems in 1994 as part of the kernel memory allocator for SunOS 5.4, which corresponds to Solaris 2.4.[6] This design replaced the previous System V Release 4 (SVr4)-based allocator, aiming to improve efficiency for kernel object management by using object-caching primitives that minimize initialization overhead and fragmentation.[1] The allocator addressed key limitations of buddy allocators, which were common in Unix kernels of the era, such as external fragmentation and the high cost of object construction and destruction for frequently allocated kernel structures.[6]
Its influence spread to other systems in the late 1990s; for instance, FreeBSD incorporated a zone-based allocator in version 3.0 (released in 1998), which was later enhanced to provide slab-like functionality in version 5.0 (2003).[7] In the Linux kernel, the SLAB implementation—directly inspired by Bonwick's work—was integrated during the 2.1 development series and shipped in the stable Linux 2.2 kernel in 1999, enhancing performance over the prior K&R-style allocator.[8] Subsequent evolutions in Linux included the SLOB variant, a lightweight implementation suited for embedded systems with constrained memory, introduced in 2005.[8][9] To address scalability issues on multi-processor systems, the SLUB allocator was developed as a simpler, unqueued alternative and became the default in Linux 2.6.23 in October 2007, offering better performance by reducing locking overhead and using per-CPU caching.[10] SLUB also incorporated support for huge pages to optimize memory usage in large-scale environments.
As of 2025, SLUB remains the primary slab allocator in the Linux kernel, with the original SLAB variant fully removed in kernel version 6.8 (early 2024) and SLOB removed in kernel version 6.4 (mid-2023).[5] Ongoing optimizations focus on multi-core scalability and integration with modern hardware features like huge pages, but no fundamental overhauls have occurred since SLUB's adoption.[5]
Motivations and Problems Addressed
Memory Fragmentation in Traditional Allocators
Traditional kernel memory allocators, such as the buddy system commonly used in Unix-like operating systems, suffer from external fragmentation, where free memory becomes scattered into small, non-contiguous blocks over time due to repeated allocations and deallocations. This scattering prevents the satisfaction of requests for large contiguous regions, even when the total free memory is sufficient, as the buddy algorithm merges only adjacent blocks of equal size (powers of two) but fails to coalesce disparate fragments efficiently under heavy workloads.[11] In kernel environments, this issue is exacerbated by long-lived allocations for structures like page tables or I/O buffers, leading to allocation failures and the need for costly memory compaction or reclamation processes.[12]
Internal fragmentation in these allocators arises from the allocation of fixed-size blocks that exceed the requested size, resulting in wasted space within each allocated unit. The buddy system's power-of-two block sizes mean that a request slightly larger than half a block size receives the next larger power-of-two block, potentially wasting up to 50% of the space in the worst case, with an expected waste of around 25-28% across allocations.[13][14] For small kernel objects, such as process descriptors or inodes typically under 1 KB, traditional allocators often rounded up to full page sizes (e.g., 4 KB), amplifying internal waste to as much as 60% per allocation and contributing to overall memory inefficiency.[11]
In high-load kernel scenarios, like those in early SunOS and BSD systems, these fragmentation types combined to waste 40-50% of available memory, as observed in benchmarks such as the kenbus workload, where traditional power-of-two allocators in SVr4 exhibited 46% fragmentation.[1] Frequent small, fixed-size allocations for kernel data structures, such as task control blocks or semaphore objects, intensified the problem by creating numerous partially used blocks, increasing allocation failure rates and overhead from frequent searches through free lists.[1] Compared to user-space general-purpose allocators like malloc, which handle variable-sized requests with techniques like binning to mitigate fragmentation, kernel allocators face heightened challenges due to the predictable yet frequent demand for fixed-size objects in a real-time, non-garbage-collected environment, making fragmentation more detrimental to system stability and performance.[11]
Overhead of Object Initialization
In traditional kernel memory allocators, such as those based on sequential-fit or buddy systems, the initialization of each newly allocated object incurs significant computational overhead. This process typically involves zeroing out the allocated memory to ensure security and prevent data leakage, initializing object-specific fields like locks, pointers, and counters, and linking the object into relevant kernel data structures such as lists or hash tables. These steps can consume thousands of CPU cycles per object, often exceeding the costs of the underlying memory allocation itself, as measured in pre-slab SunOS implementations where object setup dominated performance profiles.[1]
Deallocation introduces comparable overhead through a reverse set of operations, including unlinking the object from data structures, resetting fields to a safe state, and, in buddy-based systems, attempting to coalesce freed blocks with adjacent buddies to combat fragmentation. This coalescing step requires searching for matching free blocks and merging them, which adds variable latency depending on the memory layout and can lead to spikes during allocation bursts when multiple objects are freed concurrently. While fragmentation represents a related issue of spatial inefficiency, the runtime costs of these initialization and deallocation routines primarily manifest as temporal delays in object lifecycle management.[1]
A concrete example of this overhead arises in process creation operations, such as the fork() system call in Unix-like kernels, where structures like the task_struct in Linux must be allocated and fully initialized for each new task. This includes copying parent process state, setting up scheduling parameters, and initializing security contexts, which collectively bottleneck system performance under high load, as repeated init/deinit cycles amplify delays in multi-process environments. Kernel objects such as inodes for file system operations or semaphores for synchronization are often short-lived yet requested at high frequency, leading to cumulative overhead that degrades overall throughput in demanding workloads.[1]
Prior to the 1990s, early kernel designs lacked dedicated object pools or caching mechanisms, relying instead on ad-hoc general-purpose allocators like simple malloc/free implementations integrated with buddy or sequential-fit heuristics. These approaches exacerbated initialization costs in multi-user systems, where concurrent allocations of complex objects—such as streams or tasks—resulted in unacceptable latency, prompting the development of more efficient strategies to mitigate such inefficiencies.[1]
Core Concepts
Caches and Slabs
In slab allocation, a cache serves as the primary organizational unit for managing memory objects of a specific type, such as kernel structures like task_struct or inodes. Each cache, often implemented as a kmem_cache structure in systems like Linux, oversees the lifecycle of multiple slabs dedicated to that object type, maintaining global statistics including total object usage, allocation counts, and configurable growth limits to control memory expansion. Caches provide a centralized interface for allocation and deallocation requests, ensuring that objects are pre-initialized and readily available to minimize runtime overhead. This design allows for type-specific optimizations, where clients create caches via primitives like kmem_cache_create, specifying parameters such as object size, alignment requirements, and optional constructor functions for initialization.[1][2]
A slab, in contrast, represents a contiguous block of memory pages—typically one or more 4 KB pages in common implementations—partitioned into fixed-size slots tailored to hold objects of a single cache's type. Each slab includes embedded metadata for tracking object availability, such as freelist pointers and usage counters, which enable efficient navigation without external data structures. For instance, a slab for 400-byte objects uses a single 4 KB page to store 10 such objects, resulting in approximately 2.4% internal fragmentation. Slabs are sourced dynamically from a backing memory allocator, such as the buddy system, when a cache requires additional capacity, ensuring they align with the system's page granularity for minimal waste.[1][2]
The relationship between caches and slabs is hierarchical and list-based: a cache maintains separate queues of slabs categorized by their state—full (no free objects), partial (some free objects), and empty (all objects free)—with allocation requests preferentially serviced from partial slabs to maximize reuse and reduce fragmentation. This organization allows caches to grow or shrink by adding or removing slabs as demand fluctuates, while empty slabs can be returned to the backing allocator for reclamation. Object states within slabs, such as allocated or free, are managed internally but contribute to the slab's overall categorization in the cache.[1][2]
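The arithmetic behind the 400-byte example above can be reproduced with a short, standalone C program; it is a simplification that ignores per-slab metadata and coloring, which is why it prints roughly 2.3% rather than the quoted approximate 2.4%:

#include <stdio.h>

int main(void)
{
    unsigned int page = 4096, object = 400;
    unsigned int per_slab = page / object;            /* 10 objects */
    unsigned int wasted   = page - per_slab * object; /* 96 bytes   */

    printf("objects per slab: %u\n", per_slab);
    printf("internal fragmentation: %.1f%%\n",
           100.0 * wasted / page);                    /* about 2.3%  */
    return 0;
}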
Object Allocation States
In the slab allocation mechanism, objects within a slab progress through distinct lifecycle states that facilitate efficient reuse and minimize initialization overhead. The primary states are unused (free slots available for allocation), allocated (currently in use by the kernel), and partially initialized (pre-configured with common default values at the slab level). Unused objects reside on a free list within their slab, ready for immediate assignment without full reinitialization, while allocated objects are actively referenced by kernel components. The partially initialized state applies to objects that have undergone batch initialization via a constructor function during slab creation, setting shared attributes like zeroed fields or standard metadata to enable rapid deployment.[2]
Transitions between these states are designed for speed and simplicity. Upon allocation, an unused or partially initialized object is transitioned to the allocated state with minimal additional setup, such as linking it to the requesting kernel subsystem, avoiding per-allocation constructors. Deallocation reverses this by marking the object as unused, returning it to the free list without invoking a full destructor to preserve its partially initialized form for future use. This approach contrasts with traditional allocators by retaining object state across cycles, reducing latency from costly zeroing or custom setup operations.[15]
The benefits of these states stem from the use of optional constructor (ctor) and destructor (dtor) hooks, which are invoked only during slab growth or shrinkage rather than on individual allocations. The ctor pre-initializes all objects in a new slab with common defaults, such as clearing sensitive fields, while the dtor performs cleanup only when a slab is fully emptied and returned to the page allocator, ensuring resources like locks are released. This selective invocation cuts initialization costs significantly—for instance, in the original implementation, it reduced stream head allocation time from 33 μs to 5.7 μs by caching initialized states. By maintaining partially initialized objects, the system avoids redundant zeroing for frequently allocated kernel structures, enhancing throughput in high-demand scenarios.[2][15]
Tracking these states occurs at the object slot level within the slab structure. For small objects, a flag or embedded pointer (such as a freelist link) indicates the state, utilizing unused padding space to avoid metadata overhead; larger objects may use a separate buffer control array to map indices to free or allocated status. This lightweight mechanism ensures constant-time state queries and updates, integrating seamlessly with the enclosing cache and slab descriptors.[2]
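A hedged, Linux-flavoured sketch of such a constructor hook follows; struct conn and its fields are invented for the illustration, while kmem_cache_create() and the ctor signature are the interface described earlier:

#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/errno.h>

struct conn {                          /* hypothetical cached object */
    spinlock_t lock;
    struct list_head queue;
    int state;
};

/* Runs once per object when a new slab is populated, not on every
 * allocation; per-allocation setup is limited to fields that really
 * change per use. */
static void conn_ctor(void *obj)
{
    struct conn *c = obj;

    spin_lock_init(&c->lock);
    INIT_LIST_HEAD(&c->queue);
    c->state = 0;
}

static struct kmem_cache *conn_cache;

static int conn_cache_init(void)
{
    conn_cache = kmem_cache_create("conn_cache", sizeof(struct conn),
                                   0, 0, conn_ctor);
    return conn_cache ? 0 : -ENOMEM;
}

Because the constructor runs only when a slab is populated, a freed object returned to the cache is expected to still be in its constructed state when it is handed out again.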
Implementation Details
Slab Creation and Management
Slab creation in the slab allocator is triggered when a cache requires additional objects and has no partial or empty slabs with free space. In this scenario, the allocator invokes a growth function, such as kmem_cache_grow() in implementations like Linux, to request pages from the underlying page allocator (e.g., the buddy system). These pages are then divided into fixed-size objects matching the cache's specifications, with metadata structures initialized to track object locations and states. If a constructor function is provided during cache setup, it is executed for each object to perform initialization, ensuring objects are in a ready state without per-allocation overhead.[2][1]
Management of slabs within a cache involves maintaining lists to categorize them by utilization: full, partial, and free. Policies enforce limits on the total number of slabs per cache to avoid unbounded memory consumption, often by capping the number of objects or slabs based on system constraints. Shrinking occurs under memory pressure, where empty (free) slabs are identified and their pages returned to the page allocator via functions like kmem_cache_shrink(), reclaiming contiguous memory blocks. Growth strategies typically begin with a minimum number of slabs upon cache creation and expand incrementally on demand, such as when the cache's partial slabs are depleted, to balance responsiveness and resource use. Post-creation, objects within new slabs start in a free state, ready for allocation.[2][16]
Error handling during slab creation addresses failures in page allocation, often due to out-of-memory (OOM) conditions. If the page allocator cannot fulfill the request, the operation fails gracefully, with the kernel logging warnings via mechanisms like printk() for debugging OOM killers or resource exhaustion. In such cases, higher-level allocators like kmalloc() may fall back to alternative strategies, such as using larger general-purpose pools, to ensure system stability. For example, in the Linux kernel, the kmem_cache_create() function establishes cache parameters including size, alignment, and constructor, while object allocation routines first attempt to retrieve from partial slabs via get_partial() and, if unsuccessful, trigger new_slab() to create a new one.[2][16]
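The growth path can be sketched in user-space C as follows; every name here is illustrative, malloc stands in for the page allocator, and real implementations add locking, per-node lists, and alignment handling (object_size is assumed to be at least sizeof(void *) so the embedded free-list pointer fits):

#include <stdlib.h>
#include <stddef.h>

#define SLAB_BYTES 4096u

struct slab {
    struct slab *next;
    void *freelist;               /* embedded next-pointers, as described above */
    unsigned int in_use;
};

struct cache {
    size_t object_size;
    void (*ctor)(void *);
    struct slab *free_slabs;      /* simplistic single list of new slabs */
};

/* Grow a cache by one slab: obtain memory, carve it into objects,
 * run the constructor once per object, and push every object onto
 * the slab's free list.  (A real slab would also record its base
 * address and other metadata.) */
static int cache_grow(struct cache *c)
{
    struct slab *s = malloc(sizeof(*s));          /* stand-in for page alloc */
    char *mem = malloc(SLAB_BYTES);

    if (!s || !mem) {
        free(s);
        free(mem);
        return -1;                                /* graceful OOM failure */
    }

    s->freelist = NULL;
    s->in_use = 0;
    for (size_t off = 0; off + c->object_size <= SLAB_BYTES; off += c->object_size) {
        void *obj = mem + off;
        if (c->ctor)
            c->ctor(obj);                         /* construct once, reuse many times */
        *(void **)obj = s->freelist;              /* push onto embedded free list */
        s->freelist = obj;
    }
    s->next = c->free_slabs;                      /* hand the new slab to the cache */
    c->free_slabs = s;
    return 0;
}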
Allocation and Deallocation Process
In slab allocation, the process of requesting an individual object from a cache begins by searching the free list of partial slabs within the cache. In implementations with per-CPU caches, such as the Linux kernel, available objects are first checked in the local per-CPU cache to minimize contention in multiprocessor environments. Should partial slabs also lack free objects, a new slab is created and populated with initialized objects, as detailed in slab creation procedures. The selected object is then marked as allocated, and its pointer is returned to the requester.[1][2]
Deallocation reverses this flow by first validating the object's address and cache affiliation to prevent errors. In some implementations, if a destructor function is registered for the object type, it is invoked to clean up resources before freeing. The object is subsequently added to the appropriate free list, typically the per-CPU freelist for quick reuse or the partial slab's list otherwise. If the slab becomes entirely empty after deallocation, it is marked for potential reclamation under memory pressure, though it may remain in the cache's free slab list for future allocations.[2]
To manage concurrency, slab allocators employ cache-wide locks, such as spinlocks, when accessing shared free lists in partial or full slabs. These locks are minimized in per-CPU designs, where local caches allow lock-free operations for most allocations and deallocations on the same processor.[1][2]
The core algorithms can be outlined in pseudocode as follows.
Allocation pseudocode:
kmem_cache_alloc(struct kmem_cache *cache, int flags) {
    if (free_object = pop_from_percpu_freelist(cache)) {
        return free_object;           /* fast path: this CPU's local freelist */
    } else if (free_object = pop_from_partial_slab_freelist(cache)) {
        return free_object;           /* next: a partial slab in the cache */
    } else {
        create_new_slab(cache);       /* no free objects anywhere: grow the cache */
        free_object = pop_from_new_slab_freelist(cache);
        return free_object;
    }
}
Deallocation pseudocode:
kmem_cache_free(struct kmem_cache *cache, void *object) {
    validate_object(cache, object);
    if (destructor) destructor(object);        /* optional type-specific cleanup */
    obj_index = compute_index_in_slab(object);
    push_to_freelist(cache, obj_index);        /* to per-CPU or partial slab */
    if (slab_is_empty(cache, slab_of(object))) {
        mark_slab_for_reclamation(slab_of(object));
    }
}
Advanced Techniques
Slab Coloring
Slab coloring is a technique used in the slab allocator to optimize processor cache performance by distributing object addresses evenly across cache lines, thereby reducing false sharing and cache conflicts among concurrently accessed objects. In this approach, each slab is assigned a unique offset, or "color," from its page-aligned base address, which shifts the starting positions of objects within the slab. This prevents multiple unrelated objects from mapping to the same cache line, which could otherwise lead to thrashing and increased miss rates in multi-threaded environments like operating system kernels. The method addresses the limitations of traditional power-of-two allocators, which often align objects poorly with cache geometries, leading to suboptimal utilization.[6]
Implementation involves calculating the color during slab creation, where the offset is chosen from a sequence of values that fit within the unused space of the slab page. For instance, with 200-byte objects allocated from 4KB pages on systems with 8-byte cache line granularity, colors range from 0 to 64 bytes in 8-byte increments, ensuring that successive slabs use different offsets to spread objects across cache lines. The maximum number of colors is limited by the slab's free space after accounting for object sizes and metadata, and the allocator cycles through these colors for new slabs in a cache. This padding introduces minimal overhead since colors exploit the natural slack in page-sized slabs.[6]
The benefits of slab coloring include significant improvements in cache hit rates and overall system throughput, particularly on multiprocessor systems. On the SPARCcenter 2000, it reduced primary cache miss rates by 13% and improved bus balance from 43% to 17% imbalance, while benchmarks showed 5% fewer primary cache misses during parallel kernel builds. These gains stem from better cache line utilization and reduced memory traffic concentration, making it especially effective for kernel workloads with high contention. Slab coloring was introduced in the original SunOS 5.4 slab allocator design, tailored for SPARC processors to enhance multiprocessor scalability.[6]
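A small C sketch of the color cycling described above (names and fields are invented; a real allocator derives max_color from the slab's leftover space and its alignment from the CPU's cache-line size):

struct cache_color {
    unsigned int align;        /* cache-line granularity, e.g. 64 bytes    */
    unsigned int max_color;    /* leftover space in the slab after objects */
    unsigned int next_color;   /* color to use for the next slab           */
};

/* Return the offset for the next slab and advance the color cycle. */
static unsigned int next_slab_offset(struct cache_color *cc)
{
    unsigned int color = cc->next_color;

    cc->next_color += cc->align;
    if (cc->next_color > cc->max_color)
        cc->next_color = 0;    /* wrap around and reuse colors */
    return color;
}

Each newly created slab then places its first object at base + color, so consecutive slabs stagger their objects across different cache lines.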
Per-CPU Caches
In multi-processor systems, per-CPU caches in slab allocators address scalability issues arising from global lock contention during allocation and deallocation operations. Each CPU maintains a small, local cache consisting of partial slabs or free objects, which is replenished from a global depot or cache only when depleted. This design ensures that most memory operations occur locally without acquiring shared locks, thereby minimizing inter-CPU communication and cache line bouncing.[17]
The mechanism employs a layered approach where the per-CPU cache acts as a fast-access buffer, often implemented as a magazine—a fixed-size array or list of object pointers per CPU—stocked with pre-allocated items from larger slabs. For allocation, the requesting CPU first attempts to pop an object from its local magazine; if empty, it swaps with a secondary local buffer or fetches a full magazine from the global depot using atomic operations like compare-and-swap (cmpxchg) to avoid locks. Deallocation pushes objects back to the local magazine, with overflow items returned to the depot only on imbalance, such as when a CPU's cache exceeds capacity or detects uneven distribution across processors. Object migration between CPUs is rare and triggered solely by such imbalances, preserving locality.[17][18]
This approach yields significant performance benefits, reducing the time spent holding global locks from microseconds to nanoseconds per operation and enabling linear scalability with the number of cores, as demonstrated in benchmarks showing doubled throughput on multi-CPU systems under high load. By localizing access, it also lowers remote memory latency and cache miss rates, with empirical results indicating miss rates bounded by the inverse of the magazine size.[17][10]
Tuning of per-CPU caches balances efficiency against memory overhead, typically limiting the cache to 1-2 partial slabs or a small number of objects (e.g., 6-120 depending on object size) per CPU to prevent excessive remote accesses or wasted space on idle processors. Dynamic adjustment of cache sizes, based on contention metrics, further optimizes usage without manual intervention.[18][17]
In the Linux SLUB allocator, per-CPU caches are realized through partial lists embedded in the kmem_cache_cpu structure, where each CPU holds a freelist of objects from an active slab for lockless fast-path operations on supported architectures. When local partial lists overflow or deplete, objects are migrated remotely via atomic exchanges, integrating seamlessly with the broader allocation process.[18][10]
As of October 2025, the Linux kernel (version 6.18) introduced Sheaves, an opt-in per-CPU array-based caching layer for the SLUB allocator that replaces traditional CPU partial slabs. This enhancement aims to reduce overhead in per-CPU operations and improve overall performance scalability.[19]
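The magazine idea can be illustrated with the following user-space C sketch; the names, the magazine size, and the unlocked shared depot are simplifications invented for the example rather than the kernel's actual data structures:

#define MAG_SIZE 16                 /* illustrative magazine capacity */

/* Each CPU owns a small "magazine" of object pointers; only when it
 * runs empty or full does the code fall back to a shared depot, which
 * a real kernel would protect with a lock or atomic exchange. */
struct magazine {
    void *objs[MAG_SIZE];
    int count;
};

static struct magazine depot;       /* shared slow-path pool (locking elided) */

static void *cpu_cache_alloc(struct magazine *cpu_mag)
{
    if (cpu_mag->count == 0) {
        /* Slow path: refill the local magazine from the shared depot. */
        while (cpu_mag->count < MAG_SIZE && depot.count > 0)
            cpu_mag->objs[cpu_mag->count++] = depot.objs[--depot.count];
        if (cpu_mag->count == 0)
            return NULL;            /* depot empty: caller would grow a slab */
    }
    return cpu_mag->objs[--cpu_mag->count];   /* fast path: no shared state */
}

static void cpu_cache_free(struct magazine *cpu_mag, void *obj)
{
    if (cpu_mag->count == MAG_SIZE) {
        /* Slow path: spill half of the local magazine back to the depot. */
        while (cpu_mag->count > MAG_SIZE / 2 && depot.count < MAG_SIZE)
            depot.objs[depot.count++] = cpu_mag->objs[--cpu_mag->count];
        if (cpu_mag->count == MAG_SIZE)
            return;                 /* depot also full: a real allocator frees to the slab */
    }
    cpu_mag->objs[cpu_mag->count++] = obj;
}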
Free List Management
In slab allocators, free list management involves tracking and linking unallocated objects within slabs to enable rapid reuse during allocation requests. Each slab maintains its own freelist, which serves as a linked list of available objects, ensuring that allocations and deallocations operate in constant time by simply adjusting pointers and reference counts. This approach preserves object initialization and reduces fragmentation by keeping related objects grouped and ready for immediate use.[1]
For small objects—typically those smaller than half the slab size—the freelist pointer is embedded directly within the object itself, often at the end of the buffer, pointing to the next free object in the list. This embedded strategy minimizes metadata overhead by repurposing space in the object for linkage, such as using a kmem_bufctl structure that includes the freelist pointer and buffer address. In contrast, larger objects employ a separate data structure, such as an array of indices (kmem_bufctl_t in Linux SLAB implementations) or a bitmap, stored either on-slab (within the slab's initial space) or off-slab (in a dedicated cache) to track free object positions without invading object space. For instance, in the original slab design, slabs under 1/8 page use embedded kmem_bufctl for efficiency, while the Linux SLAB allocator uses an on-slab kmem_bufctl_t array for objects under 512 bytes, initializing it as a pseudo-linked list with sequential indices ending in a sentinel value like BUFCTL_END.[1][2]
Common strategies for freelist organization include LIFO (last-in, first-out), where objects are added and removed from the head of the list for simplicity and cache locality, though FIFO (first-in, first-out) variants exist in some adaptations. In the SLUB allocator, freelists operate at the page level, with metadata embedded in the struct page (using fields like freelist for the head pointer, inuse for allocated count, and offset for pointer placement) and pointers chained through free objects themselves, enabling lockless per-CPU access. Maintenance during operations is straightforward: on deallocation, the object is pushed to the front of the freelist by updating the previous head to point to it, and the reference count is decremented; on allocation, the head is popped, metadata cleared (e.g., zeroing the embedded pointer), and the count incremented, with the slab transitioned to full or empty lists as needed. The original design exemplifies this with each slab holding a freelist head in its kmem_slab structure, while Linux SLAB updates the slab_t->free index to traverse the kmem_bufctl_t array.[1][2][10]
Optimizations focus on reducing traversal costs and lock contention, such as batching multiple allocations or deallocations to process freelists in bulk before updating cache-wide structures, and efficiently handling transitions between full, partial, and empty slab states via sorted lists or reference counts. For example, SLUB batches freelist transfers to per-CPU structures to avoid frequent locking of the struct page, while the original allocator uses simple pointer swaps for constant-time pushes and pops, reclaiming fully free slabs only under memory pressure. These techniques ensure scalable performance, with Linux SLAB's array-based indexing allowing O(1) free object lookup regardless of slab fullness.[1][2][10]
