from Wikipedia

cgroups
Original authors: v1: Paul Menage, Rohit Seth; memory controller by Balbir Singh; CPU controller by Srivatsa Vaddagiri. v2: Tejun Heo
Developers: Tejun Heo, Johannes Weiner, Michal Hocko, Waiman Long, Roman Gushchin, Chris Down et al.
Initial release: 2007
Written in: C
Operating system: Linux
Type: System software
License: GPL and LGPL
Website: Cgroup v1, Cgroup v2

cgroups (abbreviated from control groups) is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, etc.)[1]: § Controllers  of a collection of processes.

Engineers at Google started the work on this feature in 2006 under the name "process containers".[2] In late 2007, the nomenclature changed to "control groups" to avoid confusion caused by multiple meanings of the term "container" in the Linux kernel context, and the control groups functionality was merged into the Linux kernel mainline in kernel version 2.6.24, which was released in January 2008.[3] Since then, developers have added controllers for the kernel's own memory allocation,[4] netfilter firewalling,[5] the OOM killer,[6] and many other parts.

A major change in the history of cgroups is cgroup v2, which removes the ability to use multiple process hierarchies and to discriminate between threads as found in the original cgroup (now called "v1").[1]: § Issues with v1 and Rationales for v2  Work on the single, unified hierarchy started with the repurposing of v1's dummy hierarchy as a place for holding all controllers not yet used by others in 2014.[7] cgroup v2 was merged in Linux kernel 4.5 (2016).[8]

Versions


There are two versions of cgroups. They can co-exist in a system.

  • The original version of cgroups was written by Paul Menage and Rohit Seth. It was merged into the mainline Linux kernel in version 2.6.24 (released January 2008). Development and maintenance of cgroups was then taken over by Tejun Heo, who instituted major redesigns without breaking the interface (see § Redesigns of v1). It was renamed "Control Group version 1" (cgroup-v1) after cgroups-v2 appeared in Linux 4.5.[9]
  • Tejun Heo found that further redesign of v1 could not proceed without breaking the interface. As a result, he added a separate, new system called "Control Group version 2" (cgroup-v2). Unlike v1, cgroup v2 has only a single process hierarchy (because a controller can only be assigned to one hierarchy, processes in separate hierarchies cannot be managed by the same controller; this change sidesteps the issue). It also removes the ability to discriminate between threads, choosing to work at the granularity of processes instead (disabling an "abuse" of the system which led to convoluted APIs).[1]: § Issues with v1 and Rationales for v2  The unified hierarchy first appeared in Linux kernel 4.5, released on 14 March 2016.[8]

Features


One of the design goals of cgroups is to provide a unified interface to many different use cases, from controlling single processes (by using nice, for example) to full operating system-level virtualization (as provided by OpenVZ, Linux-VServer or LXC, for example). Cgroups provides:

Resource limiting
groups can be set not to exceed configured limits on memory (including the file system cache),[10][11] I/O bandwidth,[12] CPU quota,[13] CPU sets,[14] or the maximum number of open files[15]
Prioritization
some groups may get a larger share of CPU utilization[16] or disk I/O throughput[17]
Accounting
measures a group's resource usage, which may be used, for example, for billing purposes[18]
Control
freezing groups of processes, their checkpointing and restarting[18]

Use

As an example of indirect usage, systemd assumes exclusive access to the cgroups facility.

A control group (abbreviated as cgroup) is a collection of processes that are bound by the same criteria and associated with a set of parameters or limits. These groups can be hierarchical, meaning that each group inherits limits from its parent group. The kernel provides access to multiple controllers (also called subsystems) through the cgroup interface;[3] for example, the "memory" controller limits memory use, "cpuacct" accounts CPU usage, etc.

Control groups can be used in multiple ways:

  • By accessing the cgroup virtual file system manually.
  • By creating and managing groups on the fly using tools like cgcreate, cgexec, and cgclassify (from libcgroup).
  • Through the "rules engine daemon" that can automatically move processes of certain users, groups, or commands to cgroups as specified in its configuration.
  • Indirectly through other software that uses cgroups, such as Docker, Firejail, LXC,[19] libvirt, systemd, Open Grid Scheduler/Grid Engine,[20] and Google's discontinued lmctfy.
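The first method above, manual access to the cgroup virtual file system, can be sketched in Python. This is a minimal sketch, not an official API: the v2 mount point /sys/fs/cgroup is the conventional one, the group name and helper names are our own, and the writes require root privileges on a real system.

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup"  # conventional cgroup v2 mount point (assumed)

def cgroup_path(name: str) -> str:
    """Path of a cgroup named `name` directly under the v2 root."""
    return os.path.join(CGROUP_ROOT, name)

def create_group(name: str) -> str:
    """Create a subgroup; the kernel populates its control files automatically."""
    path = cgroup_path(name)
    os.makedirs(path, exist_ok=True)  # mkdir is all it takes
    return path

def limit_memory(name: str, limit: str) -> None:
    """Write a hard memory limit (e.g. '512M') to the group's memory.max file."""
    with open(os.path.join(cgroup_path(name), "memory.max"), "w") as f:
        f.write(limit)

def add_process(name: str, pid: int) -> None:
    """Move a process into the group by writing its PID to cgroup.procs."""
    with open(os.path.join(cgroup_path(name), "cgroup.procs"), "w") as f:
        f.write(str(pid))

# Usage (requires root and a mounted v2 hierarchy):
#   create_group("example")
#   limit_memory("example", "512M")
#   add_process("example", os.getpid())
```

This mirrors the shell equivalents (mkdir, echo into memory.max and cgroup.procs) described in the Interfaces section below.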

The Linux kernel documentation contains some technical details of the setup and use of control groups version 1[21] and version 2.[1]

Interfaces


Both versions of cgroup act through a pseudo-filesystem (cgroup for v1 and cgroup2 for v2). Like all filesystems, they can be mounted on any path, but the general convention is to mount one of the versions (generally v2) on /sys/fs/cgroup, under the default sysfs location of /sys. As mentioned before, the two cgroup versions can be active at the same time; this also applies to the filesystems, so long as they are mounted on different paths.[21][1] The description below assumes a setup where the v2 hierarchy is mounted at /sys/fs/cgroup; the v1 hierarchy, if ever required, is mounted at a different location.
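Which version is mounted at a given path can be inferred from the files it exposes: only the v2 (cgroup2) filesystem has a cgroup.controllers file at its root, while v1 hierarchies expose a tasks file. A minimal sketch, with a helper name of our own choosing:

```python
def cgroup_version(filenames) -> int:
    """Guess the cgroup version of a mounted hierarchy from its root file names."""
    names = set(filenames)
    if "cgroup.controllers" in names:  # only the v2 filesystem exposes this file
        return 2
    if "tasks" in names:               # the per-thread 'tasks' file is v1-only
        return 1
    raise ValueError("not a recognizable cgroup mount")

# Usage on a live system: cgroup_version(os.listdir("/sys/fs/cgroup"))
```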

At initialization cgroup2 should have no defined control groups except the top-level one. In other words, /sys/fs/cgroup should have no directories, only a number of files that control the system as a whole. At this point, running ls /sys/fs/cgroup could list the following on one example system:

  • cgroup.controllers
  • cgroup.max.depth
  • cgroup.max.descendants
  • cgroup.pressure
  • cgroup.procs
  • cgroup.stat
  • cgroup.subtree_control
  • cgroup.threads
  • cpu.pressure
  • cpuset.cpus.effective
  • cpuset.cpus.isolated
  • cpuset.mems.effective
  • cpu.stat
  • cpu.stat.local
  • io.cost.model
  • io.cost.qos
  • io.pressure
  • io.prio.class
  • io.stat
  • irq.pressure
  • memory.numa_stat
  • memory.pressure
  • memory.reclaim
  • memory.stat
  • memory.zswap.writeback
  • misc.capacity
  • misc.current
  • misc.peak

These files are named according to the controllers that handle them: cgroup.* files deal with the cgroup system itself, while memory.* files deal with the memory subsystem, for example. To request that the kernel reclaim 1 gigabyte of memory from anywhere in the system, one can run echo "1G swappiness=50" > /sys/fs/cgroup/memory.reclaim.[1]

To create a subgroup, one simply creates a new directory under an existing group (including the top-level one). The files corresponding to available controls for this group are automatically created.[1] For example, running mkdir /sys/fs/cgroup/example; ls /sys/fs/cgroup/example would produce a list of files largely similar to the one above, but with noticeable changes. On one example system, these files are added:

  • cgroup.events
  • cgroup.freeze
  • cgroup.kill
  • cgroup.type
  • cpu.idle
  • cpu.max
  • cpu.max.burst
  • cpu.pressure
  • cpu.uclamp.max
  • cpu.uclamp.min
  • cpu.weight
  • cpu.weight.nice
  • memory.current
  • memory.events
  • memory.events.local
  • memory.high
  • memory.low
  • memory.max
  • memory.min
  • memory.oom.group
  • memory.peak
  • memory.swap.current
  • memory.swap.events
  • memory.swap.high
  • memory.swap.max
  • memory.swap.peak
  • memory.zswap.current
  • memory.zswap.max
  • pids.current
  • pids.events
  • pids.events.local
  • pids.max
  • pids.peak

These changes are not unexpected because some controls and statistics only make sense on a subset of processes (e.g. nice level being the CPU priority of processes relative to the rest of the system).[1]

Processes are assigned to subgroups by writing their PID to the target group's cgroup.procs file. The cgroup a process belongs to can be found by reading /proc/<PID>/cgroup.[1]
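The content of /proc/<PID>/cgroup is line-oriented: each line has the form hierarchy-ID:controller-list:cgroup-path, and under a pure v2 setup there is a single line such as 0::/user.slice. A parsing sketch (the helper name is ours):

```python
def parse_proc_cgroup(text: str) -> list[tuple[int, list[str], str]]:
    """Parse /proc/<PID>/cgroup content into (hierarchy id, controllers, path) tuples."""
    entries = []
    for line in text.strip().splitlines():
        # Split on the first two colons only; the cgroup path may contain colons.
        hier, controllers, path = line.split(":", 2)
        entries.append((int(hier), controllers.split(",") if controllers else [], path))
    return entries

# Usage: parse_proc_cgroup(open("/proc/self/cgroup").read())
```

For a v1 line such as "4:memory:/foo" the middle field names the controllers bound to that hierarchy; for the v2 line it is empty, since all controllers share the unified hierarchy.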

On systems based on systemd, a hierarchy of subgroups is predefined to encapsulate every process directly and indirectly launched by systemd under a subgroup: the very basis of how systemd manages processes. An explanation of the nomenclature of these groups can be found in the Red Hat Enterprise Linux 7 manual.[22] Red Hat also provides a guide on creating a systemd service file that causes a process to run in a separate cgroup.[23]

The systemd-cgtop[24] command can be used to show the top control groups ordered by their resource usage.

V1 coexistence


On a system with v2, v1 can still be mounted and given access to controllers not in use by v2. However, a modern system typically already places all controllers in use in v2, so there is no controller available for v1 at all even if a hierarchy is created. It is possible to clear all uses of a controller from v2 and hand it to v1, but moving controllers between hierarchies after the system is up and running is cumbersome and not recommended.[1]

Major evolutions


Redesigns of v1


Redesign of cgroups started in 2013,[25] with additional changes brought by versions 3.15 and 3.16 of the Linux kernel.[26][27][28]

The following changes concern the kernel before 4.5/4.6, i.e. before cgroups-v2 was added. In other words, they describe how cgroups-v1 had been changed, though most of these changes were also inherited by v2 (after all, v1 and v2 share the same codebase).

Namespace isolation


While not technically part of the cgroups work, a related feature of the Linux kernel is namespace isolation, where groups of processes are separated such that they cannot "see" resources in other groups. For example, a PID namespace provides a separate enumeration of process identifiers within each namespace. Also available are mount, user, UTS (Unix Time Sharing), network and SysV IPC namespaces.

  • The PID namespace provides isolation for the allocation of process identifiers (PIDs), lists of processes and their details. While the new namespace is isolated from other siblings, processes in its "parent" namespace still see all processes in child namespaces—albeit with different PID numbers.[29]
  • Network namespace isolates the network interface controllers (physical or virtual), iptables firewall rules, routing tables etc. Network namespaces can be connected with each other using the "veth" virtual Ethernet device.[30]
  • "UTS" namespace allows changing the hostname.
  • Mount namespace allows creating a different file system layout, or making certain mount points read-only.[31]
  • IPC namespace isolates the System V inter-process communication between namespaces.
  • User namespace isolates the user IDs between namespaces.[32]
  • Cgroup namespace[33]

Namespaces are created with the "unshare" command or syscall, or as "new" flags in a "clone" syscall.[34]
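The "new" flags mentioned above are plain bit flags defined in <linux/sched.h>. A sketch that combines them into the flags argument for unshare(2) or clone(2); the helper and the name-to-flag table are our own, with values copied from the kernel headers:

```python
# Namespace flag values from <linux/sched.h>.
CLONE_FLAGS = {
    "mnt":    0x00020000,  # CLONE_NEWNS (mount namespace)
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts":    0x04000000,  # CLONE_NEWUTS
    "ipc":    0x08000000,  # CLONE_NEWIPC
    "user":   0x10000000,  # CLONE_NEWUSER
    "pid":    0x20000000,  # CLONE_NEWPID
    "net":    0x40000000,  # CLONE_NEWNET
}

def unshare_flags(namespaces) -> int:
    """Combine namespace names into a flags value for unshare(2)/clone(2)."""
    flags = 0
    for ns in namespaces:
        flags |= CLONE_FLAGS[ns]
    return flags

# Usage (Python 3.12+, requires privileges for most namespaces):
#   os.unshare(unshare_flags(["uts", "net"]))
```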

The "ns" subsystem was added early in cgroups development to integrate namespaces and control groups. If the "ns" cgroup was mounted, each namespace would also create a new group in the cgroup hierarchy. This was an experiment that was later judged to be a poor fit for the cgroups API, and removed from the kernel.

Linux namespaces were inspired by the more general namespace functionality used heavily throughout Plan 9 from Bell Labs.[35]

Conversion to kernfs


Kernfs was introduced into the Linux kernel with version 3.14 in March 2014, the main author being Tejun Heo.[36] One of the main motivators for a separate kernfs was the cgroups file system. Kernfs was created by splitting off some of the sysfs logic into an independent entity, making it easier for other kernel subsystems to implement their own virtual file systems with handling for device connect and disconnect, dynamic creation and removal, and other attributes. This does not affect how cgroups is used, but makes maintaining the code easier.[37]

New features introduced during v1


Kernel memory control groups (kmemcg) were merged into version 3.8 of the Linux kernel mainline (released 18 February 2013).[38][39][4] The kmemcg controller can limit the amount of memory that the kernel can utilize to manage its own internal processes.

Support for per-group netfilter setup was added in 2014.[5]

The unified hierarchy was added in 2014. It repurposed v1's dummy hierarchy to hold all controllers not yet used by other hierarchies. This changed dummy hierarchy would become the only hierarchy available in v2.[7]

Changes after v2


Unlike v1, cgroup v2 has only a single process hierarchy and discriminates between processes, not threads.

cgroup awareness of OOM killer


Linux kernel 4.19 (October 2018) introduced a cgroup-aware OOM killer implementation, which adds the ability to kill a cgroup as a single unit and thus guarantee the integrity of the workload.[6]

Adoption


Various projects use cgroups as their basis, including CoreOS, Docker (in 2013), Hadoop, Jelastic, Kubernetes,[40] lmctfy (Let Me Contain That For You), LXC (Linux Containers), systemd, Mesos and Mesosphere,[40] HTCondor, and Flatpak.

Major Linux distributions also adopted cgroups; for example, Red Hat Enterprise Linux (RHEL) 6.0 shipped with cgroups support in November 2010.[41]

On 29 October 2019, the Fedora Project modified Fedora 31 to use cgroups v2 by default.[42]

from Grokipedia
Control groups, commonly abbreviated as cgroups, are a Linux kernel feature that enables the organization of processes (and their future children) into hierarchical groups for the purpose of limiting, accounting, and isolating resource usage such as CPU time, memory, disk I/O, and network bandwidth.[1] This mechanism aggregates sets of tasks into tree-structured hierarchies, where each group can be associated with specific subsystems (also known as controllers) that enforce resource controls and provide usage statistics.[2] Originally developed by Google engineers Paul Menage and Rohit Seth to address process containerization needs, cgroups were proposed in 2006 and merged into the mainline Linux kernel starting with version 2.6.24 in early 2008.[3] The feature evolved through two major versions: cgroup v1, which supports multiple independent hierarchies and per-thread granularity but suffers from interface inconsistencies, and cgroup v2, introduced in Linux 4.5 in 2016, which unifies into a single hierarchy for improved consistency, delegation, and resource management without legacy thread-level controls.[4][1] Key controllers in both versions include those for CPU scheduling, memory limits, and I/O prioritization, allowing fine-grained allocation via a virtual filesystem interface under /sys/fs/cgroup.[2] Cgroups form the foundational resource control layer for container technologies like Docker and Kubernetes, enabling efficient virtualization and workload isolation in modern computing environments.[5]

Overview

Definition and Purpose

Control groups, commonly known as cgroups, are a Linux kernel feature that organizes processes into hierarchical groups to limit, account for, and isolate the usage of system resources such as CPU time, memory, disk I/O, and network bandwidth for collections of tasks.[2][1] This subsystem aggregates sets of tasks and their future children into groups, associating them with specific parameters that define behavior for various resource controllers.[2] The primary purpose of cgroups is to enable precise resource allocation and management in environments requiring isolation, such as containers and virtualization technologies, by preventing any single process or user from monopolizing system resources.[2] They facilitate workload isolation in multi-tenant systems, where multiple applications or users share the same kernel, ensuring that resource demands from one group do not adversely affect others.[1] This capability supports broader containerization efforts by providing the foundational mechanisms for bounding and prioritizing resource consumption.[6] Key benefits of cgroups include enhanced system stability through enforced limits that mitigate denial-of-service risks from resource-intensive tasks, promotion of fair resource sharing among competing groups, and improved overall efficiency in resource utilization, particularly in server and cloud environments.[2] Initially motivated by the need for process containerization, cgroups were developed by Google engineers in 2006–2007 under the name "process containers" to underpin projects like Linux Containers (LXC), addressing the limitations of earlier resource management approaches in handling dynamic workloads.[6][7]

Historical Development

The development of control groups, commonly known as cgroups, originated in 2006 at Google, where engineers Paul Menage and Rohit Seth led the initial work under the name "process containers" to support resource isolation for container-like environments.[8] This effort addressed the need for fine-grained resource control in large-scale computing, building on existing kernel mechanisms like cpusets.[9] The project was renamed cgroups shortly thereafter and merged into the mainline Linux kernel as version 1 in the 2.6.24 release in early 2008, marking its availability for upstream adoption.[1] Early adoption of cgroups v1 focused on container technologies, with integration into Linux Containers (LXC) starting around 2008, where it combined with kernel namespaces to enable full OS-level virtualization.[10] By 2009, as additional controllers for resources like memory and I/O were added and refined, cgroups v1 achieved sufficient stability for production use in distributions and tools, paving the way for broader ecosystem support including later projects like Docker in 2013.[3] Paul Menage served as the primary maintainer during this formative period until 2011, when responsibilities transitioned to Tejun Heo, who oversaw subsequent redesigns and maintenance.[11] Key milestones included the experimental introduction of cgroups v2 in kernel 3.16 in 2014, featuring a unified hierarchy to address v1's limitations in scalability and consistency.[12] This version reached production readiness in kernel 4.5 in 2016, with default enablement options emerging in subsequent releases.[8] Refinements continued into 2025, enhancing features like delegation for unprivileged users in v2 hierarchies. 
Post-2020 updates bolstered the IO controller with improved weight-based throttling and cost modeling starting in kernel 5.1, while Pressure Stall Information (PSI)—initially added in 4.20—matured through better integration in container runtimes and orchestrators, enabling proactive resource pressure detection by 2024.[13][14]

Core Concepts

Hierarchy Structure

Control groups (cgroups) are organized in a hierarchical structure that forms the foundation for resource management in the Linux kernel. In cgroup version 1 (v1), the system supports multiple independent hierarchies, often described as a forest, where each hierarchy is a tree of cgroups dedicated to one or more controllers. Every process belongs to exactly one cgroup per hierarchy, and the root cgroup of each hierarchy initially contains all tasks on the system. Child cgroups inherit resource limits and accounting from their parents, ensuring that constraints propagate downward in the tree.[2]

In contrast, cgroup version 2 (v2) employs a single unified hierarchy, simplifying the organization into one tree where all controllers operate within the same structure. This unified approach ensures consistent views of processes across controllers, with the root cgroup at the top level exempt from direct resource control but serving as the parent for all others. Processes inherit their parent's cgroup membership upon creation via fork, and resource distributions follow a top-down model where a child cgroup can only allocate resources it has received from its parent.[4]

The hierarchies are exposed through a pseudo-filesystem mounted under /sys/fs/cgroup. For v1, the cgroup filesystem (cgroupfs) is mounted with options specifying controllers, such as mount -t cgroup -o cpuset,memory none /sys/fs/cgroup/cpuset. For v2, the cgroup2 filesystem is mounted as mount -t cgroup2 none /sys/fs/cgroup/unified, providing a single mount point for the unified hierarchy. The root cgroup resides at this mount point, with subdirectories representing child cgroups.[2][4]

A key feature of the hierarchy is delegation, which allows non-root users to manage sub-hierarchies without system-wide privileges.
In v1, delegation relies on file permissions, enabling users to create, modify, and move processes within permitted cgroups by writing to files like tasks, though containment is less strict. In v2, delegation is more robust: users gain control by setting ownership or permissions on files such as cgroup.procs, cgroup.threads, and cgroup.subtree_control, while the nsdelegate mount option enforces boundaries using cgroup namespaces to prevent unauthorized process migrations outside the delegated subtree. This option is set system-wide on mount from the init namespace, treating namespaces as delegation limits.[1][4] For illustration, consider a simple hierarchy tree in v2: the root cgroup (/sys/fs/cgroup) branches to a user-specific cgroup (e.g., /sys/fs/cgroup/user.slice), which further divides into process groups (e.g., /sys/fs/cgroup/user.slice/app1 and /sys/fs/cgroup/user.slice/app2). Processes launched under user.slice inherit limits from the root and user levels, allowing isolated resource management for applications without affecting the broader system.[4]
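In practice, delegating a v2 subtree amounts to changing ownership of the cgroup directory and of the interface files named above. A sketch under that assumption (the helper names are ours, and the chown itself requires privileges):

```python
import os

# Interface files handed to the delegatee under the v2 delegation model.
DELEGATE_FILES = ("cgroup.procs", "cgroup.threads", "cgroup.subtree_control")

def delegation_targets(cgroup_dir: str) -> list[str]:
    """Paths whose ownership must change to delegate a v2 subtree."""
    return [cgroup_dir] + [os.path.join(cgroup_dir, f) for f in DELEGATE_FILES]

def delegate(cgroup_dir: str, uid: int, gid: int) -> None:
    """Hand a cgroup directory and its delegation files to an unprivileged user."""
    for path in delegation_targets(cgroup_dir):
        os.chown(path, uid, gid)  # requires root (or CAP_CHOWN)

# Usage: delegate("/sys/fs/cgroup/user.slice/sandbox", 1000, 1000)
```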

Controllers and Resources

Control groups (cgroups) utilize controllers, also known as subsystems, to manage and limit specific types of system resources allocated to groups of processes. Each controller handles a distinct resource domain, such as CPU time or memory usage, and operates within the cgroup hierarchy to enforce policies like shares, limits, or protections. In cgroup version 2 (v2), controllers are integrated into a unified hierarchy, where they can be selectively enabled for subtrees via the cgroup.subtree_control file by appending names like "+cpu" or "+memory" to activate them for child cgroups.[4] The core controllers include the following, with their managed resources and purposes detailed below. This list reflects availability as of Linux kernel 6.17 (released in September 2025), encompassing both longstanding and newer additions. Recent additions include the dmem controller for device memory management, introduced in kernel 6.14 (March 2025).[4]
Controller | Managed resources | Description
cpu | CPU cycles and scheduling | Regulates the distribution of CPU time among cgroups using a weight-based shares model for proportional allocation and a quota-based bandwidth model for hard limits on usage periods. It supports integration with the completely fair scheduler (CFS) for fair CPU sharing.[15]
memory | RAM, swap, and kernel memory | Tracks and limits memory usage, including user-space allocations, kernel data structures, and TCP buffers, while providing protection levels to prioritize cgroups during pressure and out-of-memory (OOM) scenarios. Usage is accounted hierarchically to prevent double-counting.[16]
io | Block device I/O bandwidth and operations | Manages I/O resources on block devices through weight-based proportional sharing and absolute limits on bytes or I/O operations per second (IOPS), unifying the v1 blkio controller's functionality with improved hierarchical accounting. Available since the initial cgroup v2 release in Linux kernel 4.5 (2016).[17]
blkio | Block I/O (v1-specific) | In cgroup v1, controls block device I/O throughput and weights for proportional bandwidth allocation, serving as the predecessor to the v2 io controller; it supports per-device rules but lacks v2's unified hierarchy.
devices | Device file access | Enforces allow/deny rules for access to device nodes (e.g., /dev/null) using Berkeley Packet Filter (BPF) programs, preventing unauthorized operations like read/write on specific major:minor device pairs. In v2, it relies on eBPF for flexible policy definition.[18]
pids | Process and thread counts | Limits the number of tasks (processes or threads) that can be created within a cgroup via fork() or clone(), accounting for both direct and threaded modes to prevent fork bombs; it provides current usage tracking and a maximum limit. Available since the initial cgroup v2 release in kernel 4.5 (2016).[19]
rdma | Remote Direct Memory Access (RDMA) resources | Accounts for and limits RDMA/InfiniBand hardware resources, such as host channel adapter (HCA) handles and queue pairs, enabling fair sharing among cgroups in high-performance computing environments. Ported to v2 from v1 and available since Linux kernel 4.11 (2017).[20]
hugetlb | Huge page memory | Limits the usage of huge TLB pages per cgroup, enforced during allocation to manage large memory pages for performance-critical applications. Available since the initial cgroup v2 release in kernel 4.5 (2016).[21]
misc | Miscellaneous scalar resources | Provides a generic interface for limiting and accounting various scalar resources registered by kernel subsystems, such as RDMA-specific or other non-standard resources. Available since Linux kernel 5.13 (2021).[22]
dmem | Device memory | Regulates the allocation and usage of device-specific memory, such as GPU video RAM, to prevent overcommitment and enable fair sharing in heterogeneous computing environments. Introduced in Linux kernel 6.14 (2025).[23]
net_cls | Network packet classification (v1-specific) | In cgroup v1, tags network packets with class IDs for traffic control (tc) integration, allowing classification based on cgroup membership; not fully ported to v2, where network management relies on other mechanisms.
net_prio | Network priority (v1-specific) | In cgroup v1, sets priority levels for outgoing network traffic per cgroup, influencing socket buffer prioritization; similar to net_cls, it is primarily a v1 feature without direct v2 equivalent.
These controllers can be mounted and enabled collectively in v2 by mounting the cgroup2 filesystem (e.g., mount -t cgroup2 none /sys/fs/cgroup) and specifying desired ones in the root's cgroup.subtree_control file, such as echo "+cpu +memory +io" > cgroup.subtree_control. This approach ensures only relevant resources are delegated down the hierarchy, integrating seamlessly with the overall tree structure. Additional controllers like cpuset (for CPU/node affinity) and perf_event (for performance monitoring) exist but are outside the primary focus here.[24][8]
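The enable/disable syntax for cgroup.subtree_control is a space-separated list of +name and -name tokens, as in the echo example above. A small formatter sketch (the helper name is ours):

```python
def subtree_control(enable=(), disable=()) -> str:
    """Build the string written to cgroup.subtree_control, e.g. '+cpu +memory -io'."""
    tokens = [f"+{c}" for c in enable] + [f"-{c}" for c in disable]
    if not tokens:
        raise ValueError("nothing to change")
    return " ".join(tokens)

# Usage:
#   open("/sys/fs/cgroup/cgroup.subtree_control", "w").write(
#       subtree_control(enable=["cpu", "memory", "io"]))
```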

Versions

Version 1 Details

Control Groups version 1 (cgroups v1) implements a flexible but complex architecture centered around multiple independent hierarchies, each typically dedicated to a single resource controller or subsystem. In this design, each controller—such as CPU, memory, or block I/O—operates within its own separate hierarchy, which must be mounted as a distinct filesystem instance under /sys/fs/cgroup. For example, the CPU controller is mounted at /sys/fs/cgroup/cpu, while the memory controller uses /sys/fs/cgroup/memory, allowing administrators to apply different grouping policies for different resources without interference.[2] This multi-hierarchy approach enables fine-grained control but requires managing multiple mount points and can lead to administrative overhead. Tasks are assigned to groups within a hierarchy by writing their ID to the tasks file in the target cgroup directory, such as echo $PID > /sys/fs/cgroup/cpu/tasks; writing to tasks moves an individual task (thread), whereas writing a PID to the group's cgroup.procs file moves a process together with all of its threads.[2] Despite its capabilities, cgroups v1 exhibits several key limitations that affect usability and consistency. Delegation of control to non-root users is inconsistent across controllers, as some subsystems support threaded delegation while others do not, complicating containerized environments where subtrees need to be managed by unprivileged users.[2] Additionally, each controller exposes its own set of configuration files unique to its subsystem, resulting in a fragmented interface that varies by resource type and lacks a unified namespace for properties.
Some features, such as memory pressure notifications, lack per-process granularity and operate only at the cgroup level, limiting precise monitoring and control.[2] Coexistence with cgroups v2 is possible through hybrid setups, but this introduces complexities like restricted remounting of v1 hierarchies, with kernel support for such operations slated for removal in future releases.[2] cgroups v1 includes several features that are either unique to it or behave differently compared to later versions, providing specialized resource management options. The freezer controller allows administrators to suspend or resume entire groups of tasks by transitioning them between frozen and thawed states via the freezer.state file, enabling coordinated pausing of processes for maintenance or checkpointing without affecting the entire system. Similarly, the cpuset controller facilitates CPU and memory node affinity by restricting tasks in a cgroup to specific processors or NUMA nodes, configured through files like cpuset.cpus and cpuset.mems, which is particularly useful for performance tuning in multi-core or distributed-memory environments. Regarding its lifecycle, cgroups v1 has been progressively deprecated in favor of version 2, with the Linux kernel recommending boot-time disablement of v1 hierarchies since version 4.15 released in early 2018 to encourage adoption of the unified model. This recommendation aligns with the introduction of the cgroup_no_v1 boot parameter in kernel 5.0, which allows explicit disabling of v1 named hierarchies (e.g., cgroup_no_v1=all).[1] As of 2025, discussions on full removal of v1 code from the kernel continue, including proposals for deprecation warnings and phased elimination, driven by maintainers like Tejun Heo amid broader ecosystem shifts such as systemd 258 dropping v1 support.[25]

Version 2 Improvements

Cgroups version 2 introduces a unified hierarchy design, in which all controllers are organized under a single tree structure, in contrast with the multiple independent hierarchies of version 1. This unification enables consistent resource distribution across the system and facilitates delegation of sub-hierarchies to less privileged users or namespaces without risking inconsistencies in resource accounting.[4] The single hierarchy also supports thread-level granularity for certain controllers, such as CPU and PIDs, allowing threads within a process to be controlled independently via the cgroup.threads file, which lists a cgroup's threads and permits their migration to other cgroups.[4]

Among the new capabilities, the PIDs controller limits the number of processes and threads that can be created within a cgroup, preventing fork bombs and aiding resource isolation; for example, setting pids.max to 100 restricts the cgroup to no more than 100 tasks.[4] Memory accounting is enhanced with tiered limits: memory.low protects a minimum amount of memory for the cgroup against reclaim, memory.high acts as a soft limit that throttles allocations and triggers reclaim when exceeded without terminating processes, and memory.max enforces a hard limit that leads to out-of-memory kills if breached.[4] The I/O controller is unified under a single interface, supporting weight-based proportional allocation (io.weight) and maximum bandwidth limits per device (io.max), which simplifies configuration compared with the fragmented blkio and throttling interfaces of version 1.[4]

Cgroups v2 has become the default in modern Linux distributions, including Fedora since version 31 (2019), Ubuntu since 21.10 (2021), and Debian since 11 (2021), reflecting its maturity and improved stability.[26] For systems requiring coexistence with version 1, a hybrid mode is supported by mounting specific controllers to legacy hierarchies alongside the unified v2 mount point; the kernel boot parameter cgroup_no_v1 disables v1 controllers either entirely (cgroup_no_v1=all) or selectively.[4] Performance benefits stem from the unified mounting, which reduces kernel overhead in managing multiple filesystem instances and improves scalability for large hierarchies with thousands of cgroups; dynamic operations such as task migrations incur lower latency when the favordynmods mount option, introduced in Linux 5.19, is used.[4] Delegation allows unprivileged users to manage sub-hierarchies without root privileges, provided the delegated cgroup directory and its interface files are appropriately owned, improving security in containerized environments. Pressure stall information (PSI), integrated since kernel 4.20, provides per-cgroup metrics on CPU, memory, and I/O pressure, helping administrators detect and mitigate bottlenecks before they impact performance.
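The tiered limits described above are set by writing to plain files under the v2 hierarchy. A minimal sketch, to be run as root on a cgroup v2 system; the group name demo and the limit values are illustrative:

```shell
# Create a cgroup under the v2 root (assumes cgroup2 is mounted at /sys/fs/cgroup)
mkdir /sys/fs/cgroup/demo

# Cap the group at 100 tasks to contain fork bombs
echo 100 > /sys/fs/cgroup/demo/pids.max

# Tiered memory limits: protect 64 MiB from reclaim, throttle above 256 MiB,
# and OOM-kill above 512 MiB
echo $((64 * 1024 * 1024))  > /sys/fs/cgroup/demo/memory.low
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/demo/memory.high
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/demo/memory.max

# Move the current shell into the group; child processes inherit membership
echo $$ > /sys/fs/cgroup/demo/cgroup.procs
```

Because membership is inherited, any workload started from this shell now runs under all three memory thresholds and the task limit at once.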

Features and Capabilities

Resource Limiting and Control

Control groups (cgroups) provide mechanisms to enforce resource limits and quotas on groups of processes, ensuring predictable resource usage in multi-tenant environments. These limits are categorized into hard limits, which impose strict maximums that cannot be exceeded; soft limits, which serve as preferred thresholds for proactive management; and shares, which enable proportional allocation based on relative weights. For instance, in cgroup v1 the memory controller uses memory.limit_in_bytes for a hard limit on memory usage and memory.soft_limit_in_bytes as a soft limit that guides reclaim under memory pressure.[27] In cgroup v2, these are refined into memory.max for hard limits and memory.high for soft throttling to prevent excessive pressure.[16] Similarly, CPU shares are set via cpu.shares in v1 or cpu.weight (ranging from 1 to 10000, default 100) in v2 to allocate resources proportionally among competing cgroups using weighted fair queuing.[28][15]

Enforcement occurs at the kernel level, integrating with core subsystems for immediate intervention. For CPU resources, the Completely Fair Scheduler (CFS) throttles tasks exceeding their quota, ensuring fair distribution without allowing bursts beyond allocated shares.[29] Memory enforcement involves direct reclamation attempts followed by invocation of the out-of-memory (OOM) killer if usage hits the hard limit and cannot be reduced, targeting processes within the cgroup to free memory.[16] I/O limiting uses device-specific throttling to cap bandwidth or operations, avoiding global impacts from misbehaving workloads.[2]

Practical examples illustrate these controls in action. In cgroup v2, CPU quotas are configured by writing to cpu.max in the format "quota period" (in microseconds), such as "100000 200000" to limit a cgroup to 100 ms of CPU time every 200 ms, i.e. 50% of one CPU.[15] For I/O, v1's blkio controller sets throttling via blkio.throttle.read_bps_device to restrict read bytes per second on specific devices, while v2's io controller uses io.max for combined bandwidth and IOPS limits, e.g., capping reads at 2 MB/s.[2][17]

Advanced features enhance control through feedback and refined scheduling. Weighted fair queuing underlies CPU allocation, where higher weights grant larger shares during contention, integrated into the CFS for low-latency fairness.[29] Additionally, pressure stall information (PSI), introduced in Linux kernel 4.20 in 2018, provides feedback via per-cgroup pressure files (e.g., cpu.pressure, memory.pressure) that report stall times due to resource contention, enabling dynamic adjustments such as load migration to avoid OOM events.[30]
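The cpu.max and io.max formats described above can be exercised directly through cgroupfs. A sketch, to be run as root on a cgroup v2 system; the group name web and the device numbers are illustrative:

```shell
# Assumes an existing cgroup 'web' under the v2 root
cd /sys/fs/cgroup/web

# 50% of one CPU: 100 ms of runtime per 200 ms period (values in microseconds)
echo "100000 200000" > cpu.max

# Twice the default weight (100) when CPUs are contended
echo 200 > cpu.weight

# Cap device 8:0 (major:minor) at 2 MB/s of reads and 1000 read IOPS
echo "8:0 rbps=2097152 riops=1000" > io.max
```

Note that cpu.max is an absolute cap that applies even on an idle machine, whereas cpu.weight only matters when sibling cgroups compete for CPU time.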

Accounting and Monitoring

Control groups (cgroups) provide accounting mechanisms to track resource consumption for groups of processes and their descendants, enabling administrators to monitor usage without enforcing limits. These mechanisms rely on kernel-maintained statistics files exposed in each cgroup directory, which report aggregated data from all tasks in the cgroup and its subtree. For instance, the memory controller exposes memory.current to show the total current memory usage in bytes, while the CPU controller provides cpu.stat with fields like usage_usec for total CPU time consumed and nr_throttled for the number of throttling periods when the Completely Fair Scheduler is active.[4] In cgroups version 1 (v1), accounting is handled per controller with separate hierarchies, where files such as memory.usage_in_bytes in the memory subsystem report usage for the cgroup and its children, aggregated hierarchically to reflect the tree structure. Version 2 (v2) unifies this into a single hierarchy, improving aggregation by ensuring that stats such as those in memory.current and cpu.stat inherently include contributions from all descendant cgroups without requiring manual summation. Event counters, such as io.stat in the IO controller, detail specific interactions, for example rbytes for bytes read, providing per-device I/O statistics without real-time guarantees unless paired with external polling tools.[2][4]

Monitoring in cgroups integrates with pressure stall information (PSI), a kernel feature introduced in version 4.20 that detects and reports resource contention by measuring the time tasks spend stalled waiting for CPU, memory, or I/O. PSI files named cpu.pressure, memory.pressure, and io.pressure are available in cgroup directories, tracking both "some" (partial stalls affecting some tasks) and "full" (complete stalls affecting all tasks) over averaging windows of 10 s, 60 s, and 300 s, with hierarchical aggregation to show pressure contributed by sub-cgroups. Kernel 5.2 completed per-cgroup PSI support in v2 and added userspace-configurable stall thresholds ("triggers") that are registered by writing to a pressure file and waited on with poll().[30][4]

A further improvement in v2 accounting is enhanced slab memory tracking: the memory.stat file includes slab_reclaimable and slab_unreclaimable counters to distinguish reclaimable kernel slab allocations (such as dentries) from permanent ones, giving a more complete view of the kernel memory footprint per cgroup. These stats are exported to userspace primarily through the cgroup filesystem (cgroupfs) mounted at /sys/fs/cgroup, with process membership visible via /proc/$PID/cgroup, allowing tools to query and aggregate data for monitoring without kernel modifications. Event files such as cgroup.events and memory.events can additionally be watched with poll() or inotify, enabling event-based monitoring of membership changes and limit breaches without continuously re-reading the statistics files.[4]
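The pressure files share a fixed, line-oriented text format, which makes them easy to consume from monitoring scripts. A sketch that extracts the 10-second "some" average from a memory.pressure-style report; the numeric values here are made up for illustration rather than read from a live system:

```shell
# Two lines in the format emitted by memory.pressure (sample values)
psi='some avg10=1.23 avg60=0.84 avg300=0.32 total=123456
full avg10=0.16 avg60=0.08 avg300=0.02 total=45678'

# Extract the avg10 value from the "some" line: field 2 is "avg10=<value>"
some_avg10=$(printf '%s\n' "$psi" | awk '$1 == "some" { sub(/avg10=/, "", $2); print $2 }')
echo "$some_avg10"
```

On a real system the same pipeline would read from a path such as /sys/fs/cgroup/<group>/memory.pressure instead of the sample variable.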

Usage and Interfaces

Control Interfaces

The primary interface for interacting with control groups (cgroups) from userspace is the cgroup filesystem, mounted by default at /sys/fs/cgroup, which exposes a hierarchical directory structure where cgroups are represented as subdirectories and their properties as files.[2] Users can create, modify, and delete cgroups using standard filesystem operations like mkdir, rmdir, and file writes; for example, writing a process ID (PID) to the cgroup.procs file assigns that process to the cgroup, enabling resource control and monitoring.[4] Key files include cgroup.procs for listing and assigning processes (individual threads are listed in v1's tasks file), cgroup.subtree_control for enabling controllers in child cgroups (v2-specific), and controller-specific files like memory.max for setting limits.[1]

In cgroups v1, the filesystem supports multiple hierarchies, each mounted separately for specific controllers (e.g., mount -t cgroup -o cpu cpu /sys/fs/cgroup/cpu), allowing independent management but leading to complexity in overlapping controls.[2] Conversely, cgroups v2 employs a unified hierarchy mounted at a single point (e.g., mount -t cgroup2 none /sys/fs/cgroup), integrating all controllers under one tree to simplify administration and ensure consistent resource delegation from parent to child cgroups.[4] This unified approach eliminates v1's per-controller mount requirements, with available controllers listed in the root's cgroup.controllers file.[1]

Programmatic access is facilitated by libraries and tools such as libcgroup, a C library (historically known as libcg) that abstracts filesystem operations for creating and managing cgroups. Command-line utilities like cgcreate (to create cgroups) and cgexec (to execute processes within a cgroup) from the same package provide user-friendly wrappers, primarily for v1; while partial v2 support exists in recent versions (e.g., 3.0+ as of 2024), for cgroup v2 it is recommended to use the filesystem interface directly or tools like systemd-run, as full v2 compatibility is still evolving.[31] Systemd, as the default init system on many distributions, offers integrated cgroup management through its unit files and D-Bus APIs, automatically creating cgroups for services (e.g., under system.slice) and allowing resource limits like CPUQuota= to be set declaratively.[32] For delegation, units can enable sub-cgroup control with Delegate=yes, enabling finer-grained management within slices.[33]

At the kernel level, task movement between cgroups is handled internally by functions such as cgroup_attach_task, invoked when userspace writes to cgroup.procs or equivalent files, ensuring atomic updates and permission checks.[2] In cgroups v2, the cgroup.events interface file supports event notifications: state changes such as a cgroup becoming empty or frozen can be observed with poll() or inotify, allowing userspace applications to monitor hierarchy dynamics without scanning the filesystem.[4] For systems transitioning to v2, coexistence with v1 is supported in hybrid mode, where controllers unused by v2 can be bound to legacy v1 hierarchies to maintain compatibility for applications relying on v1-specific behaviors, such as per-controller mounts.[4] This fallback enables gradual migration, with systemd often managing the unified v2 tree while exposing v1 hierarchies for legacy controllers like blkio.[1]
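The delegation model described above boils down to two filesystem operations: enabling controllers in the parent and handing directory ownership to the delegatee. A sketch, to be run as root on a cgroup v2 system; the paths, controller set, and the user name alice are illustrative:

```shell
# Enable the cpu and memory controllers for children of a parent cgroup
echo "+cpu +memory" > /sys/fs/cgroup/parent/cgroup.subtree_control

# Create a child and delegate it to an unprivileged user: ownership of the
# directory plus the membership and controller files is what grants control
mkdir /sys/fs/cgroup/parent/child
chown alice /sys/fs/cgroup/parent/child
chown alice /sys/fs/cgroup/parent/child/cgroup.procs
chown alice /sys/fs/cgroup/parent/child/cgroup.subtree_control
```

After this, alice can create sub-cgroups under child and move her own processes between them, while the limits set on parent still bound everything she does.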

Configuration Methods

Configuration of control groups (cgroups) can occur at boot time through kernel parameters or at runtime via filesystem operations and tools. Boot-time settings primarily control the hierarchy type and available controllers, ensuring compatibility with system management daemons like systemd. As of 2024, major distributions and init systems such as systemd default to cgroup v2; systemd deprecated v1 support in version 256 (June 2024) and removed it in version 258 (September 2025), and container technologies such as Kubernetes have placed v1 in maintenance mode.[34][35]

On systems where v1 support is still available, hybrid or legacy layouts can be selected with the kernel command-line option systemd.unified_cgroup_hierarchy=0, optionally combined with systemd.legacy_systemd_cgroup_controller=yes for a fully legacy setup. Conversely, the kernel parameter cgroup_no_v1=all disables all v1 controllers, forcing exclusive use of v2, while systemd.unified_cgroup_hierarchy=1 makes systemd mount the unified v2 hierarchy without disabling v1 in the kernel. These parameters are added to the kernel command line; for example, on systems using GRUB, edit /etc/default/grub to append them to GRUB_CMDLINE_LINUX_DEFAULT, then run update-grub to apply the change across boots.[4][36]

At runtime, cgroups are managed through the cgroup filesystem, typically mounted at /sys/fs/cgroup. To create a new cgroup, use mkdir in the appropriate hierarchy directory, such as mkdir /sys/fs/cgroup/mygroup for v2.[1] Processes are assigned by writing their PID to the cgroup.procs file: echo <PID> > /sys/fs/cgroup/mygroup/cgroup.procs.[4] Resource limits are set by writing to controller-specific files, like echo "50000 100000" > cpu.max to grant 50 ms of CPU time per 100 ms period.[4] For scripted management, the libcgroup-tools package provides utilities like cgcreate to create cgroups and cgset to configure parameters.
For instance, cgcreate -g cpu:/cpulimited creates a CPU cgroup, followed by cgset -r cpu.shares=512 cpulimited to allocate half the default shares.[37] A basic script for a CPU-limited group might look like this:
#!/bin/bash
cgcreate -g cpu:/limited
cgset -r cpu.shares=256 limited  # Relative weight: a quarter of the default 1024 when CPUs are contended
cgexec -g cpu:limited stress --cpu 4 --timeout 60s
This creates the group, sets shares, and runs a workload within it (v1 example; shares are a relative weight under contention, not a hard cap). To enable controllers for child cgroups, write to cgroup.subtree_control in the parent, e.g., echo "+cpu" > /sys/fs/cgroup/user.slice/cgroup.subtree_control; combined with delegating ownership of the subdirectory, this allows non-root users to manage their own sub-hierarchy.[4] Additional tools include cgclassify for reclassifying running processes into cgroups, such as cgclassify -g cpu:/limited <PID>, and systemd-run for ad-hoc cgroups without persistent setup: systemd-run --scope -p CPUShares=256 stress --cpu 4 (CPUWeight= is the v2-era equivalent property).[38][39] Troubleshooting mount issues often involves verifying the cgroup filesystem is mounted with mount | grep cgroup; if absent, mount manually with mount -t cgroup2 none /sys/fs/cgroup for v2, ensuring controllers are enabled via kernel parameters if needed.[40] Common errors like "no cgroup mount found" arise from mismatched v1/v2 configurations or disabled controllers.[41]
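A cgroup v2 equivalent of the v1 script above uses plain filesystem writes instead of the libcgroup wrappers; the group name limited and the limit values are illustrative, and the commands require root:

```shell
#!/bin/bash
# cgroup v2 counterpart of the v1 cgcreate/cgset/cgexec script
mkdir /sys/fs/cgroup/limited
echo 25 > /sys/fs/cgroup/limited/cpu.weight            # quarter of the default weight (100)
echo "200000 100000" > /sys/fs/cgroup/limited/cpu.max  # hard cap: 2 CPUs' worth of time
echo $$ > /sys/fs/cgroup/limited/cgroup.procs          # move this shell (and children) in
stress --cpu 4 --timeout 60s                           # workload now runs under the limits
```

Here cpu.weight mirrors the proportional role of cpu.shares, while cpu.max adds an absolute ceiling that v1 expressed separately via cpu.cfs_quota_us and cpu.cfs_period_us.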

Evolution and Transitions

v1 Redesigns and Enhancements

During the evolution of control groups version 1 (cgroups v1), several redesigns and enhancements were introduced to address scalability limitations, improve resource accounting, and mitigate operational challenges, primarily between 2013 and 2014. One key redesign was the conversion of the cgroup filesystem from its custom implementation to kernfs, completed in Linux kernel 3.15 (released June 2014). This shift leveraged a unified virtual filesystem framework shared with sysfs, enhancing scalability by optimizing directory traversal, reducing lock contention, and lowering memory usage in environments with thousands of cgroups.[42] Another important enhancement was the addition of cgroup namespaces, introduced in Linux kernel 4.6 (May 2016), which allowed the cgroup hierarchy to be scoped per namespace. This feature enabled processes in different namespaces to maintain isolated views of the hierarchy, preventing cross-namespace visibility and improving security in containerized setups without affecting the global structure.[1]

Experiments with unified hierarchies began in 2013, as discussed at the Linux Kernel Summit, where developers explored consolidating multiple controller-specific hierarchies into a single structure to simplify management and reduce inconsistencies in process classification across controllers.[43] New features in v1 included cgroup-aware writeback support for the blkio controller, merged in Linux kernel 4.2 (August 2015), which extended I/O throttling to buffered write operations, ensuring accurate accounting and limiting of dirty-page writebacks per cgroup. Refinements to the memsw (memory plus swap) interface in the memory controller, around kernel 3.15, improved swap usage tracking by better integrating swap limits with memory pressure notifications, allowing more reliable enforcement of combined memory and swap caps. The perf_event controller, initially added in kernel 2.6.39 (May 2011) for basic performance event monitoring, saw expansions in kernel 3.14 (March 2014) to integrate more tightly with the core cgroup framework, enabling hierarchical aggregation of perf events such as CPU cycles and cache misses for grouped processes.[1][44]

Despite these advances, challenges persisted in v1, particularly delegation inconsistencies: file-permission-based delegation led to varying behaviors across controllers, such as mismatched support for subdirectory creation or process movement. Partial fixes were applied in subsequent kernels, like improved permission checks in 3.15, but full resolution required v2's domain-based delegation. Additionally, the cgroup release agent was refined as a mechanism for automated cleanup; configured via the release_agent file in the root cgroup of a v1 hierarchy, it executes a user-defined program when a cgroup with notify_on_release enabled becomes empty, aiding resource reclamation and hierarchy maintenance.[2]
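The release-agent mechanism mentioned above is configured entirely through v1 cgroupfs files. A sketch, to be run as root, assuming a hypothetical cleanup program at /usr/local/bin/cgroup-cleanup:

```shell
# Register a helper to run whenever a cgroup in this v1 hierarchy empties;
# the kernel invokes it with the released cgroup's relative path as argument
echo /usr/local/bin/cgroup-cleanup > /sys/fs/cgroup/memory/release_agent

# Opt a specific cgroup in to empty-notification
echo 1 > /sys/fs/cgroup/memory/mygroup/notify_on_release
```

In cgroup v2 this mechanism is gone; the equivalent signal is the populated field of cgroup.events, which userspace watches with poll() or inotify instead of having the kernel spawn a helper.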

Migration to v2

To migrate from cgroups v1 to v2, the primary step involves enabling the unified v2 hierarchy by mounting the cgroup2 filesystem at the root location, typically via the command mount -t cgroup2 none /sys/fs/cgroup.[4] This establishes a single hierarchy for all controllers, replacing the multiple v1 hierarchies. Existing v1 hierarchies mounted under /sys/fs/cgroup can then be converted by unmounting them and remounting the v2 filesystem, with processes migrated using the cgroup.procs file in the target v2 cgroup to move PIDs from v1 to v2 structures.[4] For legacy support during transition, v2 offers a compatibility mode that allows hybrid setups where unavailable v1 controllers can be mounted alongside v2, though this is not recommended for full adoption as it maintains fragmentation.[4] Key tools facilitate the migration process. Systemd enables v2 by default in unified mode when the kernel command line includes systemd.unified_cgroup_hierarchy=1, which automates hierarchy conversion during boot on supported systems. For container environments, tools like crictl (part of CRI-tools) allow inspection and management of v2 cgroups in CRI-compatible runtimes such as containerd v1.4+, enabling verification of container paths under /sys/fs/cgroup post-migration.[5] Distributions like Fedora have included automatic migration scripts since Fedora 31 (released in 2019), which detect and switch to v2 on upgrade while handling Docker and other legacy tools via temporary v1 fallbacks.[45] Migration requires careful consideration of controller compatibility changes. 
For instance, the v1 freezer controller is replaced in v2 by the cgroup.freeze interface file, which suspends or thaws all tasks in a cgroup when 1 or 0, respectively, is written to it, rather than using separate freezer-specific files.[4] Process management shifts to the unified cgroup.procs file for migrations, which lists PIDs and accepts writes to move tasks across cgroups.[4] Additionally, out-of-memory (OOM) behavior differs; v2 introduces the memory.oom.group knob, which, when enabled, directs the OOM killer to terminate the entire cgroup instead of individual processes, potentially altering application reliability and requiring testing for workloads sensitive to group-wide kills.[4] The benefits of migration include reduced system complexity through a single hierarchy and improved delegation for unprivileged users, enabling safer containerization without root privileges.[4] However, pitfalls arise from the need to update applications and tools reliant on v1-specific interfaces, as some v1 controllers (e.g., net_cls) were never ported, potentially causing compatibility breaks during transition.[1] Broad v2 controller coverage was in place by Linux kernel 5.0 (released in 2019), with subsequent kernels, such as 5.2 with its v2 freezer, filling the remaining gaps and enhancing stability.[46] Recent distribution trends reflect widespread adoption: Fedora 31 (2019), Ubuntu 21.10 (2021), Debian 11 (2021), and RHEL 9 (2022), along with their successors, now default to v2, often with automated boot-time enabling via systemd.[45][47]
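The interface changes above can be compared side by side; the group name mygroup is illustrative and the commands require root:

```shell
# v1: freezing goes through the dedicated freezer controller's state file
echo FROZEN > /sys/fs/cgroup/freezer/mygroup/freezer.state

# v2: the core cgroup.freeze file replaces it; 1 freezes, 0 thaws
echo 1 > /sys/fs/cgroup/mygroup/cgroup.freeze
echo 0 > /sys/fs/cgroup/mygroup/cgroup.freeze

# v2: opt in to whole-cgroup OOM kills instead of per-process kills
echo 1 > /sys/fs/cgroup/mygroup/memory.oom.group
```

Workloads whose processes cooperate (e.g., a coordinator and its workers) often benefit from memory.oom.group, since killing one member while the rest survive can leave the application in a worse state than a clean group-wide restart.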

Adoption and Integration

Use in Container Technologies

Control groups (cgroups) form the foundational mechanism for resource isolation and management in container technologies, enabling runtimes to enforce limits on CPU, memory, and other resources so that no single container can starve the host system. Docker, introduced in 2013, relies on cgroups as a core component for container resource constraints, mapping command-line flags such as --memory for hard memory limits (e.g., 300m for 300 MiB) and --cpus for CPU-time caps (e.g., 1.5 CPUs' worth on a multi-core host) directly to cgroup filesystem entries such as memory.limit_in_bytes and cpu.cfs_quota_us on v1 hosts, or memory.max and cpu.max on v2.[48] Similarly, LXC uses cgroups to allocate and limit resources for containers, integrating them with namespaces for process isolation and ensuring controlled access to host resources such as CPU time and memory.[49] Podman, a daemonless alternative to Docker, employs cgroups by default via the --cgroups=enabled option, creating new cgroups under a specified parent path to manage container resource limits and supporting both v1 and v2 hierarchies.[50] In orchestration platforms like Kubernetes, cgroups underpin pod-level resource quotas through the ResourceQuota API, which imposes namespace-wide limits on aggregate CPU and memory consumption enforced by the container runtime's cgroup configurations.
For instance, a ResourceQuota can cap total memory at 1Gi across all pods in a namespace, with the kubelet instructing the runtime to apply these via cgroups to avoid host resource exhaustion.[51][52] CRI-O, a lightweight Kubernetes runtime, provides direct support for cgroup v2 starting with version 1.20 in late 2020, allowing unified resource delegation and improved hierarchical management for pods.[5] As of August 2024, Kubernetes version 1.31 placed cgroup v1 support in maintenance mode, promoting full adoption of v2 for enhanced resource management.[34] Practical examples of cgroup application in containers include isolating CPU and memory to mitigate denial-of-service risks; for a memory-limited container, exceeding the cgroup-set threshold triggers the kernel's out-of-memory killer, terminating the process while preserving host stability.[48] Additionally, the net_cls controller tags network packets from a container's cgroup with a class identifier (e.g., writing 0x100001 to net_cls.classid), enabling integration with network namespaces and the Linux traffic control (tc) utility for quality-of-service shaping, such as prioritizing container traffic.[53] The evolution toward cgroup v2 in container ecosystems enhances delegation and simplifies hierarchies, with containerd adopting support in version 1.4 released in August 2020 to facilitate better subtree control and reduced overhead in multi-tenant environments.[54] This shift allows runtimes like containerd to delegate entire cgroup subtrees to containers, improving scalability for orchestration tools like Kubernetes. However, the global cgroup_mutex, which serves as the master lock for any modifications to cgroups or their hierarchies, can experience contention during frequent container creation and destruction operations, potentially impacting performance in high-load scenarios.[2]
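The flag-to-cgroup mapping can be inspected directly on a cgroup v2 host; the image name myimage and the container ID in the paths below are placeholders, and the system.slice location assumes Docker's systemd cgroup driver:

```shell
# Hard memory cap of 300 MiB and at most 1.5 CPUs' worth of CPU time
docker run -d --memory=300m --cpus=1.5 myimage

# Docker translates the flags into the container scope's v2 files, e.g.:
cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.max
# 314572800       (300 MiB in bytes)
cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/cpu.max
# 150000 100000   (150 ms of quota per 100 ms period = 1.5 CPUs)
```

Exceeding memory.max from inside the container triggers the kernel OOM killer against the container's processes only, leaving the host and other containers unaffected.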

Integration with System Management Tools

Systemd, the default init system in most modern Linux distributions, has integrated cgroups for resource management since version 205 in 2013, enabling automatic grouping of processes launched by services, scopes, and slices.[32] Slices organize hierarchical groupings, such as user.slice for per-user resource limits, while services and scopes map directly to cgroup paths for precise control over the CPU, memory, and I/O usage of managed processes.[32] This integration allows systemd to enforce limits declaratively in unit files, for example setting MemoryMax=1G in a service definition to cap memory allocation.[33]

The legacy init system Upstart provided cgroup support through job configuration files, where a cgroup stanza assigns a job's processes to controller hierarchies such as the cpu controller.[55] Supervisor, a process control system, can extend cgroup functionality via third-party plugins or custom scripts to monitor and limit resources for supervised programs, though it lacks native hierarchical delegation.[56] In Android, the low-memory killer (LMK) has utilized memory cgroups for out-of-memory (OOM) handling since kernel integration around 2012, prioritizing process termination based on cgroup pressure notifiers to maintain system responsiveness on resource-constrained devices.[57]

Key features in systemd include dynamic delegation via the Delegate=yes directive in unit files, which permits services to create and manage sub-cgroups independently while inheriting parent limits.[32] CPU accounting is enabled per service with the CPUAccounting=yes property, allowing runtime adjustments like systemctl set-property myservice.service CPUQuota=50% to throttle usage.[33] Starting with systemd 254 in 2023, enhanced support for cgroup v2 includes improved pressure event handling, where memory pressure propagates up the tree for proactive service adjustments. This includes monitoring the unit's cgroup path, such as /sys/fs/cgroup/system.slice/myservice.service/memory.pressure, for stall events, enabling units to react to contention without full OOM invocation.[33] In June 2024, systemd 256 deprecated support for cgroup v1 hierarchies, disabling them by default ahead of their removal in version 258, aligning with major Linux distributions that default to the v2 unified hierarchy.[58] Adoption of cgroup integration via systemd is widespread in enterprise environments; Red Hat Enterprise Linux 8 and later (released 2019) rely on it for service isolation in production workloads, supporting features like per-user slicing to prevent resource exhaustion in multi-tenant setups.[59]
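The declarative controls discussed above live in ordinary unit files. A sketch of a service unit combining a memory cap, a CPU quota, and delegation; the unit name, binary path, and limit values are illustrative:

```ini
# Illustrative unit: /etc/systemd/system/myservice.service
[Service]
ExecStart=/usr/local/bin/myservice
# Hard memory cap, enforced via memory.max in the unit's cgroup
MemoryMax=1G
# CPU hard cap, enforced via cpu.max
CPUQuota=50%
# Allow the service to manage its own sub-cgroups
Delegate=yes
```

After systemctl daemon-reload and systemctl start myservice, the corresponding values appear under /sys/fs/cgroup/system.slice/myservice.service/, and the service may create child cgroups inside that directory thanks to Delegate=yes.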

References
