OS-level virtualization
OS-level virtualization is an operating system (OS) virtualization paradigm in which the kernel allows the existence of multiple isolated user space instances, including containers (LXC, Solaris Containers, AIX WPARs, HP-UX SRP Containers, Docker, Podman, Guix), zones (Solaris Containers), virtual private servers (OpenVZ), partitions, virtual environments (VEs), virtual kernels (DragonFly BSD), and jails (FreeBSD jail and chroot).[1] Such instances may look like real computers from the point of view of programs running in them. A computer program running on an ordinary operating system can see all resources (connected devices, files and folders, network shares, CPU power, quantifiable hardware capabilities) of that computer. Programs running inside a container can only see the container's contents and devices assigned to the container.
On Unix-like operating systems, this feature can be seen as an advanced implementation of the standard chroot mechanism, which changes the apparent root folder for the current running process and its children. In addition to isolation mechanisms, the kernel often provides resource-management features to limit the impact of one container's activities on other containers. Linux containers are all based on the virtualization, isolation, and resource management mechanisms provided by the Linux kernel, notably Linux namespaces and cgroups.[2]
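The chroot mechanism itself can be invoked in a few lines of C. The sketch below is illustrative only: it assumes root privileges and a hypothetical, pre-populated directory tree at /srv/jail that contains a shell and its libraries.

```c
/* Minimal sketch: confining a child process with chroot(2).
 * Assumes root privileges and that /srv/jail (a made-up example path)
 * already contains the binaries and libraries the shell needs. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    if (chroot("/srv/jail") != 0) {   /* change the apparent root directory */
        perror("chroot");
        return EXIT_FAILURE;
    }
    if (chdir("/") != 0) {            /* make sure the working directory is inside the new root */
        perror("chdir");
        return EXIT_FAILURE;
    }
    /* From here on, "/" refers to /srv/jail for this process and its children. */
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");                  /* reached only if exec fails */
    return EXIT_FAILURE;
}
```

Because chroot by itself restricts only the apparent filesystem root and was never intended as a security boundary, container systems layer the isolation and resource-management mechanisms described in this article on top of it.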
Although the word container most commonly refers to OS-level virtualization, it is sometimes used to refer to fuller virtual machines operating in varying degrees of concert with the host OS,[citation needed] such as Microsoft's Hyper-V containers.[citation needed] For an overview of virtualization since 1960, see Timeline of virtualization technologies.
Operation
On ordinary operating systems for personal computers, a computer program can see (even though it might not be able to access) all the system's resources. They include:
- Hardware capabilities that can be employed, such as the CPU and the network connection
- Data that can be read or written, such as files, folders and network shares
- Connected peripherals it can interact with, such as webcam, printer, scanner, or fax
The operating system may allow or deny access to such resources based on which program requests them and on the user account under which it runs. The operating system may also hide those resources, so that when the computer program enumerates them, they do not appear in the enumeration results. Nevertheless, from a programming point of view, the program has still interacted with those resources, and the operating system has mediated that interaction.
With operating-system-level virtualization, or containerization, it is possible to run programs within containers, to which only parts of these resources are allocated. A program expecting to see the whole computer, once run inside a container, can only see the allocated resources and believes them to be all that is available. Several containers can be created on each operating system, to each of which a subset of the computer's resources is allocated. Each container may contain any number of computer programs. These programs may run concurrently or separately, and may even interact with one another.
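On Linux, this restricted view is produced with kernel namespaces. The following C sketch is a minimal, Linux-specific illustration (it requires root or CAP_SYS_ADMIN): the child process is placed in new PID and UTS namespaces, so it observes itself as PID 1 and can change its hostname without affecting the host.

```c
/* Minimal sketch: giving a child process its own PID and UTS namespaces
 * with clone(2). Linux-specific; needs root or CAP_SYS_ADMIN. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];        /* stack for the cloned child */

static int child_main(void *arg)
{
    (void)arg;
    sethostname("container", 9);             /* visible only inside the new UTS namespace */
    printf("inside: pid=%ld\n", (long)getpid());   /* prints 1 in the new PID namespace */
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");
    return 1;
}

int main(void)
{
    pid_t pid = clone(child_main, child_stack + sizeof(child_stack),
                      CLONE_NEWPID | CLONE_NEWUTS | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone");
        return EXIT_FAILURE;
    }
    printf("outside: child pid=%ld\n", (long)pid); /* the host-visible PID */
    waitpid(pid, NULL, 0);
    return EXIT_SUCCESS;
}
```

Container runtimes combine several such namespaces (mount, network, user, IPC) and add resource limits via cgroups to build the full container abstraction.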
Containerization has similarities to application virtualization: in the latter, only one computer program is placed in an isolated container, and the isolation applies only to the file system.
Uses
Operating-system-level virtualization is commonly used in virtual hosting environments, where it is useful for securely allocating finite hardware resources among a large number of mutually distrusting users. System administrators may also use it to consolidate server hardware by moving services hosted on separate machines into containers on a single server.
Other typical scenarios include separating several programs into separate containers for improved security, hardware independence, and added resource-management features.[3] The improved security provided by the use of a chroot mechanism, however, is not perfect.[4] Operating-system-level virtualization implementations capable of live migration can also be used for dynamic load balancing of containers between nodes in a cluster.
Overhead
Operating-system-level virtualization usually imposes less overhead than full virtualization because programs in OS-level virtual partitions use the operating system's normal system call interface and do not need to be subjected to emulation or be run in an intermediate virtual machine, as is the case with full virtualization (such as VMware ESXi, QEMU, or Hyper-V) and paravirtualization (such as Xen or User-mode Linux). This form of virtualization also does not require hardware support for efficient performance.
Flexibility
Operating-system-level virtualization is not as flexible as other virtualization approaches, since it cannot host a guest operating system different from the host one, or a different guest kernel. For example, with Linux, different distributions can be hosted because they share the same kernel, but other operating systems such as Windows cannot.
Solaris partially overcomes the limitation described above with its branded zones feature, which provides the ability to run an environment within a container that emulates an older Solaris 8 or 9 version in a Solaris 10 host. Linux branded zones (referred to as "lx" branded zones) are also available on x86-based Solaris systems, providing a complete Linux user space and support for the execution of Linux applications; additionally, Solaris provides utilities needed to install Red Hat Enterprise Linux 3.x or CentOS 3.x Linux distributions inside "lx" zones.[6][7] However, in 2010 Linux branded zones were removed from Solaris; in 2014 they were reintroduced in Illumos, which is the open source Solaris fork, supporting 32-bit Linux kernels.[8]
Storage
Some implementations provide file-level copy-on-write (CoW) mechanisms. (Most commonly, a standard file system is shared between partitions, and those partitions that change the files automatically create their own copies.) This is easier to back up, more space-efficient and simpler to cache than the block-level copy-on-write schemes common on whole-system virtualizers. Whole-system virtualizers, however, can work with non-native file systems and create and roll back snapshots of the entire system state.
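On Linux, OverlayFS is one common implementation of this file-level copy-on-write approach. The sketch below is illustrative only: it assumes root privileges and four hypothetical, pre-created directories, and it mounts a merged view in which reads come from a shared, read-only lower layer while writes are copied up into a private upper layer.

```c
/* Minimal sketch of file-level copy-on-write with OverlayFS on Linux
 * (requires root; the directories are made-up example paths that must
 * already exist, with workdir on the same filesystem as upperdir). */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    const char *opts =
        "lowerdir=/srv/base,"      /* shared, read-only image layer        */
        "upperdir=/srv/upper,"     /* this container's private writes      */
        "workdir=/srv/work";       /* scratch directory required by overlayfs */

    if (mount("overlay", "/srv/merged", "overlay", 0, opts) != 0) {
        perror("mount");
        return 1;
    }
    printf("merged view mounted at /srv/merged\n");
    return 0;
}
```

Several containers can share the same lower layer while each keeps its own upper layer, which is what makes the scheme space-efficient.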
Implementations
Actively Maintained / Developed Implementations
| Mechanism | Operating system | License | Start of development | File system isolation | Copy on write | Disk quotas | I/O rate limiting | Memory limits | CPU quotas | Network isolation | Nested virtualization | Partition checkpointing and live migration | Root privilege isolation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chroot | Most UNIX-like operating systems | Varies by operating system | 1982 | Partial[a] | No | No | No | No | No | No | Yes | No | No |
| Docker | Linux,[10] Windows x64,[11] macOS[12] | Apache License 2.0 | 2013 | Yes | Yes | Partial[b] | Yes (since 1.10) | Yes | Yes | Yes | Yes | Only in experimental mode with CRIU [1] | Yes (since 1.10) |
| Podman | Linux, Windows, macOS, FreeBSD | Apache License 2.0 | 2018 | Yes | Yes | Yes[14] | Yes | Yes | Yes | Yes | Yes | Yes[15] | Yes |
| LXC | Linux | GNU GPLv2 | 2008 | Yes[16] | Yes | Partial[c] | Partial[d] | Yes | Yes | Yes | Yes | Yes | Yes[16] |
| Apptainer (formerly Singularity[17]) | Linux | BSD Licence | 2015[18] | Yes[19] | Yes | Yes | No | No | No | No | No | No | Yes[20] |
| OpenVZ | Linux | GNU GPLv2 | 2005 | Yes | Yes[21] | Yes | Yes[e] | Yes | Yes | Yes[f] | Partial[g] | Yes | Yes[h] |
| Virtuozzo | Linux, Windows | Trialware | 2000[25] | Yes | Yes | Yes | Yes[i] | Yes | Yes | Yes[f] | Partial[j] | Yes | Yes |
| Solaris Containers (Zones) | illumos (OpenSolaris), Solaris | CDDL, Proprietary | 2004 | Yes | Yes (ZFS) | Yes | Partial[k] | Yes | Yes | Yes[l][28][29] | Partial[m] | Partial[n][o] | Yes[p] |
| FreeBSD jail | FreeBSD, DragonFly BSD | BSD License | 2000[31] | Yes | Yes (ZFS) | Yes[q] | Yes | Yes[32] | Yes | Yes[33] | Yes | Partial[34][35] | Yes[36] |
| vkernel | DragonFly BSD | BSD Licence | 2006[37] | Yes[38] | Yes[38] | — | ? | Yes[39] | Yes[39] | Yes[40] | ? | ? | Yes |
| WPARs | AIX | Commercial proprietary software | 2007 | Yes | No | Yes | Yes | Yes | Yes | Yes[r] | No | Yes[42] | ? |
| iCore Virtual Accounts | Windows XP | Freeware | 2008 | Yes | No | Yes | No | No | No | No | ? | No | ? |
| Sandboxie | Windows | GNU GPLv3 | 2004 | Yes | Yes | Partial | No | No | No | Partial | No | No | Yes |
| systemd-nspawn | Linux | GNU LGPLv2.1+ | 2010 | Yes | Yes | Yes[43][44] | Yes[43][44] | Yes[43][44] | Yes[43][44] | Yes | ? | ? | Yes |
| Turbo | Windows | Freemium | 2012 | Yes | No | No | No | No | No | Yes | No | No | Yes |
Historical/Defunct Implementations
| Mechanism | Operating system | License | Actively developed since or between | File system isolation | Copy on write | Disk quotas | I/O rate limiting | Memory limits | CPU quotas | Network isolation | Nested virtualization | Partition checkpointing and live migration | Root privilege isolation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Linux-VServer (security context) | Linux, Windows Server 2016 | GNU GPLv2 | 2001-2018 | Yes | Yes | Yes | Yes[s] | Yes | Yes | Partial[t] | ? | No | Partial[u] |
| lmctfy | Linux | Apache License 2.0 | 2013–2015 | Yes | Yes | Yes | Yes[s] | Yes | Yes | Partial[t] | ? | No | Partial[u] |
| sysjail | OpenBSD, NetBSD | BSD License | 2006–2009 | Yes | No | No | No | No | No | Yes | No | No | ? |
| rkt (rocket) | Linux | Apache License 2.0 | 2014[46]–2018 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | ? | ? | Yes |
See also
Notes
- ^ Root user can easily escape from chroot. Chroot was never supposed to be used as a security mechanism.[9]
- ^ For btrfs, overlay2, windowsfilter, and zfs storage drivers. [13]
- ^ Disk quotas per container are possible when using separate partitions for each container with the help of LVM, or when the underlying host filesystem is btrfs, in which case btrfs subvolumes are automatically used.
- ^ I/O rate limiting is supported when using Btrfs.
- ^ Available since Linux kernel 2.6.18-028stable021. Implementation is based on CFQ disk I/O scheduler, but it is a two-level schema, so I/O priority is not per-process, but rather per-container.[22]
- ^ a b Each container can have its own IP addresses, firewall rules, routing tables and so on. Three different networking schemes are possible: route-based, bridge-based, and assigning a real network device (NIC) to a container.
- ^ Docker containers can run inside OpenVZ containers.[23]
- ^ Each container may have root access without the possibility of affecting other containers.[24]
- ^ Available since version 4.0, January 2008.
- ^ Docker containers can run inside Virtuozzo containers.[26]
- ^ Yes with illumos[27]
- ^ See Solaris network virtualization and resource control for more details.
- ^ Only when top level is a KVM zone (illumos) or a kz zone (Oracle).
- ^ Starting in Solaris 11.3 Beta, Solaris Kernel Zones may use live migration.
- ^ Cold migration (shutdown-move-restart) is implemented.
- ^ Non-global zones are restricted so they may not affect other zones via a capability-limiting approach. The global zone may administer the non-global zones.[30]
- ^ Check the "allow.quotas" option and the "Jails and file systems" section on the FreeBSD jail man page for details.
- ^ Available since TL 02.[41]
- ^ a b Using the CFQ scheduler, there is a separate queue per guest.
- ^ a b Networking is based on isolation, not virtualization.
- ^ a b A total of 14 user capabilities are considered safe within a container. The rest cannot be granted to processes within that container without allowing that process to potentially interfere with things outside that container.[45]
References
- ^ Hogg, Scott (2014-05-26). "Software containers: Used more frequently than most realize". Network World. Network world, Inc. Retrieved 2015-07-09.
There are many other OS-level virtualization systems such as: Linux OpenVZ, Linux-VServer, FreeBSD Jails, AIX Workload Partitions (WPARs), HP-UX Containers (SRP), Solaris Containers, among others.
- ^ Rami, Rosen. "Namespaces and Cgroups, the basis of Linux Containers" (PDF). Retrieved 18 August 2016.
- ^ "Secure Bottlerocket deployments on Amazon EKS with KubeArmor | Containers". aws.amazon.com. 2022-10-20. Retrieved 2023-06-20.
- ^ Korff, Yanek; Hope, Paco; Potter, Bruce (2005). Mastering FreeBSD and OpenBSD security. O'Reilly Series. O'Reilly Media, Inc. p. 59. ISBN 0-596-00626-8.
- ^ Huang, D. (2015). "Experiences in using os-level virtualization for block I/O". Proceedings of the 10th Parallel Data Storage Workshop (PDF). pp. 13–18. doi:10.1145/2834976.2834982. ISBN 978-1-4503-4008-3. S2CID 3867190.
- ^ "System administration guide: Oracle Solaris containers-resource management and Oracle Solaris zones, Chapter 16: Introduction to Solaris zones". Oracle Corporation. 2010. Retrieved 2014-09-02.
- ^ "System administration guide: Oracle Solaris containers-resource management and Oracle Solaris zones, Chapter 31: About branded zones and the Linux branded zone". Oracle Corporation. 2010. Retrieved 2014-09-02.
- ^ Bryan Cantrill (2014-09-28). "The dream is alive! Running Linux containers on an illumos kernel". slideshare.net. Retrieved 2014-10-10.
- ^ "3.5. Limiting your program's environment". freebsd.org.
- ^ "Docker drops LXC as default execution environment". InfoQ.
- ^ "Install Docker desktop on Windows | Docker documentation". Docker. 9 February 2023.
- ^ "Get started with Docker desktop for Mac". Docker documentation. December 6, 2019.
- ^ "docker container run - Set storage driver options per container (--storage-opt)". docs.docker.com. 22 February 2024.
- ^ "podman-volume-create — Podman documentation". docs.podman.io. Retrieved 19 October 2025.
- ^ "podman-container-checkpoint — Podman documentation". docs.podman.io. Retrieved 19 October 2025.
- ^ a b Graber, Stéphane (1 January 2014). "LXC 1.0: Security features [6/10]". Retrieved 12 February 2014.
LXC now has support for user namespaces. [...] LXC is no longer running as root so even if an attacker manages to escape the container, he'd find himself having the privileges of a regular user on the host.
- ^ "Community Announcement | Apptainer - Portable, Reproducible Containers". apptainer.org. 2021-11-30. Retrieved 19 October 2025.
- ^ "Sylabs brings Singularity containers into commercial HPC | Top 500 supercomputer sites". www.top500.org.
- ^ "SIF — Containing your containers". www.sylabs.io. 14 March 2018.
- ^ Kurtzer, Gregory M.; Sochat, Vanessa; Bauer, Michael W. (May 11, 2017). "Singularity: Scientific containers for mobility of compute". PLOS ONE. 12 (5) e0177459. Bibcode:2017PLoSO..1277459K. doi:10.1371/journal.pone.0177459. PMC 5426675. PMID 28494014.
- ^ Bronnikov, Sergey. "Comparison on OpenVZ wiki page". OpenVZ Wiki. OpenVZ. Retrieved 28 December 2018.
- ^ "I/O priorities for containers". OpenVZ Virtuozzo Containers Wiki.
- ^ "Docker inside CT".
- ^ "Container". OpenVZ Virtuozzo Containers Wiki.
- ^ "Initial public prerelease of Virtuozzo (named ASPcomplete at that time)".
- ^ "Parallels Virtuozzo now provides native support for Docker".
- ^ Pijewski, Bill (March 1, 2011). "Our ZFS I/O Throttle". wdp.dtrace.org.
- ^ Network virtualization and resource control (Crossbow) FAQ Archived 2008-06-01 at the Wayback Machine
- ^ "Managing network virtualization and network resources in Oracle® Solaris 11.4". docs.oracle.com.
- ^ Oracle Solaris 11.1 administration, Oracle Solaris zones, Oracle Solaris 10 zones and resource management E29024.pdf, pp. 356–360. Available within an archive.
- ^ "Contain your enthusiasm - Part two: Jails, zones, OpenVZ, and LXC".
Jails were first introduced in FreeBSD 4.0 in 2000
- ^ "Hierarchical resource limits - FreeBSD Wiki". Wiki.freebsd.org. 2012-10-27. Retrieved 2014-01-15.
- ^ Zec, Marko (2003-06-13). "Implementing a clonable network stack in the FreeBSD kernel" (PDF). usenix.org.
- ^ "VPS for FreeBSD". Retrieved 2016-02-20.
- ^ "[Announcement] VPS // OS virtualization // alpha release". 31 August 2012. Retrieved 2016-02-20.
- ^ "3.5. Limiting your program's environment". Freebsd.org. Retrieved 2014-01-15.
- ^ Matthew Dillon (2006). "sys/vkernel.h". BSD cross reference. DragonFly BSD.
- ^ a b "vkd(4) — Virtual kernel disc". DragonFly BSD.
treats the disk image as copy-on-write.
- ^ a b Sascha Wildner (2007-01-08). "vkernel, vcd, vkd, vke — virtual kernel architecture". DragonFly miscellaneous information manual. DragonFly BSD.
- "vkernel, vcd, vkd, vke - virtual kernel architecture". DragonFly miscellaneous information manual.
- ^ "vkernel, vcd, vkd, vke - virtual kernel architecture". DragonFly On-Line Manual Pages. DragonFly BSD.
- ^ "IBM fix pack information for: WPAR network isolation - United States". ibm.com. 21 July 2011.
- ^ "Live application mobility in AIX 6.1". www.ibm.com. June 3, 2008.
- ^ a b c d "systemd-nspawn". www.freedesktop.org.
- ^ a b c d "2.3. Modifying control groups Red Hat Enterprise Linux 7". Red Hat Customer portal.
- ^ "Paper - Linux-VServer". linux-vserver.org.
- ^ Polvi, Alex. "CoreOS is building a container runtime, rkt". CoreOS Blog. Archived from the original on 2019-04-01. Retrieved 12 March 2019.
External links
- An introduction to virtualization Archived 2019-11-28 at the Wayback Machine
- A short intro to three different virtualization techniques
- Virtualization and containerization of application infrastructure: A comparison Archived 2023-03-15 at the Wayback Machine, June 22, 2015, by Mathijs Jeroen Scheepers
- Containers and persistent data, LWN.net, May 28, 2015, by Josh Berkus
OS-level virtualization
Fundamentals
Definition and Principles
OS-level virtualization is an operating system paradigm that enables the kernel to support multiple isolated user-space instances, referred to as containers, which operate on the same host kernel without requiring separate operating systems or hardware emulation. This method partitions the user space into distinct environments, allowing each instance to maintain its own processes, libraries, and configurations while sharing kernel services.[1]
The foundational principles revolve around kernel sharing, namespace isolation, and resource control. Kernel sharing permits all containers to leverage the host operating system's kernel directly for system calls, minimizing overhead compared to approaches that involve kernel duplication or emulation. Namespace isolation creates bounded views of system resources for each container, including separate process identifiers, network stacks, and mount points, ensuring that changes in one instance do not affect others. Resource control, typically implemented through control groups (cgroups), enforces limits on CPU, memory, disk I/O, and network usage, grouping processes and allocating quotas to maintain fairness and prevent resource exhaustion.[11][1]
In contrast to basic process isolation, which confines individual applications within the shared user space using limited mechanisms like chroot jails, OS-level virtualization delivers complete, self-contained operating system environments per container, encompassing full user-space hierarchies, independent filesystems, and multi-process execution. This enables containers to function as lightweight, portable units akin to virtual machines but with native kernel access.[1]
The architecture features a single host kernel at its core, servicing processes from multiple containers through isolated namespaces that provide distinct filesystems, process trees, and resource domains, while cgroups overlay constraints to govern shared hardware access across instances. This layered design ensures efficient resource utilization and strong separation without the need for a hypervisor.[11][1]
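These bounded, per-container views are directly observable on Linux, where each process's namespace memberships appear as symbolic links under /proc/<pid>/ns. The following minimal C sketch simply prints those links for the calling process; run inside and outside a container, the differing inode identifiers show which namespaces are private to the container.

```c
/* Minimal sketch: listing the namespaces the current process belongs to
 * by reading the /proc/self/ns/* symlinks on Linux. Two processes share
 * a namespace exactly when the reported identifiers (e.g.
 * "pid:[4026531836]") are equal. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *kinds[] = { "pid", "net", "mnt", "uts", "ipc", "user", "cgroup" };
    char path[64], target[128];

    for (size_t i = 0; i < sizeof(kinds) / sizeof(kinds[0]); i++) {
        snprintf(path, sizeof(path), "/proc/self/ns/%s", kinds[i]);
        ssize_t n = readlink(path, target, sizeof(target) - 1);
        if (n < 0)
            continue;                 /* namespace type not supported by this kernel */
        target[n] = '\0';
        printf("%-6s -> %s\n", kinds[i], target);
    }
    return 0;
}
```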
Historical Development
The origins of OS-level virtualization trace back to early Unix mechanisms designed to enhance security and isolation. In 1979, the chroot system call was introduced in Unix Version 7, allowing processes to be confined to a specific subdirectory as their apparent root filesystem, effectively creating a lightweight form of isolation without full kernel separation.[9] This precursor laid foundational concepts for restricting file system access in shared environments. Building on this, FreeBSD introduced Jails in 2000 with the release of FreeBSD 4.0, providing more comprehensive isolation by virtualizing aspects of the file system, users, and network stack within a single kernel, enabling multiple independent instances of the operating system.[12]
The early 2000s saw the emergence of similar technologies in Linux, driven by the need for efficient server partitioning. In 2001, Jacques Gélinas developed Linux VServer, a patch-based approach that allowed multiple virtual private servers to run isolated on a single physical host by modifying the kernel to support context switching for processes.[12] This was followed in 2005 by OpenVZ, a commercial offering from SWsoft (later Virtuozzo) based on a modified Linux kernel, which introduced resource controls and process isolation for hosting multiple virtual environments with minimal overhead.[13] By 2008, the Linux Containers (LXC) project, initiated by engineers at IBM, combined Linux kernel features like cgroups for resource limiting and namespaces for isolation to create user-space tools for managing containers, marking a shift toward standardized, non-patched implementations.[12]
The 2010s brought widespread adoption through innovations that simplified deployment and orchestration. Docker, first released in 2013 by Solomon Hykes and the dotCloud team, revolutionized OS-level virtualization by introducing a portable packaging format and runtime based on LXC (later its own libcontainer), making containers accessible for developers and dramatically increasing their use in application deployment.[14] Its impact popularized containerization, shifting focus from infrastructure management to DevOps workflows. In 2014, Google open-sourced Kubernetes, an orchestration system evolved from its internal Borg tool, enabling scalable management of containerized applications across clusters and integrating seamlessly with Docker for automated deployment, scaling, and operations.[15] Microsoft entered the space around 2016 with Windows Server containers, adapting the technology for Windows environments through partnerships with Docker, allowing isolated application execution sharing the host kernel.[16]
Key contributors have included major technology companies advancing the ecosystem. Google has been pivotal through its development of core kernel features like namespaces and cgroups, as well as Kubernetes, which by 2024 managed billions of containers weekly.[12] Red Hat has contributed extensively to upstream Linux components, LXC tooling, and Kubernetes via projects like OpenShift, fostering open-source standards through the Open Container Initiative.[12] As of 2025, advancements include deeper integration with Kubernetes for hybrid cloud workloads and enhancements in Windows Server 2025 (released November 2024), such as expanded container portability allowing Windows Server 2022-based containers to run on 2025 hosts and improved support for HostProcess containers in node operations.[17]
Technical Operation
Core Mechanisms
OS-level virtualization initializes containers through a kernel-mediated process creation that establishes isolated execution contexts sharing the host operating system kernel. The process begins when the container runtime invokes the clone() system call to spawn the container's init process, specifying flags that configure its resource sharing and execution environment.[18] The kernel handles subsequent system calls from this process and its descendants by applying the predefined constraints, mapping them to a bounded view of system resources and preventing interference with the host or other containers. This mapping treats container processes as standard host processes but confines their operations to the allocated scopes, enabling lightweight virtualization without hypervisor overhead.
Resource allocation in OS-level virtualization is primarily governed by control groups (cgroups), a kernel feature that hierarchically organizes processes and enforces limits on CPU, memory, and I/O usage to prevent resource contention. In the unified cgroup v2 hierarchy, the CPU controller applies quotas via the cpu.max parameter, which specifies maximum execution time within a period; for instance, setting "200000 1000000" limits a container to 200 milliseconds of CPU time in every 1-second period (20% of one CPU), throttling excess usage under the fair scheduler.[19] The memory controller imposes hard limits through memory.max, such as "1G" to cap usage at 1 gigabyte, invoking the out-of-memory killer if the limit is breached after failed reclamation attempts.[19] For I/O, the io controller regulates bandwidth and operations per second using io.max, exemplified by "8:16 rbps=2097152" to restrict reads on block device 8:16 to 2 MB/s, delaying requests that exceed the quota.[19]
Filesystem handling leverages overlay filesystems to compose container root filesystems from immutable base images and mutable overlays, optimizing storage by avoiding full copies. OverlayFS, integrated into the Linux kernel since version 3.18, merges a writable upper directory with one or more read-only lower directories into a single view, directing all modifications to the upper layer while reads fall back to lower layers if needed.[20] Upon write access to a lower-layer file, OverlayFS performs a copy-up operation to replicate it in the upper layer, ensuring changes do not alter shared read-only bases; this mechanism supports efficient layering in container images, where multiple containers can reference the same lower layers concurrently.[20]
Networking in OS-level virtualization is configured using virtual Ethernet (veth) devices paired with software bridges to provide isolated yet interconnected network stacks for containers. A veth pair is created such that one endpoint resides in the container's network context and the other in the host's, with the host endpoint enslaved to a bridge interface acting as a virtual switch.[21] This setup enables container-to-container communication over the bridge, as packets transmitted from one veth end are received on its peer and forwarded accordingly; for external access, the bridge often integrates with host routing and NAT rules to simulate a local subnet.[21]
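A minimal C sketch of the cgroup v2 interface described above: it assumes the unified hierarchy is mounted at /sys/fs/cgroup, that the cpu and memory controllers are enabled for child groups, and it uses a made-up group named demo. It caps the calling process (and anything it executes) at roughly 20% of one CPU and 256 MiB of memory.

```c
/* Minimal sketch of cgroup v2 limits: create a child cgroup, set cpu.max
 * and memory.max, enroll the current process, then exec a shell under
 * those limits. Requires permission to write the cgroup files; "demo"
 * is a hypothetical group name. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s", value);
    return fclose(f);
}

int main(void)
{
    char buf[64];

    if (mkdir("/sys/fs/cgroup/demo", 0755) != 0)
        perror("mkdir");                                  /* may already exist */

    /* 20 ms of CPU time per 100 ms period, i.e. about 20% of one CPU */
    write_file("/sys/fs/cgroup/demo/cpu.max", "20000 100000");
    /* hard memory ceiling of 256 MiB */
    write_file("/sys/fs/cgroup/demo/memory.max", "268435456");

    /* enroll this process; children inherit the limits */
    snprintf(buf, sizeof(buf), "%d", (int)getpid());
    write_file("/sys/fs/cgroup/demo/cgroup.procs", buf);

    /* anything executed from here runs under the configured limits */
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");
    return EXIT_FAILURE;
}
```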
Isolation Techniques
OS-level virtualization achieves isolation primarily through kernel-provided primitives that segment system resources and views for containerized processes, preventing interference with the host or other containers. In Linux, the dominant platform for this technology, these techniques leverage namespaces, capability restrictions, and syscall filters to enforce boundaries without emulating hardware. This approach contrasts with full virtualization by sharing the host kernel, which necessitates careful privilege management to maintain security.
Linux namespaces provide per-process isolation by creating separate instances of kernel resources, allowing containers to operate in abstracted environments. The PID namespace (introduced in kernel 2.6.24) isolates process identifiers, enabling each container to maintain its own PID hierarchy where the init process appears as PID 1, thus preventing process visibility and signaling across boundaries.[22] The network namespace (since kernel 2.6.24) segregates network interfaces, IP addresses, routing tables, and firewall rules, allowing containers to have independent network stacks without affecting the host or peers.[22] Mount namespaces (available since kernel 2.4.19) isolate filesystem mount points, permitting containers to view customized directory structures while the host sees the global filesystem, which supports private overlays for application data.[22] User namespaces (introduced in kernel 3.8) remap user and group IDs between the container and host, enabling unprivileged users on the host to run as root inside the container via ID mappings, thereby confining privilege escalations.[22] Finally, IPC namespaces (since kernel 2.6.19) separate System V IPC objects and POSIX message queues, ensuring inter-process communication remains confined within the container and does not leak to others.[22]
To further restrict kernel interactions, Linux capabilities decompose root privileges into granular units, allowing container processes to execute only authorized operations. Capabilities such as CAP_SYS_ADMIN for administrative tasks or CAP_NET_BIND_SERVICE for port binding are dropped or bounded for container threads, preventing unauthorized system modifications while retaining necessary functionality.[23] Complementing this, seccomp (secure computing mode, available since kernel 2.6.12 and enhanced with BPF filters in 3.5) confines system calls by loading user-defined filters that allow, kill, or error on specific invocations, reducing the kernel attack surface in containers by blocking potentially exploitable paths.[24]
Rootless modes enhance isolation by eliminating the need for host root privileges during container execution, relying on user namespaces to map container root to a non-privileged host user. In implementations like Docker's rootless mode or Podman's default operation, containers run under the invoking user's context, avoiding daemon privileges and limiting escape risks from compromised containers.[25] This approach confines file access, network bindings, and device interactions to user-permitted scopes, improving security in multi-tenant environments.[26]
Despite these techniques, kernel sharing introduces inherent limitations, as all containers and the host execute within the same kernel space, enabling vulnerability propagation. A kernel bug exploitable by one container can compromise the entire system, including other containers, due to shared memory and resources; for instance, abstract resource exhaustion attacks can deplete global kernel structures like file descriptors or network counters from non-privileged containers, causing denial-of-service across isolates.[27] Namespaces and capabilities mitigate some interactions but fail against kernel-level flaws, underscoring the need for complementary host hardening.[27]
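The syscall filtering mentioned above can be illustrated with seccomp's original prctl-based strict mode. Real container runtimes install far more permissive BPF filters built from allow-lists, so the sketch below is only a minimal, Linux-specific demonstration of the confinement principle.

```c
/* Minimal sketch of syscall confinement with seccomp "strict" mode,
 * which permits only read(2), write(2), _exit(2) and sigreturn(2). */
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "confined: only read, write, _exit and sigreturn are allowed\n";

    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0) {
        perror("prctl(PR_SET_SECCOMP)");
        return 1;
    }
    /* write(2) is still permitted by strict mode... */
    write(STDOUT_FILENO, msg, strlen(msg));
    /* ...but any other syscall would terminate the process with SIGKILL.
     * glibc's _exit() uses exit_group(2), which strict mode forbids,
     * so invoke the raw exit(2) syscall instead. */
    syscall(SYS_exit, 0);
    return 0;                        /* not reached */
}
```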
Comparisons to Other Virtualization Methods
With Full Virtualization
OS-level virtualization, often implemented through container technologies, fundamentally differs from full virtualization in its architectural approach. In OS-level virtualization, multiple isolated environments share the host operating system's kernel, leveraging mechanisms such as namespaces and control groups to provide process isolation without emulating hardware.[28] In contrast, full virtualization employs a hypervisor to create virtual machines (VMs), each running a complete guest operating system with its own kernel on emulated or paravirtualized hardware, introducing an additional layer of abstraction between the guest and physical resources.[29] This shared-kernel model in OS-level virtualization avoids the overhead of kernel emulation, enabling lighter-weight isolation at the operating system level.[30]
Performance implications arise primarily from these architectural differences. OS-level virtualization achieves near-native performance due to the absence of hypervisor-mediated hardware emulation, resulting in lower CPU and memory overhead (typically under 3% for basic operations) compared to full virtualization, where hypervisor intervention can impose up to 80% higher latency for I/O-intensive tasks.[28] However, the shared kernel in OS-level virtualization introduces risks, such as potential system-wide impacts from a compromised or faulty container, whereas full virtualization's separate kernels enhance fault isolation but at the cost of increased resource consumption, including larger memory footprints (e.g., several gigabytes per VM for a full OS).[29] This efficiency in resource usage allows OS-level virtualization to support a higher density of instances on the same hardware.
The suitability of each method depends on the deployment environment. OS-level virtualization excels in lightweight, homogeneous setups where applications run on the same host kernel, such as scaling microservices in cloud-native architectures, but it is limited to compatible operating systems.[28] Full virtualization, conversely, supports diverse guest operating systems and provides stronger isolation for heterogeneous or security-sensitive workloads, making it preferable for running legacy applications or untrusted code across different OS families.[29] For instance, hosting multiple Linux distributions on a Linux-based host is more efficient via containers like those in Docker, which share the kernel for rapid deployment, whereas VMs would require separate kernels and hypervisor orchestration for the same task, increasing overhead.[30]
With Application Virtualization
OS-level virtualization and application virtualization both enable isolation and portability for software execution but differ fundamentally in scope and implementation. OS-level virtualization creates lightweight, isolated environments that mimic full operating system instances by sharing the host kernel while partitioning user-space resources such as processes, filesystems, and networks.[32] In contrast, application virtualization focuses on encapsulating individual applications with their dependencies in a sandboxed layer, abstracting them from the underlying OS without replicating OS-level structures.[32] This distinction arises because OS-level approaches, like containerization, virtualize at the kernel boundary to support multiple isolated services or workloads, whereas application virtualization operates higher in the stack, targeting app-specific execution.[33]
A primary difference lies in the isolation scope: OS-level virtualization provides broad separation affecting entire process trees, filesystems, and networking stacks, often using kernel features like namespaces for comprehensive containment.[32] Application virtualization, however, offers narrower isolation, typically limited to the application's libraries, registry entries, or file accesses, preventing conflicts with the host OS or other apps but not extending to full system-like boundaries.[34] For instance, in application virtualization, mechanisms like virtual filesystems or registry virtualization shield the app from host modifications, but the app still interacts directly with the host kernel for core operations.[34]
Regarding overhead and portability, OS-level virtualization incurs minimal runtime costs due to kernel sharing but is inherently tied to the host kernel's compatibility, limiting cross-OS deployment; for example, Linux containers require a Linux host.[32] Application virtualization generally has even lower overhead, as it avoids OS emulation entirely, and enhances portability by bundling dependencies to run across OS versions or distributions without kernel constraints.[35] This makes app-level approaches suitable for diverse environments, though they provide less comprehensive isolation, potentially exposing more to host vulnerabilities.[36]
Representative examples highlight these contrasts. Docker, an OS-level virtualization tool, packages applications with their OS dependencies into containers that include isolated filesystems and processes, enabling consistent deployment of multi-process services but requiring kernel compatibility. Flatpak, an application virtualization framework for Linux desktops, bundles apps with runtimes and dependencies in sandboxed environments, prioritizing cross-distribution portability and app-specific isolation without full OS replication.[35] Similarly, the Java Virtual Machine (JVM) virtualizes execution at the bytecode level, isolating Java applications through managed memory and security sandboxes, but it operates as a process on the host OS rather than providing OS-wide separation.[36] Windows App-V streams virtualized applications in isolated bubbles, avoiding installation conflicts via virtualized files and registry, yet it remains dependent on the Windows host without container-like process isolation.[34]
Benefits and Limitations
Key Advantages
OS-level virtualization offers low resource overhead compared to full virtualization methods, as containers share the host kernel and require no guest OS emulation, enabling near-native performance with minimal CPU and memory consumption.[37] This shared kernel architecture results in significantly faster startup times, typically in seconds for containers versus minutes for virtual machines that must boot an entire OS.[38] For instance, empirical studies show containers achieving startup latencies under 1 second in lightweight configurations, allowing for rapid deployment and scaling in resource-constrained environments.[28]
A key advantage is the flexibility provided by image-based deployment, which facilitates easy portability and scaling across homogeneous host systems sharing the same kernel.[39] Container images encapsulate applications and dependencies in a standardized format, enabling seamless migration between development, testing, and production hosts without reconfiguration, thus supporting dynamic orchestration in clustered setups.[38] This portability is particularly beneficial for microservices architectures, where workloads can be replicated or load-balanced efficiently on compatible infrastructure.
Storage efficiency is enhanced through layered filesystems, such as the union filesystems used in implementations like Docker, which minimize duplication by sharing read-only base layers among multiple containers or images. For example, if several containers derive from the same base image, the common layers are stored once, so five containers built from a 7.75 MB image collectively use far less space than equivalent virtual machine disk copies, because copy-on-write only duplicates modified files. This approach not only conserves storage but also accelerates image pulls and container instantiation by avoiding full filesystem replication.[40]
In development and testing, OS-level virtualization ensures consistent environments that closely mirror production setups, mitigating issues like "it works on my machine" by packaging applications with exact dependencies in portable images.[38] Developers can replicate production-like isolation for testing without the overhead of full OS instances, fostering faster iteration cycles and reducing deployment discrepancies across teams.[41]
Challenges and Drawbacks
One of the primary challenges in OS-level virtualization is the heightened security risk stemming from the shared kernel architecture, where all containers run on the host system's kernel. This shared model means that a vulnerability in the kernel can compromise every container simultaneously, unlike full virtualization where each virtual machine has its own isolated kernel.[42] For instance, kernel-level exploits, such as those involving namespace breaches or privilege escalations, enable container escape attacks that allow malicious code to access the host system or other containers.[42] Research analyzing over 200 container-related vulnerabilities has identified shared kernel issues as a key enabler of such escapes, with examples including CVE-2019-5736, where attackers overwrite the runc binary to gain host privileges.[42] Additionally, the reduced isolation compared to hypervisor-based systems amplifies the attack surface, particularly in multi-tenant environments, as resource sharing facilitates side-channel attacks and timing vulnerabilities.[43] Recent research, such as the 2025 CKI proposal, explores hardware-software co-designs to provide stronger kernel isolation for containers.[44]
Compatibility limitations further constrain OS-level virtualization, as it restricts deployments to operating systems and kernel variants compatible with the host kernel. Containers cannot natively support guest operating systems different from the host, such as running a Windows container on a Linux host, without additional emulation layers that introduce significant overhead.[1] Kernel version mismatches exacerbate this issue; for example, an older container image built for an earlier kernel may fail on a newer host due to changes in system calls or libraries, as seen in cases where RHEL 6 containers encounter errors like useradd failures on RHEL 7 hosts because of libselinux incompatibilities.[45] This lack of flexibility also limits architectural diversity, preventing seamless support for different CPU architectures without emulation, which undermines the efficiency gains of containerization.[1]
Managing OS-level virtualization at scale introduces significant complexity, particularly in orchestration, scaling, and debugging across shared resources. Without dedicated tools like Kubernetes, administrators must manually handle provisioning, load balancing, and updates for numerous containers, which becomes impractical in large deployments involving hundreds of nodes.[46] Scaling requires careful monitoring to avoid under- or over-allocation, while debugging is hindered by the need to trace issues across interconnected, shared-kernel environments, often lacking automated health checks or self-healing mechanisms.[46] Even with orchestration platforms, enforcing consistent security and network configurations adds overhead, as the ephemeral nature of containers demands precise coordination to prevent downtime or misconfigurations.[46]
Persistence and state management pose additional hurdles in OS-level virtualization, especially for stateless designs that prioritize ephemerality but struggle with stateful applications. Containers are inherently transient, losing all internal data upon restart or redeployment, which complicates maintaining consistent state for applications like databases that require durable storage.[47] This necessitates external mechanisms, such as persistent volumes in Kubernetes, to decouple data from the container lifecycle, yet integrating these introduces risks of configuration drift and challenges in ensuring data integrity across cluster mobility or failures.[48] In Kubernetes environments, the declarative model excels for stateless workloads but conflicts with persistent data needs, often leading to manual interventions for backups, migrations, or recovery, with recovery time objectives potentially exceeding 60 minutes without specialized solutions.[48]
Implementations
Linux-Based Systems
Linux-based systems dominate OS-level virtualization due to the kernel's native support for key isolation and resource management primitives. The Linux kernel provides foundational features such as namespaces, which isolate process IDs, network stacks, mount points, user IDs, inter-process communication, and time, enabling containers to operate in isolated environments without emulating hardware. Control groups (cgroups), particularly the unified hierarchy in cgroups v2 introduced in kernel 4.5 in 2016 and stabilized in subsequent releases up to 2025, allow precise resource limiting, accounting, and prioritization for CPU, memory, I/O, and network usage across containerized processes.[19] These features, matured through iterative kernel development, form the bedrock for higher-level tools by enabling lightweight, efficient virtualization without full OS emulation.
LXC (Linux Containers) serves as a foundational userspace interface to these kernel capabilities, allowing users to create and manage system containers that run full Linux distributions with init systems and multiple processes.[49] It offers a powerful API for programmatic control and simple command-line tools like lxc-create, lxc-start, and lxc-execute to handle container lifecycles, with built-in templates for bootstrapping common distributions such as Ubuntu or Fedora. LXC emphasizes flexibility for low-level operations, including direct manipulation of namespaces and cgroups, making it suitable for development and testing environments where fine-grained control is needed.[50]
Building on LXC, LXD provides a higher-level, API-driven management layer for system containers and virtual machines, offering a RESTful API for remote administration and clustering support across multiple hosts.[51] Developed by Canonical, LXD enables unified management of full Linux systems in containers via command-line tools like lxc (its client) or graphical interfaces, with features such as live migration, snapshotting, and device passthrough for enhanced scalability in production setups.[52] As of 2025, LXD 5.x LTS releases include improved security profiles and integration with cloud storage for image distribution, positioning it as a robust alternative for enterprise container orchestration.[53]
Docker revolutionized containerization as a runtime that leverages OCI (Open Container Initiative) standards for image packaging and execution, allowing developers to build, ship, and run applications in isolated environments with minimal overhead. Its image format uses layered filesystems for efficient storage and sharing, where changes to base images create immutable layers, reducing duplication and enabling rapid deployments. The ecosystem extends through tools like Docker Compose, which defines multi-container applications via YAML files specifying services, networks, and volumes, facilitating complex setups like microservices architectures with a single docker-compose up command.[54] By 2025, Docker's runtime has evolved to support rootless modes and enhanced security scanning, solidifying its role in DevOps workflows.[55]
Podman and Buildah offer daemonless, rootless alternatives to Docker, emphasizing security by avoiding a central privileged service and allowing non-root users to manage containers.[56] Podman, developed by Red Hat, provides Docker-compatible CLI commands for running, pulling, and inspecting OCI images while integrating seamlessly with systemd for service management and supporting pod-like groupings for Kubernetes-style deployments.[57] Its rootless operation confines privileges within user namespaces, mitigating risks from daemon vulnerabilities, and as of 2025, it includes GPU passthrough and build caching for performant workflows. Complementing Podman, Buildah focuses on image construction without launching containers, using commands like buildah from and buildah run to layer instructions from Containerfiles, enabling secure, offline builds in CI/CD pipelines.[58]
Systemd-nspawn acts as a lightweight, integrated tool within the systemd suite for bootstrapping and running containers from disk images or directories, providing basic isolation via kernel namespaces without external dependencies.[59] It supports features like private networking, bind mounts for shared resources, and seamless integration with systemd's journaling for logging, making it ideal for quick testing or chroot-like environments on systemd-based distributions. As a built-in utility since systemd 220 in 2014, it excels in simplicity for single-host scenarios, with capabilities to expose container consoles and manage ephemeral instances via machinectl.[59]
Other Operating Systems
FreeBSD introduced jails in version 4.0 in March 2000 as a mechanism for OS-level virtualization, building on the chroot concept to provide isolated environments with fine-grained resource controls such as CPU limits, memory restrictions, and network isolation, similar to zones in other systems.[60] Jails allow multiple isolated instances of the FreeBSD userland to run securely on a single kernel by restricting process visibility and privileges, enabling efficient consolidation of services without full hardware emulation.[61]
Oracle Solaris implemented zones starting with Solaris 10 in 2005, featuring a global zone that oversees the system and multiple non-global zones that share the host kernel while providing isolated filesystems, processes, and network stacks for application containment.[62] Illumos, the open-source derivative of Solaris, retains zones with comparable functionality, integrating ZFS filesystem support for efficient snapshots and cloning of zone environments to facilitate rapid deployment and rollback.[63] This design emphasizes resource pooling and scalability for enterprise workloads, with zones configured via XML manifests for properties like CPU shares and IP filtering.
Microsoft's Windows Containers, available since Windows Server 2016, operate in two isolation modes: process-isolated containers that share the host kernel for lightweight operation, and Hyper-V isolated containers that use a dedicated kernel in a minimal virtual machine for stronger security boundaries against kernel exploits.[64] Post-2020 enhancements via Windows Subsystem for Linux 2 (WSL 2) enable running OCI-compliant Linux containers on Windows hosts using a lightweight Hyper-V VM, independently from native Windows Containers, which are limited to Windows workloads.[65]
Apple's Virtualization framework, introduced in macOS 11 Big Sur in 2020, supports OS-level virtualization through APIs for creating lightweight virtual machines that emulate container-like isolation on Apple Silicon and Intel-based systems, optimized for running Linux guests with minimal overhead.[66] Emerging tools in the 2020s, such as open-source projects building on this framework, enable OCI-standard Linux containers on macOS, providing secure, native execution without third-party hypervisors. Notably, Apple's open-source Containerization project, released at WWDC 2025, enables running OCI-compliant Linux containers natively on macOS using the Virtualization framework.[67]
Cross-platform interoperability in OS-level virtualization is advanced by runc, the reference command-line tool implementing the Open Container Initiative (OCI) runtime specification for Linux since its v1.0 release in 2017, with the specification (updated to v1.3 in November 2025 to officially include FreeBSD) enabling consistent container bundle formats and execution across platforms including FreeBSD, Solaris derivatives, Windows, and macOS via platform-specific implementations.[68][69]
Applications and Adoption
Primary Use Cases
OS-level virtualization, commonly implemented through container technologies, finds its primary applications in environments demanding lightweight isolation, portability, and scalability for modern software development and deployment.[2] This approach allows multiple isolated user-space instances to run on a shared kernel, making it ideal for dynamic workloads without the overhead of full operating system emulation.[2]
A key use case is in microservices architecture, where containers package individual services with their dependencies, enabling independent development, scaling, and deployment within cloud-native applications.[70] This facilitates breaking down complex applications into smaller, loosely coupled components that can be orchestrated using tools like Kubernetes, with surveys indicating that 80% of organizations leverage such setups for production microservices.[70] By sharing the host kernel, containers reduce resource consumption compared to virtual machines, allowing teams to iterate rapidly without interference.[2]
In continuous integration and continuous deployment (CI/CD) pipelines, OS-level virtualization provides isolated, ephemeral environments for automated building, testing, and deployment of code.[71] Tools like Jenkins integrated with Docker create consistent setups that mirror production, ensuring reproducibility and minimizing "it works on my machine" issues across development stages.[71] This setup supports rapid feedback loops, as containers start quickly and allow incremental vulnerability scanning during the pipeline.[71]
Server consolidation represents another major application, where multiple applications or services are hosted on a single physical machine to optimize hardware utilization and reduce infrastructure costs.[2] Unlike full virtualization, OS-level containers avoid duplicating entire operating systems, enabling efficient packing of workloads on bare metal or cloud hosts without VM sprawl.[72] This method is particularly effective for legacy application migration, as it consolidates underutilized servers into fewer instances while maintaining isolation.[72]
For edge computing, OS-level virtualization supports lightweight deployments on resource-constrained devices, such as System-on-Chip (SoC) platforms in IoT or remote locations.[73] Solutions like Linux Containers (LXC) and Docker exhibit low overhead on these systems, with LXC demonstrating minimal CPU and memory impact for high-performance tasks, making it suitable for real-time processing near data sources.[73] Containers' portability ensures applications function consistently from development to edge deployment, addressing latency and bandwidth limitations in distributed environments.[2]
Industry Examples
In the cloud computing sector, Amazon Web Services (AWS) leverages OS-level virtualization through its Elastic Container Service (ECS) with Fargate, enabling serverless deployment of containerized applications for scalable workloads such as data processing and microservices, supporting up to 16 vCPUs and 120 GB of memory per task without managing underlying infrastructure.[74] Similarly, Google Cloud's Kubernetes Engine (GKE) utilizes containers to orchestrate massive-scale workloads, accommodating clusters of up to 65,000 nodes and integrating with AI infrastructure for efficient gen AI inference, achieving 30% lower serving costs and 40% higher throughput compared to traditional setups.[75]
Netflix exemplifies DevOps adoption of OS-level virtualization by employing Docker containers to orchestrate millions of instances weekly, facilitating rapid deployment cycles and enhanced velocity in continuous integration/continuous deployment (CI/CD) pipelines, which supports A/B testing for feature rollouts and personalization experiments.[76] For enterprise environments, IBM and Red Hat promote hybrid cloud management via OpenShift, a Kubernetes-based platform that deploys containerized applications across on-premises, private, and public clouds, streamlining operations for large-scale modernization and ensuring consistency in multicloud strategies.[77]
Emerging trends highlight integration with AI/ML pipelines, where containerized TensorFlow models via TensorFlow Extended (TFX) enable end-to-end production workflows on Kubernetes-orchestrated environments, supporting scalable data processing, training, and serving for enterprise AI adoption.[78] In telecommunications, Verizon incorporates containers through Red Hat OpenShift on its 5G Edge platform to virtualize network functions and enable low-latency edge computing, accelerating innovation in mobile edge applications and hybrid infrastructure.[79]
References
- https://learn.microsoft.com/en-us/virtualization/windowscontainers/about/containers-vs-vm
