Second Level Address Translation
Second Level Address Translation (SLAT), also known as nested paging, is a hardware-assisted virtualization technology that avoids the overhead associated with software-managed shadow page tables.
AMD has supported SLAT through the Rapid Virtualization Indexing (RVI) technology since the introduction of its third-generation Opteron processors (code name Barcelona). Intel's implementation of SLAT, known as Extended Page Table (EPT), was introduced in the Nehalem microarchitecture found in certain Core i7, Core i5, and Core i3 processors.
ARM's virtualization extensions support SLAT in the form of Stage-2 page tables, handled by a dedicated Stage-2 MMU; the guest uses the Stage-1 MMU. Support was added as an optional feature of the ARMv7ve architecture and is also present in the ARMv8 (32-bit and 64-bit) architectures.
Overview
The introduction of protected mode to the x86 architecture with the Intel 80286 processor brought the concepts of physical memory and virtual memory to mainstream architectures. When processes use virtual addresses and an instruction requests access to memory, the processor translates the virtual address to a physical address using a page table or translation lookaside buffer (TLB). When a virtual machine is running, the host allocates virtual memory of its own that serves as the guest's physical memory, and the same address-translation process also takes place within the guest system. This increases the cost of memory access, since the translation must be performed twice – once inside the guest system (using the software-emulated guest page table), and once inside the host system (using the physical map, pmap).
A software-based shadow page table is a common way to reduce translation overhead compared to double translation. Shadow page tables translate guest virtual addresses directly to host physical addresses. Each VM has a separate shadow page table, and the hypervisor is in charge of managing them. While shadow page tables are faster than double translation, they are still expensive compared to not running in a virtual machine: every time a guest updates its page tables, the hypervisor must also apply the corresponding changes to the shadow tables.
In order to make this translation more efficient, processor vendors implemented technologies commonly called SLAT. By treating each guest-physical address as a host-virtual address, a slight extension of the hardware used to walk a non-virtualized page table (now the guest page table) can walk the host page table. With multilevel page tables the host page table can be viewed conceptually as nested within the guest page table. A hardware page table walker can treat the additional translation layer almost like adding levels to the page table.
Using SLAT and multilevel page tables, the number of page-table levels that must be walked to find a translation doubles when the guest-physical address is the same size as the guest-virtual address and the same page sizes are used. This increases the importance of caching values from intermediate levels of the host and guest page tables. It is also helpful to use large pages in the host page tables to reduce the number of levels (e.g., in x86-64, using 2 MB pages removes one level in the page table). Since memory is typically allocated to virtual machines at coarse granularity, using large pages for guest-physical translation is an obvious optimization, reducing the depth of look-ups and the memory required for host page tables.
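As a rough worst-case illustration (ignoring the TLB and page-walk caches): with an $n$-level guest page table and an $m$-level host (nested) page table, each of the $n$ guest-table entries is fetched through an $m$-level nested walk, and the resulting guest-physical address needs one more nested walk, so

$$\text{memory references} = n(m+1) + m = nm + n + m .$$

For four-level tables on both sides (x86-64 with 4 KB pages) this gives $4 \cdot 4 + 4 + 4 = 24$ references, versus 4 for a native walk, which is why page-walk caching and large pages matter so much under SLAT.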
Implementations
Rapid Virtualization Indexing
Rapid Virtualization Indexing (RVI), known as Nested Page Tables (NPT) during its development, is an AMD second-generation hardware-assisted virtualization technology for the processor memory management unit (MMU).[1][2] RVI was introduced in the third generation of Opteron processors, code name Barcelona.[3]
A VMware research paper found that RVI offers up to 42% gains in performance compared with software-only (shadow page table) implementation.[4] Tests conducted by Red Hat showed a doubling in performance for OLTP benchmarks.[5]
Extended Page Tables
Extended Page Tables (EPT) is an Intel second-generation x86 virtualization technology for the memory management unit (MMU). EPT support is found in Intel's Core i3, Core i5, Core i7 and Core i9 CPUs, among others.[6] It is also found in some newer VIA CPUs. EPT is required in order to launch a logical processor directly in real mode, a feature called "unrestricted guest" in Intel's jargon, and introduced in the Westmere microarchitecture.[7][8]
According to a VMware evaluation paper, "EPT provides performance gains of up to 48% for MMU-intensive benchmarks and up to 600% for MMU-intensive microbenchmarks", although it can actually cause code to run slower than a software implementation in some corner cases.[9]
Stage-2 page-tables
Stage-2 page-table support is present in ARM processors that implement exception level 2 (EL2).
Extensions
Mode Based Execution Control
Mode Based Execution Control (MBEC) is an extension to x86 SLAT implementations first available in Intel Kaby Lake and AMD Zen+ CPUs (known on the latter as Guest Mode Execute Trap, or GMET).[10] The extension splits the execute permission in the extended page table into two bits: one for user-mode execute and one for supervisor-mode execute.[11]
MBEC was introduced to speed up execution of unsigned guest usermode code under kernelmode code integrity enforcement. Under this configuration, unsigned code pages can be marked as executable in usermode, but must be marked no-execute in kernelmode. To maintain integrity by ensuring that all guest kernelmode executable code is signed even when the guest kernel is compromised, the guest kernel does not have permission to modify the execute bit of any memory pages. Modification of the execute bit, or switching of the guest page table that contains the execute bit, is delegated to a higher-privileged entity, in this case the host hypervisor. Without MBEC, each transition from unsigned usermode execution to signed kernelmode execution must be accompanied by a VM exit to the hypervisor to switch to the kernelmode page table, and the return from signed kernelmode to unsigned usermode must be accompanied by another VM exit to switch back. VM exits significantly impact code execution performance.[12][13] With MBEC, the same page table can be shared between unsigned usermode code and signed kernelmode code, with two sets of execute permissions depending on the execution context. VM exits are no longer necessary when the execution context switches between unsigned usermode and signed kernelmode.
Support in software
Hypervisors that support SLAT include the following:
- Hyper-V for Windows Server 2008 R2, Windows 8 and later.[14] Hyper-V on Windows 8 and later Microsoft Windows requires SLAT.[15][16]
- Hypervisor.framework, a native macOS hypervisor, available since macOS 10.10[17]
- KVM, since version 2.6.26 of the Linux kernel mainline[18][19]
- Parallels Desktop for Mac, since version 5[20]
- VirtualBox, since version 2.0.0[21]
- VMware ESX, since version 3.5[4]
- VMware Workstation; version 14 and later require SLAT.[22]
- Xen, since version 3.2.0[23]
- Qubes OS — SLAT mandatory[24]
- bhyve[25][26] — SLAT mandatory and slated to remain mandatory
- vmm, a native hypervisor on OpenBSD — SLAT mandatory[27][28]
- ACRN, an open-source lightweight hypervisor, built with real-time and safety-criticality in mind, optimized for IoT and Edge usages.[29]
- QEMU, an open-source embeddable hypervisor and chipset emulator.[30][31][32][33][34]
Some of the above hypervisors require SLAT in order to work at all (not merely faster), as they do not implement a software shadow page table; the list does not consistently indicate which.
See also
- AMD-V (codename Pacifica) – the first-generation AMD hardware virtualization support
- Page table
- VT-x
References
[edit]- ^ "Rapid Virtualization Indexing with Windows Server 2008 R2 Hyper-V | The Virtualization Blog". Blogs.amd.com. 2009-03-23. Retrieved 2010-05-16.
- ^ "AMD-V Nested Paging" (PDF). July 2008. Archived from the original (PDF) on 2012-09-05. Retrieved 2013-12-11.
- ^ "VMware engineer praises AMD's Nested Page Tables". Searchservervirtualization.techtarget.com. 2008-07-21. Retrieved 2010-05-16.
- ^ a b "Performance Evaluation of AMD RVI Hardware Assist" (PDF). Retrieved 2010-05-16.
- ^ "Red Hat Magazine | Red Hat Enterprise Linux 5.1 utilizes nested paging on AMD Barcelona Processor to improve performance of virtualized guests". Magazine.redhat.com. 2007-11-20. Retrieved 2010-05-16.
- ^ "Intel Virtualization Technology List". Ark.intel.com. Retrieved 2014-02-17.
- ^ "Intel added unrestricted guest mode on Westmere micro-architecture and later Intel CPUs, it uses EPT to translate guest physical address access to host physical address. With this mode, VMEnter without enable paging is allowed."
- ^ "Intel 64 and IA-32 Architectures Developer's Manual, Vol. 3C" (PDF). Intel. Retrieved 13 December 2015.
If the 'unrestricted guest' VM-execution control is 1, the 'enable EPT' VM-execution control must also be 1.
- ^ Performance Evaluation of Intel EPT Hardware Assist
- ^ Cunningham, Andrew (2021-08-27). "Why Windows 11 has such strict hardware requirements, according to Microsoft". Ars Technica. Retrieved 2024-03-18.
- ^ Mulnix, David L. "Intel Xeon Processor Scalable Family Technical Overview". intel. Retrieved 3 September 2021.
- ^ Analysis of the Attack Surface of Windows 10 Virtualization-based Security
- ^ Arkley, Brent. "The potential performance Impact of Device Guard (HVCI)". Borec's Legacy meets Modern Device Management Blog. Retrieved 3 September 2021.
- ^ "AMD-V Rapid Virtualization Indexing and Windows Server 2008 R2 Hyper-V Second Level Address Translation". Doing IT Virtual. Retrieved 2010-05-16.
- ^ Bott, Ed (2011-12-08). "Does your PC have what it takes to run Windows 8's Hyper-V?". ZDNet. Retrieved 2014-02-17.
- ^ "Support & Drivers". Retrieved 13 December 2015.
- ^ "Hypervisor | Apple Developer Documentation".
- ^ "Kernel Newbies: Linux 2 6 26".
- ^ Sheng Yang (2008-06-12). "Extending KVM with new Intel Virtualization technology" (PDF). linux-kvm.org. KVM Forum. Archived from the original (PDF) on 2014-03-27. Retrieved 2013-03-17.
- ^ Inc, Parallels. "KB Parallels: What's new in Parallels Desktop 5 for Mac". kb.parallels.com. Retrieved 2016-04-12.
- ^ "Changelog for VirtualBox 2.0". Archived from the original on 2014-10-22.
- ^ liz. "VMware Workstation 14 Pro Release Notes". docs.vmware.com. Retrieved 2020-11-19.
- ^ "Benchmarks: Xen 3.2.0 on AMD Quad-Core Opteron with RVI". 2008-06-15. Retrieved 2011-05-13.
- ^ "Hardware Compatibility List (HCL)". Qubes OS. Retrieved 2020-01-06.
- ^ Implementation of a BIOS emulation support for BHyVe: A BSD Hypervisor
- ^ "21.7. FreeBSD as a Host with bhyve". Retrieved 13 December 2015.
- ^ Coming Soon to OpenBSD/amd64: A Native Hypervisor
- ^ vmm(4) — virtual machine monitor
- ^ ACRN Memory Management High-Level Design
- ^ "Features/VT-d - QEMU". wiki.qemu.org. Retrieved 2023-11-12.
- ^ "Hyper-V Enlightenments — QEMU documentation". www.qemu.org. Retrieved 2023-11-12.
- ^ "Add Intel VT-d nested translation [LWN.net]". lwn.net. Retrieved 2023-11-12.
- ^ "Intel Virtualisation: How VT-x, KVM and QEMU Work Together". Binary Debt. 2018-10-14. Retrieved 2023-11-12.
- ^ "Features/KVMNestedVirtualizationTestsuite - QEMU". wiki.qemu.org. Retrieved 2023-11-12.
Background
Address Translation Basics
In computer systems, memory address translation enables processes to operate within a virtual address space that is abstracted from the underlying physical memory. A virtual address (VA) is the memory address generated by a program or CPU instruction, while a physical address (PA) refers to the actual location in the system's RAM where data is stored. The Memory Management Unit (MMU), a hardware component integrated into the CPU, performs this translation by using data structures called page tables to map virtual pages to physical frames, thereby supporting features like memory protection, isolation, and efficient memory allocation.[1]

Page tables implement this mapping through a hierarchical structure of entries that divide memory into fixed-size pages, typically 4 KB in x86 architectures. In basic 32-bit x86 paging, a two-level structure is used: a page directory (a 4 KB table with 1024 entries, each 4 bytes) serves as the top level, indexed by the upper 10 bits of the virtual page number (VPN), and points to page tables (also 4 KB each with 1024 page table entries (PTEs)), which are indexed by the next 10 bits of the VPN. Each PTE is 4 bytes and contains key fields: bit 0 (present bit) indicates if the page is mapped; bit 1 controls read/write permissions; bit 2 handles user/supervisor access; bits 3 and 4 manage caching (PWT and PCD); bit 5 tracks accessed state; bit 6 tracks dirty state; and bits 12-31 provide the 20-bit physical page base address for 4 KB pages. Large pages of 2 MB or 4 MB can be supported directly via page directory entries (PDEs) with similar fields but shifted base addresses. The CR3 control register holds the physical base address of the page directory, which the MMU loads on context switches to apply per-process mappings.[1]

The evolution from 32-bit to 64-bit paging addressed limitations in addressable memory. Introduced with the Pentium Pro processor, Physical Address Extension (PAE) extended 32-bit virtual addressing to support up to 36-bit (64 GB) physical addressing by expanding PTEs and PDEs to 64 bits and adding a page directory pointer table (PDPT) as a third level, with CR3 pointing to the PDPT's 4 entries (each selecting a 1 GB region). In 64-bit x86 (Intel 64 architecture), paging uses a four-level hierarchy – PML4 (page map level 4), PDPT, PD, and PT – with 512 entries per level (9 bits each, plus a 12-bit offset for 48-bit virtual addresses), enabling up to 256 TB virtual and up to 4 PB physical space (implementation-dependent) using 4 KB pages; large pages of 2 MB (via PD) and 1 GB (via PDPT) reduce table overhead with analogous entry formats but larger base fields (up to 52 bits for physical addresses). This structure maintains backward compatibility while scaling for larger systems.[1]

The address translation process splits the VA into a VPN and offset, then performs sequential lookups. For a 32-bit VA with 4 KB pages:

$$\mathrm{VA} = \mathrm{VPN} \times 2^{12} + \mathrm{offset}, \qquad 0 \le \mathrm{offset} < 2^{12},$$

where the VPN (20 bits) is divided into a directory index (10 bits) and a table index (10 bits). The MMU indexes the page directory with the directory index to find the page table base, indexes that table with the table index to retrieve the PPN (20 bits), and constructs the PA as:

$$\mathrm{PA} = \mathrm{PPN} \times 2^{12} + \mathrm{offset}.$$

If the present bit is unset, a page fault occurs, allowing the OS to handle missing pages. Similar logic applies to multi-level 64-bit and PAE structures, with additional indices for each level.[1]
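As a minimal illustration of the index split described above (a sketch not tied to any particular OS or hypervisor), the following C fragment decomposes a canonical 48-bit x86-64 virtual address into its four 9-bit table indices and 12-bit page offset:

```c
#include <stdint.h>
#include <stdio.h>

/* Decompose a canonical 48-bit x86-64 virtual address into the four
 * 9-bit indices of the 4-level paging hierarchy and the 12-bit page
 * offset (4 KB pages). Illustrative only. */
struct va_parts {
    uint64_t pml4, pdpt, pd, pt, offset;
};

static struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.offset = va & 0xFFF;          /* bits 11:0  */
    p.pt     = (va >> 12) & 0x1FF;  /* bits 20:12 */
    p.pd     = (va >> 21) & 0x1FF;  /* bits 29:21 */
    p.pdpt   = (va >> 30) & 0x1FF;  /* bits 38:30 */
    p.pml4   = (va >> 39) & 0x1FF;  /* bits 47:39 */
    return p;
}

int main(void)
{
    struct va_parts p = split_va(0x00007f8a12345678ULL);
    printf("PML4=%llu PDPT=%llu PD=%llu PT=%llu offset=0x%llx\n",
           (unsigned long long)p.pml4, (unsigned long long)p.pdpt,
           (unsigned long long)p.pd, (unsigned long long)p.pt,
           (unsigned long long)p.offset);
    return 0;
}
```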
Virtualization Challenges

In virtualized environments, memory address translation involves three distinct address spaces: guest virtual addresses (GVAs), which are used by applications running within a virtual machine (VM); guest physical addresses (GPAs), which represent the physical memory as perceived by the guest operating system; and host physical addresses (HPAs), which correspond to the actual physical memory locations on the host hardware.[2] This multi-level mapping arises because the guest OS manages its own page tables for GVA-to-GPA translations, unaware that its GPAs are themselves virtualized and must be mapped to HPAs by the hypervisor (or virtual machine monitor, VMM).[3]

Prior to hardware support for second-level address translation, hypervisors relied on shadow paging to emulate guest memory management. In this approach, the hypervisor maintains shadow page tables that directly map GVAs to HPAs by combining the guest's page tables (GVA-to-GPA) with its own host page tables (GPA-to-HPA), allowing the CPU's hardware walker to perform translations without guest awareness.[2] However, to ensure consistency, the hypervisor must trap and emulate any guest modifications to its page tables, such as writes to control registers like CR3 or page table entries, leading to frequent VM exits. Additionally, every guest TLB miss or page fault triggers a trap to the hypervisor, which then walks the guest tables, applies host mappings, and updates the shadow structures on demand.[3]

These mechanisms impose significant challenges in virtualized systems. High CPU overhead stems from VM exits on page faults and TLB misses, with each exit-entry pair costing over 1,000 CPU cycles on typical x86 hardware, potentially amplifying page fault latency by factors of 10 to 50 times compared to native execution in workloads with frequent memory accesses.[3] Scalability suffers with multiple VMs, as the hypervisor must maintain separate shadow page tables for each, increasing memory consumption and management complexity proportional to the number of guests.[4] Furthermore, resource allocation techniques like memory ballooning – where the hypervisor inflates a balloon driver in the guest to reclaim idle pages for host use – exacerbate issues by inducing additional guest-level paging and faults, straining the already overhead-heavy shadow paging system during overcommitment scenarios.[5]

Core Concepts
First-Level Address Translation
First-level address translation refers to the process by which a guest operating system (OS) in a virtualized environment maps guest virtual addresses (GVAs) to guest physical addresses (GPAs) using its own page tables, independent of the host system's physical memory layout. This mechanism is fundamental to memory management in the guest, mirroring non-virtualized paging but operating within the virtual machine's allocated address space. It relies on the memory management unit (MMU) of the underlying hardware to perform the translation, caching results in the translation lookaside buffer (TLB) for efficiency.

The step-by-step process begins when the guest OS issues a GVA during a memory access. The MMU uses the GVA to index into the guest's multi-level page table hierarchy, starting from the base address stored in a dedicated register. Each level of the page table provides an index to the next level until the page table entry (PTE) mapping the final GPA is reached. If the entry is valid, the translation succeeds and the access proceeds to the GPA; otherwise, a page fault is generated within the guest OS, which handles allocation or swapping as needed. TLB caching accelerates subsequent accesses by storing recent GVA-to-GPA mappings, with invalidations triggered by guest OS actions like context switches.

In x86 architectures, the guest page directory base is held in the CR3 control register, which the guest OS loads during process switches to point to the current page directory. For 64-bit systems, the hierarchy typically includes a page map level 4 (PML4), page directory pointer tables, page directories, and page tables, enabling addressing of up to 2^48 bytes of virtual memory with 4 KB pages. Permission checks occur at each level, including read/write access bits and the no-execute (NX) bit in the PTE to enforce data execution prevention. (Intel SDM Volume 3A, Chapter 4)

The ARM architecture provides an equivalent for first-level translation at Exception Level 1 (EL1), the guest OS privilege level, using the Translation Table Base Registers TTBR0 and TTBR1 to hold the base addresses of the stage-1 translation tables. These support multi-level tables (level 0 down to level 3) for 4 KB granule pages, with similar permission attributes, such as access permissions and execute-never flags, checked during traversal. (ARM Architecture Reference Manual, Armv8-A)

The first-level translation can be expressed as

$$\mathrm{GPA} = T_{\text{guest}}(\mathrm{GVA}),$$

with a guest page fault if the entry is invalid or permissions are violated.
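To make the per-entry permission checks concrete, the sketch below (macro names are illustrative; bit positions follow the x86-64 PTE layout summarized earlier) tests the present, read/write, user/supervisor, and no-execute bits of a 64-bit page-table entry:

```c
#include <stdbool.h>
#include <stdint.h>

/* Commonly used x86-64 PTE bits (see the layout described above). */
#define PTE_PRESENT (1ULL << 0)   /* entry maps a page                 */
#define PTE_RW      (1ULL << 1)   /* writable if set                   */
#define PTE_USER    (1ULL << 2)   /* user-mode accessible if set       */
#define PTE_NX      (1ULL << 63)  /* no-execute if set (with EFER.NXE) */

/* Illustrative check of whether one entry permits an access; a real
 * MMU consults these bits at every level of the hierarchy. */
bool pte_allows(uint64_t pte, bool write, bool user, bool exec)
{
    if (!(pte & PTE_PRESENT))
        return false;             /* would raise a page fault */
    if (write && !(pte & PTE_RW))
        return false;
    if (user && !(pte & PTE_USER))
        return false;
    if (exec && (pte & PTE_NX))
        return false;
    return true;
}
```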
Second-Level Address Translation Mechanism

Second Level Address Translation (SLAT) implements a two-stage address translation process to enable efficient memory virtualization without relying on software emulation or shadow paging techniques. In the first stage, the memory management unit (MMU) translates the guest virtual address (GVA) to a guest physical address (GPA) using the guest operating system's page tables. The second stage then maps the GPA to the host physical address (HPA) via hypervisor-controlled SLAT structures, allowing the hardware to compose the full translation path directly. This process preserves the guest's view of its memory while ensuring isolation and mapping to host resources.[6][7]

SLAT table structures mirror conventional page tables but operate on the intermediate GPA space to produce HPAs. Each entry typically contains the host physical base address, permissions enforced by the hypervisor (such as read, write, and execute controls), and reserved or ignored bits that maintain guest isolation by preventing the guest from inferring host layout details. Additionally, many SLAT implementations support hardware-updated accessed and dirty flags in entries, which the host can use for memory management without guest involvement, similar to standard paging mechanisms but applied at the host level. These structures are pointed to by a dedicated register or control, enabling the hypervisor to switch contexts efficiently across virtual machines.[7][8]

In hardware operation, upon a TLB miss during guest execution, the MMU initiates a combined page walk: it first traverses the guest's page tables to derive the GPA from the GVA, then immediately walks the SLAT tables to obtain the HPA, appending the original page offset to form the final physical address. The full translation can be represented as

$$\mathrm{HPA} = T_{\text{SLAT}}\bigl(T_{\text{guest}}(\mathrm{GVA})\bigr),$$

where the guest tables handle the first-level mapping and SLAT the second. Virtual machine exits to the hypervisor occur primarily for second-level faults (e.g., invalid GPA mappings or permission violations at the host level), while many first-level faults can be handled directly by the guest OS, minimizing context switches and overhead. This integrated flow contrasts with shadow paging approaches, which require hypervisor intervention for nearly every guest page fault to synchronize duplicated tables.[6][7][8]

By providing hardware acceleration for the complete translation chain, SLAT significantly reduces the performance penalty of virtualization, enabling near-native memory access speeds in many workloads. Early implementations supported host physical address spaces up to 48 bits, sufficient for systems with gigabytes of RAM at the time of introduction. This mechanism has become foundational for modern hypervisors, supporting scalable deployment of virtual machines without prohibitive translation costs.
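The following toy model (a deliberately simplified sketch, not hardware-accurate: single-level "tables" stand in for the real multi-level hierarchies, and all table contents are made up) shows the composition of the two stages and how the page offset is carried through unchanged:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model of a two-stage translation with single-level "page tables"
 * so the composition stays visible. Real hardware walks multi-level
 * tables, and each guest-table read itself goes through the SLAT. */

#define PAGE_SHIFT 12
#define NPAGES     16

/* Guest page table: guest-virtual page -> guest-physical page. */
static uint64_t guest_pt[NPAGES] = { [0] = 3, [1] = 7, [2] = 5 };

/* SLAT (hypervisor-controlled): guest-physical page -> host-physical page. */
static uint64_t slat[NPAGES]     = { [3] = 40, [5] = 41, [7] = 42 };

static uint64_t translate(uint64_t gva)
{
    uint64_t offset = gva & ((1ULL << PAGE_SHIFT) - 1);

    /* Stage 1: the guest's own tables map GVA -> GPA. */
    uint64_t gpa_page = guest_pt[gva >> PAGE_SHIFT];

    /* Stage 2: the SLAT maps GPA -> HPA; the page offset is preserved. */
    uint64_t hpa_page = slat[gpa_page];

    return (hpa_page << PAGE_SHIFT) | offset;
}

int main(void)
{
    printf("GVA 0x1234 -> HPA 0x%llx\n",
           (unsigned long long)translate(0x1234));
    return 0;
}
```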
Hardware Implementations

Intel Extended Page Tables
Intel Extended Page Tables (EPT) were introduced in 2008 with the Nehalem microarchitecture as part of the Virtual Machine Extensions (VMX) to enable efficient second-level address translation in virtualized environments.[9] EPT provides hardware support for mapping guest physical addresses directly to host physical addresses, reducing the overhead associated with software-managed shadow paging. The EPT pointer, known as the Extended-Page-Table Pointer (EPTP), is a 64-bit field stored in the Virtual Machine Control Structure (VMCS), with bits 51:12 specifying the physical base address of the EPT paging structures.[9]

The EPT hierarchy mirrors the guest's paging structures and consists of up to four levels: a Page Map Level 4 (PML4) table, a Page Directory Pointer Table (PDPT), a Page Directory (PD), and a Page Table (PT).[9] It supports page sizes of 4 KB, 2 MB, and 1 GB, with entries formatted to include a host physical address offset spanning bits 51:12 (effectively a 40-bit address for 48-bit physical memory), along with permission bits for read, write, and execute access.[9] Additional bits in EPT entries include execute-host protection and suppression of execute-host/disable controls to optimize virtualization performance. A key feature is the unrestricted guest mode, which allows the guest to execute in unpaged or real-address mode (with CR0.PG=0) without triggering VM exits, thereby reducing context switches.[9] EPT violations, such as unauthorized read or write access, cause a VM exit with an associated exit qualification providing error codes to inform the hypervisor of the fault type.[9]

EPT evolved across subsequent processor generations to address growing memory demands and enhance efficiency. Host physical address support remained at 36 bits in the Westmere microarchitecture (2010). It was expanded to 39 bits in the Haswell microarchitecture (2013).[9] The Haswell microarchitecture introduced Accessed and Dirty (A/D) bits in EPT entries, allowing hardware to track memory access and modification without requiring hypervisor or guest OS intervention, which improves performance in dynamic virtualization scenarios.[9] Subsequent generations, such as Ice Lake (2019), introduced 5-level EPT support, enabling up to 57-bit virtual addressing and up to 52-bit physical addressing in later implementations like Sapphire Rapids (2023).[10]

The first commercial deployment of EPT occurred with VMware ESXi 4.0, released on May 21, 2009, which integrated hardware-assisted MMU virtualization to leverage EPT for reduced memory management overhead.[11][12]
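A small sketch of how a hypervisor might compose a 4 KB EPT leaf entry from the fields named above (read in bit 0, write in bit 1, execute in bit 2, host-physical address in bits 51:12); the helper name is hypothetical, and real entries carry further fields (memory type, accessed/dirty, and so on) documented in the Intel SDM:

```c
#include <stdint.h>

/* Permission and address fields of an EPT leaf entry, as summarized above. */
#define EPT_READ      (1ULL << 0)
#define EPT_WRITE     (1ULL << 1)
#define EPT_EXECUTE   (1ULL << 2)
#define EPT_ADDR_MASK 0x000FFFFFFFFFF000ULL   /* bits 51:12 */

/* Build a 4 KB EPT leaf entry mapping a host-physical page with the
 * requested permissions. Illustrative only; a real entry also needs
 * memory-type and other bits set correctly. */
uint64_t make_ept_pte(uint64_t hpa, int r, int w, int x)
{
    uint64_t e = hpa & EPT_ADDR_MASK;
    if (r) e |= EPT_READ;
    if (w) e |= EPT_WRITE;
    if (x) e |= EPT_EXECUTE;
    return e;
}
```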
AMD Nested Page Tables

AMD's second-level address translation, known as Nested Page Tables (NPT), was introduced as part of the Secure Virtual Machine (SVM) extensions with the Family 10h processors, including the Barcelona core, in 2007. Originally developed under the name Rapid Virtualization Indexing (RVI), it provides hardware support for translating guest physical addresses to host physical addresses without the overhead of shadow paging. The NPT root pointer is stored in the nCR3 field of the Virtual Machine Control Block (VMCB), which holds the host physical address of the top-level NPT.[13][14]

The NPT employs a four-level page table hierarchy compatible with 48-bit guest physical addressing, mirroring the structure of standard x86-64 page tables. Each NPT entry includes a 40-bit base address pointing to the next level or page frame, along with permission bits for read (R), write (W), and execute (X) access, which are enforced in combination with guest page table permissions – the stricter rule applies. Additional host-mode-only bits control features like caching and presence, ensuring secure isolation. NPT supports standard page sizes of 4 KB, 2 MB, and 1 GB, allowing hypervisors to optimize memory mappings with large pages to reduce translation overhead.[15][14]

Nested paging is enabled via the NP_ENABLE bit in the VMCB during SVM operation, and support is detectable through CPUID function 8000_000A EDX. When a translation is required, the hardware performs a two-stage walk: first through the guest page tables to obtain a guest physical address, then through the NPT to the host physical address. Page faults are classified as guest faults (handled by the guest OS) or nested page faults (NPF), which trigger a VMEXIT to the hypervisor; the NPF includes an error code in EXITINFO1 and the faulting guest physical address in EXITINFO2 for efficient diagnosis and resolution. This mechanism directs faults appropriately while minimizing hypervisor intervention in valid translations.[14][15]

The RVI branding was phased out after 2010, with official documentation standardizing on NPT thereafter. In the 2011 Bulldozer microarchitecture (Family 15h), NPT gained enhanced support for 9-bit superpage indexing, facilitating more efficient handling of large 1 GB pages in virtualized environments. The 2017 Zen microarchitecture further improved NPT performance through larger TLBs and better page walk caches, reducing latency for nested translations and boosting overall virtualization efficiency. NPT received early software support, with integration into the Linux Kernel-based Virtual Machine (KVM) hypervisor beginning in 2007 alongside initial SVM compatibility.[15][16]
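As a small host-side illustration (assuming GCC or Clang on an x86 system), the nested-paging capability can be queried through the CPUID leaf mentioned above, 0x8000000A, whose EDX bit 0 reports NPT support:

```c
#include <cpuid.h>
#include <stdio.h>

/* Query AMD's SVM feature leaf (0x8000000A); EDX bit 0 indicates
 * nested paging (NPT) support. Assumes GCC/Clang on x86. */
int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x8000000A, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 0x8000000A not available");
        return 1;
    }
    printf("Nested paging (NPT): %s\n",
           (edx & 1) ? "supported" : "not reported");
    return 0;
}
```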
ARM Stage-2 Translation

Second Level Address Translation in the ARM architecture is realized through Stage-2 translation, introduced as part of the Virtualization Extensions in the ARMv7-A profile in 2011. This mechanism enables hypervisors operating at Exception Level 2 (EL2) to translate Intermediate Physical Addresses (IPAs) – equivalent to guest physical addresses generated by the guest's Stage-1 translation – to final Physical Addresses (PAs), ensuring memory isolation between virtual machines.[17][18]

Stage-2 translation tables are configured via the VTTBR_EL2 register, which specifies the base address of the table and includes a Virtual Machine Identifier (VMID) for disambiguating translations in the TLB. The table structure supports 3-level or 4-level hierarchies depending on the IPA width and implementation, such as 40-bit IPAs in ARMv8 configurations. Descriptors within these tables contain the base PA, access permissions (read/write/execute), and PXN/UXN bits to prevent execution of sensitive code at privileged or unprivileged levels, respectively, thereby enforcing security boundaries.[8][19]

During address translation, the hardware performs sequential walks of the guest's Stage-1 tables (at EL1) followed by the Stage-2 tables (at EL2), applying the more restrictive attributes from either stage. A Stage-2 fault, such as permission violations or invalid mappings, generates an exception directly to EL2 for hypervisor intervention, integrating seamlessly with ARM's exception model. Supported page and block sizes include 4 KB, 64 KB, and 2 MB, aligning with the architecture's memory management granularity.[8][20]

The feature evolved significantly in ARMv8 (2013), which introduced 64-bit addressing support and the VTCR_EL2 register to control Stage-2 parameters like granule size and table levels. Subsequent updates, including ARMv8.6 in 2020, added enhancements for advanced virtualization, such as improved support for confidential computing environments.[8][21] ARMv9 (2022) and later extensions like Armv9.5 further enhance Stage-2 with features such as hardware dirty bit support (FEAT_HDBSS) for better memory tracking in virtualized environments.[22]

This Stage-2 implementation is widely adopted in mobile and server SoCs, exemplified by Apple's M-series processors (introduced 2020) and AWS Graviton processors for cloud virtualization.[23]
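A tiny sketch of the "more restrictive attribute wins" rule mentioned above, using hypothetical permission flags rather than real ARM descriptor encodings (the actual combining rules for memory types and permissions are more detailed in the architecture manual):

```c
/* Hypothetical permission flags; real Stage-1/Stage-2 descriptors
 * encode these differently, but the combining idea is the same:
 * an access is allowed only if both stages allow it. */
enum { PERM_R = 1 << 0, PERM_W = 1 << 1, PERM_X = 1 << 2 };

unsigned combine_stages(unsigned stage1_perms, unsigned stage2_perms)
{
    return stage1_perms & stage2_perms;  /* most restrictive of the two */
}
```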
Extensions and Features

Mode-Based Execution Control
Mode-Based Execution Control (MBEC) is an extension to Intel's Extended Page Tables (EPT), part of the VT-x virtualization technology, that provides mode-specific execute permissions for guest-physical addresses (GPAs) in virtualized environments. Introduced with Skylake-generation processors in 2015, MBEC allows hypervisors to enforce different executability rules based on the privilege level (supervisor or user mode) of the accessing linear address, enhancing security by preventing unauthorized code execution across privilege boundaries without requiring full emulation by the hypervisor.[24][25] This feature builds on standard EPT by augmenting paging-structure entries to support granular control, reducing virtual machine exits (VM exits) that would otherwise occur during privilege-mode switches in guest code execution.[25]

The mechanism relies on specific fields in the Virtual Machine Control Structure (VMCS) to enable and configure MBEC. The secondary processor-based VM-execution control bit 22 ("mode-based execute control for EPT") must be set to 1, which requires the "activate secondary controls" bit (bit 13 in the primary controls) to also be enabled, alongside the "enable EPT" control.[25] The Extended-Page-Table Pointer (EPTP) in the VMCS (bits 51:12) points to the base of the EPT paging structures, where MBEC-augmented entries define permissions. In EPT page-table entries (PTEs), two dedicated bits provide the mode-specific controls: bit 2 indicates execute permission for supervisor-mode linear addresses (CPL=0), while bit 10 indicates execute permission for user-mode linear addresses (CPL=3).[25][26] This allows configurations where, for example, a GPA is executable only in supervisor mode. MBEC integrates with broader privilege enforcement mechanisms like Supervisor Mode Execution Prevention (SMEP) and Supervisor Mode Access Prevention (SMAP) by complementing their first-level page-table protections in the guest context, ensuring consistent enforcement across the two-stage translation process.[25]

MBEC is particularly useful for optimizing nested virtualization scenarios, where hypervisors manage multiple layers of guests with varying privilege requirements, or for supporting legacy operating systems by isolating code execution without excessive overhead from mode emulation.[25] It also aids in securing environments like Intel Software Guard Extensions (SGX) enclaves by restricting user-mode execution on sensitive GPAs, thereby protecting system code integrity from malicious modifications.[25] However, MBEC applies solely to instruction fetches (code execution attempts) and does not affect data reads or writes; it requires EPT to be fully enabled and is undefined if the control bit is not set appropriately.[25] Processor support for MBEC can be enumerated via the IA32_VMX_EPT_VPID_CAP MSR, ensuring compatibility in virtualized deployments.[25]
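A minimal sketch of the entry-level encoding described above (supervisor-execute in bit 2, user-execute in bit 10), with a made-up helper name for the example policy of allowing a page to execute only in user mode:

```c
#include <stdint.h>

/* With MBEC enabled, an EPT entry carries separate execute permissions:
 * bit 2 governs supervisor-mode fetches and bit 10 user-mode fetches
 * (read and write stay in bits 0 and 1). Names are illustrative. */
#define EPT_READ       (1ULL << 0)
#define EPT_WRITE      (1ULL << 1)
#define EPT_EXEC_SUPER (1ULL << 2)
#define EPT_EXEC_USER  (1ULL << 10)

/* Example policy: an unsigned guest code page may run only in user mode. */
uint64_t mbec_user_only_exec(uint64_t entry)
{
    entry |=  EPT_EXEC_USER;   /* allow user-mode execution   */
    entry &= ~EPT_EXEC_SUPER;  /* forbid kernel-mode execution */
    return entry | EPT_READ;
}
```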
Secure Nested Paging

Secure Nested Paging (SNP) extends second-level address translation (SLAT) mechanisms with hardware-accelerated memory encryption and integrity protections, enabling confidential computing environments where virtual machines (VMs) are isolated from hypervisor or host attacks. These extensions apply encryption keys transparently during SLAT walks from guest physical addresses (GPAs) to host physical addresses (HPAs), ensuring that memory contents remain confidential even if an attacker gains control of the host. Integrity features prevent unauthorized modifications, such as remapping or replay attacks, by validating page states and assignments during translation. SNP is particularly suited for confidential VMs, where the hypervisor acts as an untrusted entity, and faults are generated if key mismatches or integrity violations occur.

Intel's implementation, known as Secure Extended Page Tables (SEPT), integrates with Total Memory Encryption (TME) and Multi-Key Total Memory Encryption (MKTME) as part of Trust Domain Extensions (TDX), first available in 4th Gen Intel Xeon Scalable (Sapphire Rapids) processors in 2023, with broader availability in 5th Gen Xeon Scalable processors as of 2024.[27] TME provides system-wide encryption using a single key derived from hardware fuses, while MKTME extends this to up to 511 unique keys for finer granularity, allowing per-VM or per-tenant isolation. In SEPT, EPT entries include key identifiers (KeyIDs) that reference MKTME keys, applying encryption during the GPA-to-HPA translation without software intervention; this protects against physical attacks like cold boot or bus snooping. SEPT is a core component of Intel TDX, where the TDX module manages SEPT structures to enforce both confidentiality and replay-protected integrity for private memory pages. As of 2025, TDX support has expanded in cloud environments and Linux KVM.

AMD's Secure Nested Paging builds on Secure Memory Encryption (SME), introduced with EPYC "Naples" processors in 2017, and Secure Encrypted Virtualization (SEV) for per-VM keys. SEV sets a unique encryption key in the Nested Page Table (NPT) via a dedicated bit, enabling transparent encryption of guest memory during SLAT walks to thwart host-based attacks. SEV with Encrypted State (SEV-ES), available since EPYC "Rome" in 2019, extends this by encrypting VM register states for secure migration, while maintaining memory confidentiality. For integrity, AMD's SEV-Secure Nested Paging (SEV-SNP), introduced in EPYC "Milan" in 2021, uses a Reverse Map Protection (RMP) table alongside the NPT to assign pages to specific VMs and validate their states (e.g., assigned or locked), preventing remapping or injection attacks not through explicit hash chains but through hardware-enforced checks during translation. As of Linux kernel 6.11 in 2024, KVM added guest support for SEV-SNP.[28]

ARM's equivalent is the Realm Management Extension (RME), part of the Armv9-A architecture released in 2022, which introduces "Realms" as secure execution environments using stage-2 translations for isolation. RME employs granular memory tagging and encryption keys applied during stage-2 walks, with the Realm Management Monitor (RMM) handling key derivation and attestation. Pages in the Realm physical address space (RPAS) are encrypted per-Realm, and integrity is ensured via non-extendable stage-2 mappings that fault on mismatches; remote attestation tokens verify Realm configurations and measurements to attest to secure provisioning. As of 2025, RME is implemented in production SoCs such as those based on Cortex-X925 cores.

In these mechanisms, hardware applies keys at the memory controller during SLAT traversal, encrypting data on writes and decrypting on reads, with page faults triggered for key or integrity mismatches to prevent unauthorized access. This is tailored for confidential VMs, where guest memory is inaccessible in plaintext to the host or hypervisor. For example, AMD SEV has been supported in Linux KVM since kernel version 4.15 in 2018, providing protection against host attacks such as Rowhammer by ensuring bit flips affect only encrypted data.[29]

Software Integration
Hypervisor Support
VMware vSphere and ESXi have supported second-level address translation (SLAT) since version 4.0, released in 2009, enabling the use of Intel Extended Page Tables (EPT) and AMD Nested Page Tables (NPT) for hardware-assisted memory virtualization. The hypervisor automatically detects compatible hardware and configures SLAT accordingly, eliminating the need for software-managed shadow page tables and thereby reducing the memory overhead associated with page table synchronization in overcommitted environments.

Microsoft Hyper-V integrated SLAT support starting with Windows Server 2008 R2 in 2009, leveraging EPT for Intel processors and NPT for AMD to streamline guest-to-host address translations.[30] This enables efficient dynamic memory allocation, where the hypervisor can adjust VM memory usage in real time based on demand, improving resource utilization without guest OS modifications.[31]

In open-source environments, KVM paired with QEMU utilizes SLAT through kernel modules such as kvm-intel for EPT and kvm-amd for NPT, allowing seamless hardware acceleration when available on the host CPU. Libvirt provides APIs for configuring SLAT in KVM-based VMs, including options to enable or disable nested paging via domain XML attributes, facilitating programmatic management of virtualization features.[32] Xen employs hardware-assisted paging (HAP) as its primary SLAT mechanism when supported by the hardware, falling back to software shadow paging only if SLAT is unavailable, to maintain compatibility.[33]

Benchmarks indicate that SLAT integration across these hypervisors yields significant performance improvements, particularly in memory-intensive workloads (up to 6x in microbenchmarks), compared to shadow paging, primarily by reducing VM exits and translation overhead.[12] As of 2025, major hypervisors such as Microsoft Hyper-V require SLAT for installation and operation, while others like VMware ESXi, KVM, and Xen strongly recommend it for optimal performance in production environments, with fallback to software shadow paging available but suboptimal.[34]
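As a quick host-side check on Linux (the sysfs paths below assume the in-tree kvm_intel and kvm_amd modules mentioned above and may differ by distribution), the following sketch reads whether KVM's EPT or NPT support is enabled:

```c
#include <stdio.h>

/* Print the value of a KVM module parameter if it exists. The paths
 * assume the standard kvm_intel (ept) and kvm_amd (npt) modules. */
static void show_param(const char *path)
{
    char buf[16] = "absent\n";
    FILE *f = fopen(path, "r");
    if (f) {
        if (!fgets(buf, sizeof buf, f))
            buf[0] = '\0';
        fclose(f);
    }
    printf("%s: %s", path, buf);
}

int main(void)
{
    show_param("/sys/module/kvm_intel/parameters/ept");
    show_param("/sys/module/kvm_amd/parameters/npt");
    return 0;
}
```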
Guest OS Interactions

Guest operating systems operate within SLAT-enabled virtualization environments in a fully transparent manner, relying on their conventional page tables for memory management without any knowledge of the second-level translation layer. The hypervisor intercepts modifications to the guest's CR3 register, which points to the base of the guest's page directory, and uses this information to update the SLAT root structure – such as the Extended Page Table Pointer (EPTP) on Intel platforms – ensuring that guest-physical addresses are correctly mapped to host-physical addresses. This interception occurs via hardware virtualization extensions, allowing the guest to function as if running on bare metal while the hypervisor maintains isolation and control.[35]

When a guest attempts to access memory, its page tables are walked first to derive a guest-physical address; if valid, the SLAT then performs the second translation to host-physical memory, all without guest involvement. Guest-initiated page faults, arising from invalid mappings in the guest's page tables, result in standard virtualization traps that the hypervisor resolves by emulating or forwarding the fault, potentially triggering an SLAT violation if the guest-physical address lacks a corresponding host mapping. No modifications to the guest operating system are required for basic SLAT functionality, preserving compatibility across unmodified binaries and enabling seamless migration from physical to virtual deployments. However, paravirtualization techniques, such as the virtio-balloon driver, can optimize interactions by allowing the guest to cooperatively inflate or deflate memory balloons, reducing SLAT pressure from overcommitment and improving overall resource efficiency in dense environments.[35][36]

In Linux-based guests, SLAT facilitates direct I/O memory management unit (IOMMU) mappings for device passthrough via VFIO, where assigned devices perform direct memory access (DMA) to guest-physical addresses that SLAT translates to host-physical ones, bypassing hypervisor mediation for low-latency I/O. Windows guests similarly benefit from SLAT's support for large page mappings, including 2 MB or 1 GB huge pages in the guest's address space, which the hypervisor can mirror in the SLAT structures to minimize translation lookaside buffer (TLB) misses and enhance performance for memory-intensive workloads. These capabilities were first integrated into major guest kernels around 2008, with KVM enabling EPT support by Linux kernel version 2.6.26, and have become standard across server operating systems by 2025.[37][38][39]

In edge cases like nested virtualization, where a guest acts as a hypervisor running its own virtual machines, SLAT operates in a layered fashion: the outer hypervisor's SLAT maps the guest hypervisor's physical addresses, while the guest hypervisor manages an inner SLAT (or equivalent) for its nested guests, requiring explicit enablement of nested paging extensions to avoid excessive VM exits. This setup demands coordination between the outer hypervisor and the guest hypervisor to propagate translations correctly, often using vendor-specific controls like Intel's "unrestricted guest" mode or AMD's nested page table enhancements.[40]

Performance and Security
Efficiency Gains
Second Level Address Translation (SLAT) markedly reduces VM exits compared to shadow paging by offloading address translation to hardware, eliminating the need for hypervisor intervention on guest page table modifications and faults. Shadow paging can incur thousands of VM exits per second due to synchronization overhead, while SLAT limits these to hundreds or fewer in typical workloads, resulting in substantial cuts in context-switch costs.[12][41]

SLAT enhances TLB and cache efficiency through hardware-managed combined guest-host page walks, which support larger page sizes in Extended Page Tables (EPT) or Nested Page Tables (NPT) for improved hit rates. Native page walks are faster than emulated shadow paging; SLAT's nested walks add latency but impose modest overhead on modern hardware, far outperforming software emulation.[42][43]

In benchmarks, SLAT yields significant throughput gains; for instance, Intel EPT improves SPECjbb2005 performance by up to 6.4x with large pages, while AMD NPT delivers up to 3.7x in similar tests. These efficiencies enable greater memory consolidation, supporting higher VM densities per host. Compared to software-only approaches, SLAT also facilitates live migration with minimal downtime by streamlining memory state handling.[12][44]

Potential Vulnerabilities
Second Level Address Translation (SLAT) introduces several security risks in virtualized environments, primarily due to its role in managing memory isolation between guest virtual machines (VMs) and the host hypervisor. One prominent vulnerability is the Meltdown attack, disclosed in 2018, which exploits speculative execution during SLAT page table walks to leak privileged memory, including hypervisor data, by causing EPT violations in Intel systems that trigger VM exits and allow unauthorized reads of kernel or host memory.[45][43] Similarly, the L1 Terminal Fault (L1TF) vulnerability, also revealed in 2018, enables speculative access to data in the L1 data cache through faulty SLAT mappings, potentially exposing host or other guest data across VM boundaries in EPT or NPT implementations.[46]

Misconfigurations in EPT (Intel) or NPT (AMD) structures can further compromise isolation, permitting guest VMs to perform unauthorized accesses that escalate to host escapes, such as by incorrectly mapping guest-physical addresses to host memory regions, bypassing intended protections.[47] Rowhammer attacks, which induce bit flips in DRAM by repeatedly accessing adjacent rows, are amplified in SLAT-enabled environments where shared physical memory mappings allow a malicious guest to target and corrupt data in other VMs or the host through manipulated page allocations; as of 2025, research highlights inter-VM Rowhammer risks and mitigations like Copy-on-Flip for ECC memory.[48][49]

To counter these threats, hardware and software mitigations have been developed. The Indirect Branch Prediction Barrier (IBPB), introduced in 2018 via microcode updates, serializes indirect branch predictions to prevent speculative execution leaks across privilege levels, including those involving SLAT walks, and has been integrated into major hypervisors since its rollout. AMD's Secure Encrypted Virtualization (SEV) extension encrypts guest memory using per-VM keys during NPT translations, protecting against physical and some side-channel attacks on SLAT structures. ARM's Pointer Authentication Codes (PAC), introduced in ARMv8.3 in 2016, integrate with stage-2 translations by signing pointers to detect corruptions, enhancing resistance to exploits that target virtualized memory integrity. Microcode patches addressing these issues, including for Meltdown and L1TF, have been available since 2018 and are routinely applied in production systems. SLAT mechanisms are commonly used in confidential computing environments, such as Trusted Execution Environments (TEEs), to ensure attested isolation in cloud deployments handling sensitive workloads. However, advanced security features like SEV's encrypted page tables introduce trade-offs, imposing approximately 5-10% performance overhead due to additional encryption operations during address translations.[50]

References
- https://wiki.xenproject.org/wiki/Tuning_Xen_for_Performance
