Cache inclusion policy
from Wikipedia

Multi-level caches can be designed in various ways depending on whether the content of one cache is present in other levels of caches. If all blocks in the higher level cache are also present in the lower level cache, then the lower level cache is said to be inclusive of the higher level cache. If the lower level cache contains only blocks that are not present in the higher level cache, then the lower level cache is said to be exclusive of the higher level cache. If the contents of the lower level cache are neither strictly inclusive nor exclusive of the higher level cache, then it is called non-inclusive non-exclusive (NINE) cache.[1][2]

Inclusive Policy

Figure 1. Inclusive Policy

Consider an example of a two-level cache hierarchy where L2 can be inclusive, exclusive, or NINE with respect to L1. First, consider the case when L2 is inclusive of L1. Suppose there is a processor read request for block X. If the block is found in the L1 cache, the data is read from L1 and returned to the processor. If the block is not found in the L1 cache but is present in the L2 cache, the cache block is fetched from L2 and placed in L1. If this causes a block to be evicted from L1, L2 is not involved. If the block is not found in either L1 or L2, it is fetched from main memory and placed in both L1 and L2. If a block is later evicted from L2, the L2 cache sends a back-invalidation to the L1 cache so that inclusion is not violated.

As illustrated in Figure 1, initially consider both L1 and L2 caches to be empty (a). Assume that the processor sends a read X request. It will be a miss in both L1 and L2 and hence the block is brought into both L1 and L2 from the main memory as shown in (b). Now, assume the processor issues a read Y request which is a miss in both L1 and L2. So, block Y is placed in both L1 and L2 as shown in (c). If block X has to be evicted from L1, then it is removed from L1 only as shown in (d). If block Y has to be evicted from L2, it sends a back invalidation request to L1 and hence block Y is evicted from L1 as shown in (e).
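
The behaviour just described can be sketched in a few lines of Python. This is a minimal illustration, assuming fully associative caches with oldest-first eviction; the class and method names are invented for this example and do not come from any real implementation.

from collections import OrderedDict

class InclusiveHierarchy:
    def __init__(self, l1_size, l2_size):
        self.l1 = OrderedDict()   # block -> True; insertion order serves as age
        self.l2 = OrderedDict()
        self.l1_size, self.l2_size = l1_size, l2_size

    def read(self, block):
        if block in self.l1:                 # L1 hit
            return "L1 hit"
        if block in self.l2:                 # L2 hit: copy the block up into L1
            self._fill_l1(block)
            return "L2 hit"
        self._fill_l2(block)                 # miss in both: fetch from memory and
        self._fill_l1(block)                 # place in both levels (inclusion)
        return "miss"

    def _fill_l1(self, block):
        if len(self.l1) >= self.l1_size:
            self.l1.popitem(last=False)      # L1 eviction: L2 is not involved
        self.l1[block] = True

    def _fill_l2(self, block):
        if len(self.l2) >= self.l2_size:
            victim, _ = self.l2.popitem(last=False)
            self.l1.pop(victim, None)        # back-invalidation preserves inclusion
        self.l2[block] = True

h = InclusiveHierarchy(l1_size=1, l2_size=2)
h.read("X")    # miss: X placed in both L1 and L2, as in Figure 1(b)
h.read("Y")    # miss: Y placed in both; X is evicted from L1 only, as in 1(c)/1(d)
h.read("Z")    # miss: L2 is full, so its oldest block (X) is evicted and a
               # back-invalidation is sent to L1 (a no-op here, X already left L1)
print(list(h.l1), list(h.l2))   # -> ['Z'] ['Y', 'Z']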

In order for inclusion to hold, certain conditions need to be satisfied. L2 associativity must be greater than or equal to L1 associativity irrespective of the number of sets. The number of L2 sets must be greater than or equal to the number of L1 sets irrespective of L2 associativity. All reference information from L1 is passed to L2 so that it can update its replacement bits.
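
As a small illustration of the first two conditions, the following check (with an invented function name and parameters) tests whether a given L1/L2 geometry can support inclusion.

def inclusion_feasible(l1_sets, l1_ways, l2_sets, l2_ways):
    # L2 associativity must be >= L1 associativity, and L2 must have at least
    # as many sets as L1, for L2 to be able to hold every block held by L1.
    return l2_ways >= l1_ways and l2_sets >= l1_sets

print(inclusion_feasible(l1_sets=64, l1_ways=8, l2_sets=1024, l2_ways=8))  # True
print(inclusion_feasible(l1_sets=64, l1_ways=8, l2_sets=1024, l2_ways=4))  # False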

One example of an inclusive cache is an Intel quad-core processor with four 256 KB L2 caches and an 8 MB inclusive L3 cache.[3]

Exclusive Policy

Figure 2. Exclusive policy

Consider the case when L2 is exclusive of L1. Suppose there is a processor read request for block X. If the block is found in L1 cache, then the data is read from L1 cache and returned to the processor. If the block is not found in the L1 cache, but present in the L2 cache, then the cache block is moved from the L2 cache to the L1 cache. If this causes a block to be evicted from L1, the evicted block is then placed into L2. This is the only way L2 gets populated. Here, L2 behaves like a victim cache. If the block is not found in either L1 or L2, then it is fetched from main memory and placed just in L1 and not in L2.

As illustrated in Figure 2, initially consider both L1 and L2 caches to be empty (a). Assume that the processor sends a read X request. It will be a miss in both L1 and L2 and hence the block is brought into L1 from the main memory as shown in (b). Now, again the processor issues a read Y request which is a miss in both L1 and L2. So, block Y is placed in L1 as shown in (c). If block X has to be evicted from L1, then it is removed from L1 and placed in L2 as shown in (d).
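
A minimal Python sketch of this victim-cache behaviour, under the same assumptions as the inclusive example above (fully associative caches, oldest-first eviction, invented names):

from collections import OrderedDict

class ExclusiveHierarchy:
    def __init__(self, l1_size, l2_size):
        self.l1, self.l2 = OrderedDict(), OrderedDict()
        self.l1_size, self.l2_size = l1_size, l2_size

    def read(self, block):
        if block in self.l1:
            return "L1 hit"
        if block in self.l2:
            del self.l2[block]        # move (not copy) the block from L2 to L1
            self._fill_l1(block)
            return "L2 hit"
        self._fill_l1(block)          # miss: fetch from memory into L1 only
        return "miss"

    def _fill_l1(self, block):
        if len(self.l1) >= self.l1_size:
            victim, _ = self.l1.popitem(last=False)
            self._fill_l2(victim)     # L1 victim goes to L2: the only way L2 fills
        self.l1[block] = True

    def _fill_l2(self, block):
        if len(self.l2) >= self.l2_size:
            self.l2.popitem(last=False)   # L2 victim is simply dropped
        self.l2[block] = True

h = ExclusiveHierarchy(l1_size=1, l2_size=2)
h.read("X")               # miss: X placed in L1 only, as in Figure 2(b)
h.read("Y")               # miss: Y placed in L1; the evicted X moves to L2, as in 2(c)/2(d)
print(list(h.l1), list(h.l2))   # -> ['Y'] ['X']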

An example of an exclusive cache is the AMD Opteron, whose 512 KB (per core) L2 cache is exclusive of L1.[3]

NINE Policy

Figure 3. NINE policy

Consider the case when L2 is non-inclusive non-exclusive (NINE) of L1. Suppose there is a processor read request for block X. If the block is found in the L1 cache, the data is read from L1 and returned to the processor. If the block is not found in the L1 cache but is present in the L2 cache, the cache block is fetched from L2 and placed in L1. If this causes a block to be evicted from L1, L2 is not involved, just as in the inclusive policy. If the block is not found in either L1 or L2, it is fetched from main memory and placed in both L1 and L2. Unlike the inclusive policy, however, an eviction from L2 triggers no back-invalidation.

As illustrated in Figure 3, initially consider both L1 and L2 caches to be empty (a). Assume that the processor sends a read X request. It will be a miss in both L1 and L2 and hence the block is brought into both L1 and L2 from the main memory as shown in (b). Now, again the processor issues a read Y request which is a miss in both L1 and L2. So, block Y is placed in both L1 and L2 as shown in (c). If block X has to be evicted from L1, then it is removed from L1 only as shown in (d). If block Y has to be evicted from L2, it is evicted from L2 only as shown in (e).
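
A corresponding minimal sketch for the NINE case, again assuming fully associative caches with oldest-first eviction and invented names; the only difference from the inclusive sketch is that evictions at one level never touch the other level:

from collections import OrderedDict

class NineHierarchy:
    def __init__(self, l1_size, l2_size):
        self.l1, self.l2 = OrderedDict(), OrderedDict()
        self.l1_size, self.l2_size = l1_size, l2_size

    def read(self, block):
        if block in self.l1:
            return "L1 hit"
        if block in self.l2:
            self._fill(self.l1, self.l1_size, block)   # copy up; L2 keeps its copy
            return "L2 hit"
        self._fill(self.l2, self.l2_size, block)       # miss in both: place the block
        self._fill(self.l1, self.l1_size, block)       # in both levels, as with inclusion
        return "miss"

    @staticmethod
    def _fill(cache, size, block):
        # Evictions here never touch the other level: no back-invalidation,
        # and no victim insertion.
        if len(cache) >= size:
            cache.popitem(last=False)
        cache[block] = True

h = NineHierarchy(l1_size=1, l2_size=2)
h.read("X")      # miss: X placed in both levels, as in Figure 3(b)
h.read("Y")      # miss: Y placed in both; X evicted from L1 only, as in 3(c)/3(d)
h.read("Z")      # miss: L2 evicts its oldest block without back-invalidating L1, like 3(e)
print(list(h.l1), list(h.l2))   # -> ['Z'] ['Y', 'Z']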

An example of a non-inclusive non-exclusive cache is the AMD Opteron with a shared, non-inclusive 6 MB L3 cache.[3]

Comparison


The merit of the inclusive policy is that, in parallel systems with per-processor private caches, a cache miss causes the other peer caches to be checked for the block. If the lower level cache is inclusive of the higher level cache and the block misses in the lower level cache, then the higher level cache need not be searched. This implies a shorter miss latency for an inclusive cache compared to exclusive and NINE caches.[1]

A drawback of an inclusive policy is that the unique memory capacity of the hierarchy is determined by the lower level cache, unlike an exclusive cache, whose unique memory capacity is the combined capacity of all caches in the hierarchy.[4] If the lower level cache is small and comparable in size to the higher level cache, more cache capacity is wasted in inclusive caches. Although the exclusive cache has more unique memory capacity, it uses more bandwidth, since it fills new blocks at a higher rate (equal to the higher level cache's miss rate) than a NINE cache, which is filled with a new block only when it itself suffers a miss. The cost must therefore be weighed against the benefit when choosing between inclusive, exclusive, and NINE caches.

Value inclusion: a block need not have the same data values when it is cached in both the higher and lower level caches, even though inclusion is maintained; if the data values are the same, value inclusion is maintained as well.[1] This depends on the write policy in use, as a write-back policy does not notify the lower level cache of changes made to the block in the higher level cache. With a write-through cache, however, there is no such concern.
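
A tiny illustration of this point, using plain dictionaries as stand-ins for the two cache levels (purely illustrative):

# Under a write-back policy the L2 copy can go stale even though
# (address) inclusion still holds.
l1 = {"X": 1}          # block X cached in L1 with value 1
l2 = {"X": 1}          # inclusion: X is also present in L2

l1["X"] = 2            # processor store; a write-back L1 updates only its own copy
print(l1["X"] == l2["X"])   # False: inclusion holds, value inclusion does not

l1["X"] = 3            # with a write-through L1 the store is propagated immediately
l2["X"] = 3
print(l1["X"] == l2["X"])   # True: value inclusion maintained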

from Grokipedia
In computer architecture, a cache inclusion policy defines the relationship between data stored in different levels of a multi-level cache hierarchy, determining whether blocks in a higher-level cache (closer to the processor, such as L1) must also reside in lower-level caches (farther from the processor, such as L2 or L3), thereby influencing effective cache capacity, coherence mechanisms, and overall system performance. These policies are critical in modern processors to balance latency, bandwidth, and storage efficiency in the memory subsystem. There are three primary types of cache inclusion policies: inclusive, exclusive, and non-inclusive (also known as non-inclusive non-exclusive or NINE).

In an inclusive policy, all data blocks from higher-level caches are subsets of the lower-level caches, meaning a block in L1 must also exist in L2 or L3, which simplifies coherence protocols by avoiding the need for complex tracking but leads to data replication and reduced effective capacity. Conversely, an exclusive policy ensures that a data block resides in only one cache level at a time, with lower levels acting as victim caches for blocks from higher levels, maximizing unique storage and effective capacity but increasing coherence complexity and data movement overhead. A non-inclusive policy falls between these extremes, allowing blocks to exist in multiple levels without strict enforcement of inclusion or exclusion, offering flexibility in cache management while requiring additional mechanisms such as shadow tags for coherence in chip multiprocessor (CMP) systems.

The choice of inclusion policy significantly impacts cache management techniques, such as replacement and prefetching algorithms, with exclusive policies often benefiting multi-core workloads through higher effective capacity, while inclusive policies reduce off-chip bandwidth demands at the cost of single-core performance. As of 2025, modern processors implement these policies variably; for instance, Intel's Skylake and later architectures (such as Arrow Lake) use non-inclusive L3 caches, whereas AMD's Zen series employs mostly exclusive L3 caches to optimize for multi-threaded applications. Evaluations using benchmarks like SPEC CPU2006 show that exclusive policies can yield up to 38.9% speedup in multi-core scenarios with tailored replacement policies, highlighting the ongoing evolution of these designs to address processor-memory speed gaps.

Overview

Definition and Purpose

A cache inclusion policy defines the relationship between data blocks across multiple levels of a processor's cache hierarchy, specifying whether blocks present in an inner-level cache (such as L1, which is closest to the CPU core) must also reside in an outer-level cache (such as L2 or the last-level cache, LLC). This policy ensures data consistency by dictating inclusion or exclusion rules, thereby facilitating efficient access patterns in systems where inner caches prioritize speed and outer caches emphasize capacity. The purpose of cache inclusion policies is to manage data duplication in multi-level hierarchies, striking a balance between maximizing usable cache capacity, minimizing access latency, and reducing the overhead associated with cache coherence protocols. These policies emerged as a response to the widening performance gap between rapidly advancing CPU clock speeds and slower main memory latencies, enabling processors to exploit temporal and spatial locality more effectively without excessive redundancy. By governing how data propagates between levels, they support scalable designs in both single- and multi-core architectures, where coherence complexity can otherwise escalate with additional cache tiers.

Cache inclusion policies were first formalized in academic research during the late 1980s, amid the rise of multi-level cache designs for multiprocessors. They gained practical prominence in the 1990s with the commercialization of multi-level caches, notably in the Intel Pentium Pro processor released in 1995, which featured an on-package L2 cache integrated with the L1. At a basic level, these policies manage eviction and fill operations in multi-level caches by ensuring that data movements between inner and outer levels preserve the intended inclusion relationship, such as propagating evictions between levels to avoid inconsistencies or data loss. This foundational mechanic underpins reliable data flow, allowing outer caches to serve as backing stores while inner caches deliver low-latency hits.

Role in Cache Hierarchies

In multi-level cache hierarchies common to modern processors, the structure typically features private Level 1 (L1) caches per core, which are small (often 32-64 KB) and optimized for low latency, followed by private or shared Level 2 (L2) caches of moderate size (256 KB to 1 MB per core), and a shared last-level cache (LLC, usually L3) that is significantly larger (several MB to tens of MB) to serve multiple cores. Inclusion policies operate between these levels, dictating whether the contents of inner caches (L1 and L2) must be subsets of the outer caches (L2 or LLC), thereby influencing data placement and movement across the hierarchy. This setup balances speed, capacity, and sharing, with inclusion applied primarily between L1-L2 and L2-LLC to manage locality and coherence in multi-core environments.

Inclusion policies interact closely with replacement policies, affecting how blocks are evicted and handled during cache misses or capacity overflows. For instance, in inclusive hierarchies, replacement algorithms in inner caches must notify outer caches to maintain the superset property, often using counters or inclusion bits to track and enforce data presence, which can complicate eviction decisions compared to non-inclusive setups. In exclusive hierarchies, victim blocks from inner cache evictions are inserted into the next level. In inclusive hierarchies, evicted blocks remain in the outer level without insertion, allowing duplication and aligning with write-back policies, where dirty data must be written back to the outer level to update its copy, preserving coherence without immediate memory writes.

Cache coherence protocols, such as directory-based or snooping mechanisms, are significantly influenced by inclusion policies, which determine where data copies reside and can simplify protocol implementation. In inclusive hierarchies, the LLC serves as a natural snoop filter: because it contains all inner cache data, coherence probes can be limited to the outer level, reducing invalidation traffic across the interconnect, which is essential for scalable multi-core systems. For example, snooping protocols benefit from inclusion by avoiding unnecessary checks in private L1 caches, as any shared data modification can be resolved at the LLC, minimizing coherence overhead compared to exclusive policies that must track unique placements. Directory protocols similarly leverage inclusion to streamline state tracking, though they may incur higher directory storage if non-inclusive overlaps occur.

Capacity implications of inclusion policies highlight trade-offs in effective storage utilization within the hierarchy. Inclusive setups require that outer cache capacity exceed the aggregate of the inner caches (e.g., LLC size ≥ sum of all L1 and L2 sizes), leading to duplication that reduces the outer cache's unique capacity, potentially by 20-50% in typical configurations, while exclusive policies eliminate overlap to maximize total unique capacity across levels. This duplication in inclusive designs trades capacity for coherence simplicity: the redundant copies shield inner caches from interference but limit the LLC's ability to hold additional unique data, affecting miss rates in memory-bound workloads.
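
A back-of-the-envelope comparison of unique capacity under inclusive and exclusive policies, using illustrative sizes rather than figures for any particular processor:

l1_kb_per_core = 64
l2_kb_per_core = 512
llc_kb_shared  = 8 * 1024
cores          = 4

inner_total = cores * (l1_kb_per_core + l2_kb_per_core)   # 2304 KB of inner-cache data

# Inclusive LLC: inner-cache contents are duplicated inside the LLC, so the
# hierarchy's unique capacity is just the LLC size.
inclusive_unique = llc_kb_shared

# Exclusive LLC: no overlap, so unique capacity is the sum of all levels.
exclusive_unique = llc_kb_shared + inner_total

print(inclusive_unique, exclusive_unique)                   # 8192 KB vs 10496 KB
print(f"duplication overhead: {inner_total / llc_kb_shared:.0%}")   # ~28% of the LLC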

Policy Variants

Inclusive Policy

In an inclusive cache policy, the contents of the inner-level cache (such as L1) form a strict subset of the outer-level cache (such as L2 or the last-level cache, LLC), ensuring that every cache block present in the higher-speed inner cache is duplicated in the slower outer cache. This policy is typically applied between L1 and L2, as well as between L2 and the shared LLC in multi-core processors. On an L1 miss, if the block is found in L2, it is copied into L1 while remaining in L2; if not present in L2, the block is fetched from the LLC or main memory and installed simultaneously in both L1 and L2 to maintain inclusion. L1 evictions do not remove the block from L2, as the outer cache serves as a backing store; however, if the block in L1 is dirty, it is written back to L2 before eviction. In contrast, evictions from L2 or the LLC require invalidating the corresponding block in L1 to enforce the subset property, often involving a check of L1 contents to select a suitable victim in the outer cache.

The primary advantages of inclusive policies stem from their impact on cache coherence in multi-core systems. By guaranteeing that the outer cache holds all inner cache data, the LLC acts as a centralized backing store, simplifying coherence protocols such as snooping, as probes need only target the LLC without redundant checks in private L1 or L2 caches. This reduces inter-core snoop traffic and eases block location during coherence operations, since the presence of a block in any core's inner caches is reflected in the shared LLC, enabling efficient filtering of unnecessary broadcasts.

Despite these benefits, inclusive policies introduce significant challenges related to capacity efficiency and overhead. The mandatory duplication wastes space in the outer cache, as inner cache blocks occupy portions of the LLC that could otherwise hold unique data, resulting in overhead from redundant storage of L1 and L2 contents. Additionally, LLC evictions can force the removal of high-locality blocks from L1 (known as inclusion victims), leading to unnecessary inner-cache misses and increased latency, particularly in workloads with moderate reuse patterns. An early real-world implementation of inclusive policies appears in Intel's Nehalem microarchitecture (introduced in 2008), where the shared L3 cache inclusively mirrors all data from the per-core L1 and L2 caches, prioritizing coherence simplicity in multi-core designs.
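
The snoop-filtering property can be illustrated with a small sketch: because every block in a private cache is also in the inclusive LLC, a probe that misses in the LLC never needs to consult the private caches. The structures and names below are illustrative only.

llc = {"A", "B", "C"}                        # shared, inclusive last-level cache
private_l1 = {0: {"A"}, 1: {"B"}, 2: set()}  # per-core L1s, each a subset of the LLC

def probe(block):
    if block not in llc:
        # Inclusion guarantees no private cache can hold the block either.
        return "miss everywhere (no L1 probes needed)"
    hits = [core for core, cache in private_l1.items() if block in cache]
    return f"in LLC; probe cores {hits}" if hits else "in LLC only"

print(probe("D"))   # filtered entirely at the LLC
print(probe("A"))   # resolved by probing only core 0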

Exclusive Policy

In an exclusive cache policy, data blocks present in a higher-level cache, such as L1, are not permitted to coexist in the lower-level cache, like L2, ensuring no overlap between cache levels. This approach operates by treating the lower-level cache as a victim cache: upon an L1 hit, the L2 is not accessed, minimizing unnecessary traffic; L1 evictions push blocks exclusively into L2; and L2 evictions typically bypass L1 to avoid reintroduction. The policy requires precise management of block migrations, including invalidations when data moves between levels, to maintain consistency.

The primary advantage of exclusive policies lies in maximizing effective cache capacity, as the number of unique blocks across levels approximates the sum of the individual cache sizes without duplication. For instance, with an L1 of size $S_1$ and an L2 of size $S_2$, the effective total capacity is approximately $S_1 + S_2$, in contrast with inclusive policies, which duplicate data and reduce usable space. This non-overlapping placement is particularly beneficial in multi-core systems, where it reduces duplication overhead and enhances overall hit rates by better utilizing the aggregate cache space.

However, exclusive policies introduce challenges in coherence and implementation. Maintaining exclusivity complicates coherence protocols, as the lower-level cache lacks copies of upper-level data, often necessitating additional probes to L1 during L2 snoops or broadcasts in multi-core environments to track ownership accurately. This can increase latency for coherence operations and on-chip bandwidth demand, due to frequent data movements on evictions and insertions. Furthermore, the policy demands equal block sizes across levels and precise invalidation mechanisms during migrations, adding hardware complexity and potential power overhead.

Real-world implementations of exclusive policies include AMD processors, such as the K8 architecture introduced in 2003, where the L2 cache serves as an exclusive victim cache for the L1, enhancing capacity without overlap. Similarly, some ARM designs, like the Cortex-A9 processor, support an exclusive L2 mode that can be activated to enforce non-coexistence between L1 and L2, optimizing for embedded and mobile systems.
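
A worked instance of this capacity relation, with illustrative sizes:

S1, S2 = 32, 256                     # KB: per-core L1 and L2 sizes
exclusive_unique = S1 + S2           # no duplication across levels
inclusive_unique = S2                # L1 contents are duplicated inside L2
print(exclusive_unique, inclusive_unique)   # 288 KB vs 256 KB of unique data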

NINE Policy

The non-inclusive non-exclusive (NINE) policy, also known as NCID (non-inclusive cache, inclusive directory), allows flexible data placement in multi-level cache hierarchies without enforcing strict inclusion or exclusion rules between levels. Under this policy, data present in a higher-level cache, such as L1, may or may not reside in a lower-level cache like L2 or L3, making it non-inclusive; conversely, data in the lower-level cache may or may not be present in the higher-level cache, rendering it non-exclusive. To manage this flexibility while maintaining coherence, the policy employs a directory that tracks the presence of cache blocks across levels using tags or core-valid bits, without mandating data replication or back-invalidation cascades. This decoupling of tag and data management enables efficient snoop filtering, where the directory acts as an inclusive tracker of higher-level cache contents even if the actual data in the lower-level cache is selectively allocated.

A key advantage of the NINE policy is its ability to balance effective cache capacity and access latency by minimizing unnecessary replication, thereby avoiding the "inclusion victim" problem, in which evicting a block from a lower-level cache forces its removal from higher levels. It permits controlled sharing of data across cores for coherence purposes, reducing inter-cache traffic compared to strict inclusive designs, while offering adaptability to diverse workloads through mechanisms like dynamic bypassing or quality-of-service (QoS) policies that adjust allocation based on priority. For instance, selective insertion into the lower-level cache (allocating only a subset of incoming blocks) can optimize capacity utilization, yielding gains of up to 45% in certain multi-core benchmarks by enhancing overall efficiency.

However, implementing NINE policies introduces hardware challenges, including increased complexity for per-block tracking via expanded directory structures, which add overhead (approximately 2% area for a 256 KB L2 setup). This can lead to redundant copies if overlap is not tightly controlled, or to additional misses in higher levels due to the unpredictability of data placement without rigid rules. Coherence maintenance relies on the directory's accuracy, demanding precise management to prevent inconsistencies in multi-core environments.

The degree of overlap between cache levels in a NINE policy can be quantified by the overlap ratio, defined as:

$$\text{Overlap ratio} = \frac{\text{shared blocks}}{\text{total unique blocks}}$$

This ratio typically satisfies $0 < \text{ratio} < 1$, allowing optimization through techniques such as prefetching to anticipate shared data or bypassing to evict low-utility blocks, thereby tuning effective capacity without full exclusion. An example of NINE adoption appears in modern Intel processors, such as Skylake-SP (2017) and Skylake-X (2017), where the L3 cache operates as a non-inclusive victim cache with an inclusive directory to mitigate the overhead of prior inclusive designs in large-core chip multiprocessors, improving scalability for server workloads.
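
A direct computation of this overlap ratio for two illustrative sets of cached blocks:

l1_blocks = {"A", "B", "C"}
l2_blocks = {"B", "C", "D", "E"}

shared = l1_blocks & l2_blocks            # blocks cached at both levels
unique = l1_blocks | l2_blocks            # all distinct blocks in the hierarchy
overlap_ratio = len(shared) / len(unique)

print(overlap_ratio)    # 0.4, strictly between 0 (nothing shared) and 1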

Implications and Comparisons

Coherence and Performance Trade-offs

Cache inclusion policies significantly influence the design and efficiency of cache coherence protocols in multi-core systems. Inclusive policies simplify coherence mechanisms, particularly in broadcast snooping setups, by designating the last-level cache (LLC) as authoritative for all data present in private higher-level caches; this allows coherence actions to be resolved at the LLC without needing to probe private caches directly, reducing implementation complexity. In contrast, exclusive policies demand more intricate point-to-point invalidation protocols, as data blocks cannot reside in both private and shared caches simultaneously, necessitating explicit tracking and messaging to maintain consistency across levels without duplication. Non-inclusive non-exclusive (NINE) policies adopt a hybrid approach, often incorporating directory-based structures for partial sharing tracking, which balances flexibility but introduces variable coherence-traffic overhead compared to stricter inclusive or exclusive schemes.

Performance metrics reveal distinct trade-offs across these policies. Inclusive policies tend to exhibit higher miss rates in higher-level caches due to inclusion victims (blocks evicted from the LLC that must also be invalidated in private caches), amplifying dependency on lower-level locality. Exclusive policies mitigate capacity waste from duplication, yielding lower overall miss rates and faster effective hit latencies through maximized unique storage, though they may incur higher eviction rates in shared levels. NINE policies offer variable latency profiles, with hits potentially quicker in private caches but more unpredictable due to non-strict inclusion. Power consumption is elevated in inclusive designs because data duplication increases tag array checks and coherence overhead, whereas exclusive and NINE approaches can reduce dynamic power through optimized data placement, at the cost of additional protocol logic.

These impacts can be modeled through simple relations, such as miss rate as a function of inclusion type and workload locality: for inclusive policies, the effective miss rate approximates a base miss rate plus the product of the LLC eviction rate and the private cache's dependency on the evicted blocks (i.e., miss rate ≈ base miss rate + (L2 eviction rate × L1 dependency)), highlighting how inclusion victims propagate upward. Exclusive policies shift this toward lower base miss rates via non-duplication, while NINE introduces locality-dependent variability. Simulations from 2010s studies demonstrate these effects: exclusive policies improve instructions per cycle (IPC) by 5-10% in capacity-bound workloads by enhancing effective cache utilization, outperforming inclusive baselines in single-core scenarios. In multi-core environments, NINE policies reduce coherence traffic by up to 20% through hybrid mechanisms, aiding scalability in shared-memory systems. Trends since the mid-2010s emphasize power efficiency gains in NINE designs for multi-core scaling, addressing limitations in earlier inclusive-heavy architectures.
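
Plugging illustrative numbers into this relation (the values are made up for demonstration, not measurements):

base_miss_rate   = 0.05   # inner-cache miss rate ignoring inclusion effects
l2_eviction_rate = 0.02   # fraction of accesses that trigger an outer-level eviction
l1_dependency    = 0.50   # fraction of those evictions hitting a block still live in L1

# Inclusive: back-invalidations turn some outer-level evictions into extra inner misses.
inclusive_miss_rate = base_miss_rate + l2_eviction_rate * l1_dependency
print(f"{inclusive_miss_rate:.3f}")   # 0.060, versus the 0.050 base rate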

Real-World Implementations

Intel's early-2010s processors, such as the Sandy Bridge architecture introduced in 2011, employed an inclusive policy for the last-level cache (LLC), ensuring that all data in the L1 and L2 caches was also present in the L3 to simplify coherence in multi-core systems. This approach persisted through architectures like Ivy Bridge but began evolving with the Skylake family in 2015, where Intel shifted to a non-inclusive LLC in server variants like Skylake-X to improve cache utilization in high-core-count (8+ core) chips by reducing duplication and allowing more unique data storage. Subsequent server generations retained this non-inclusive design, often termed non-inclusive non-exclusive (NINE), to balance coherence overhead with capacity efficiency in scaling multi-core environments. In client processors, starting in 2022, Intel introduced a dynamic inclusive/non-inclusive (INI) mechanism for the L3 cache, allowing the cache to switch modes based on workload behavior to optimize for both single-threaded and multi-threaded performance. This hybrid approach continued in later generations like Arrow Lake in 2024.

AMD's Zen microarchitecture, debuting in 2017 with Ryzen processors, features an inclusive policy between the L1 and L2 caches per core, where the 512 KB L2 includes copies of L1 data to streamline access patterns, while the shared L3 operates as a victim cache that is mostly exclusive of the L2 to maximize effective capacity across core complexes. In server-oriented EPYC processors based on Zen, this L2-L3 exclusivity aids multi-socket coherence by minimizing data duplication in the larger 32 MB+ L3 slices per chiplet, though some variants incorporate inclusive elements at the L2-L3 boundary to support efficient snoop filtering in NUMA configurations. Later Zen iterations, such as Zen 2, Zen 3, Zen 4, and Zen 5, maintained this hybrid approach, optimizing for both single-thread performance and multi-core scalability in data center workloads.

Early ARM designs, like the Cortex-A9, supported an optional exclusive mode for the L2 cache relative to the L1, configurable to avoid duplication and enhance capacity in embedded multi-core systems, particularly when paired with external L2 controllers. Modern ARM-based implementations in heterogeneous big.LITTLE configurations, such as Apple's M1 SoC released in 2020, adopt a NINE policy for the system-level cache (SLC), which is non-inclusive of the CPU private caches to accommodate varying access patterns from the high-performance Firestorm and efficiency Icestorm cores, while remaining inclusive toward GPU caches for unified memory benefits. This design facilitates heterogeneous caching in mobile and edge devices, reducing latency in mixed workloads.

Industry trends since the mid-2010s show a shift toward NINE and hybrid policies driven by the proliferation of many-core (16+ core) processors, as inclusive designs suffer from capacity duplication that hampers scaling in bandwidth-constrained hierarchies; studies indicate non-inclusive approaches can yield 10-20% better multi-core performance in shared-cache scenarios by improving hit rates and reducing coherence traffic. Post-2015 developments, including Intel's NCID (non-inclusive cache, inclusive directory) in server chips and dynamic INI in client chips, underscore this evolution, with ongoing research emphasizing hybrids for future many-core systems.

In practice, inclusive policies face challenges from inclusion victims (data evicted from the LLC that must also be removed from private caches), leading designers to implement bypassing mechanisms in 2019-era optimizations, such as selective prefetch filtering and victim-aware insertion policies, to mitigate unnecessary evictions and sustain performance in inclusive setups.

References

  1. https://en.wikichip.org/wiki/amd/microarchitectures/zen