Cache coherence

In computer architecture, cache coherence is the uniformity of shared resource data that is stored in multiple local caches. In a cache coherent system, if multiple clients have a cached copy of the same region of a shared memory resource, all copies are the same. Without cache coherence, a change made to the region by one client may not be seen by others, and errors can result when the data used by different clients is mismatched.[1]
A cache coherence protocol is used to maintain cache coherency. The two main types are snooping and directory-based protocols.
Cache coherence is of particular relevance in multiprocessing systems, where each CPU may have its own local cache of a shared memory resource.

Overview
In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of shared data: one copy in the main memory and one in the local cache of each processor that requested it. When one of the copies of data is changed, the other copies must reflect that change. Cache coherence is the discipline which ensures that the changes in the values of shared operands (data) are propagated throughout the system in a timely fashion.[2]
The following are the requirements for cache coherence:[3]
- Write Propagation
- Changes to the data in any cache must be propagated to other copies (of that cache line) in the peer caches.
- Transaction Serialization
- Reads/Writes to a single memory location must be seen by all processors in the same order.
Theoretically, coherence can be performed at the load/store granularity. However, in practice it is generally performed at the granularity of cache blocks.[4]
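Block granularity means that a protocol tracks coherence state per cache line, not per byte: any two addresses that map to the same line are kept coherent as a single unit. A minimal sketch of this address-to-line mapping, assuming an illustrative 64-byte line size:

```python
LINE_SIZE = 64  # bytes per cache line; a common size, assumed for illustration

def line_address(addr: int) -> int:
    """Return the base address of the cache line containing byte address addr."""
    return addr & ~(LINE_SIZE - 1)

# Two variables only 8 bytes apart fall in the same line, so a coherence
# protocol tracks them as one unit (the root cause of false sharing).
a, b = 0x1000, 0x1008
print(line_address(a) == line_address(b))  # True
```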
Definition
Coherence defines the behavior of reads and writes to a single address location.[3]
In a multiprocessor system, consider that more than one processor has cached a copy of the memory location X. The following conditions are necessary to achieve cache coherence:[5]
- In a read made by a processor P to a location X that follows a write by the same processor P to X, with no writes to X by another processor occurring between the write and the read instructions made by P, X must always return the value written by P.
- In a read made by a processor P1 to location X that follows a write by another processor P2 to X, with no other writes to X made by any processor occurring between the two accesses and with the read and write being sufficiently separated, X must always return the value written by P2. This condition defines the concept of coherent view of memory. Propagating the writes to the shared memory location ensures that all the caches have a coherent view of the memory. If processor P1 reads the old value of X, even after the write by P2, we can say that the memory is incoherent.
The above conditions satisfy the Write Propagation criteria required for cache coherence. However, they are not sufficient as they do not satisfy the Transaction Serialization condition. To illustrate this better, consider the following example:
A multiprocessor system consists of four processors, P1, P2, P3 and P4, all containing cached copies of a shared variable S whose initial value is 0. Processor P1 changes the value of S (in its cached copy) to 10, after which processor P2 changes the value of S in its own cached copy to 20. If we ensure only write propagation, then P3 and P4 will certainly see the changes made to S by P1 and P2. However, P3 may see the change made by P1 after seeing the change made by P2 and hence return 10 on a read to S. P4, on the other hand, may see the changes made by P1 and P2 in the order in which they are made and hence return 20 on a read to S. The processors P3 and P4 now have an incoherent view of the memory.
Therefore, in order to satisfy Transaction Serialization, and hence achieve Cache Coherence, the following condition along with the previous two mentioned in this section must be met:
- Writes to the same location must be sequenced. In other words, if location X received two different values A and B, in this order, from any two processors, the processors can never read location X as B and then read it as A. The location X must be seen with values A and B in that order.[6]
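The four-processor example above can be simulated to show why write propagation alone is insufficient: without serialization, each reader may receive the two writes in a different delivery order. A minimal sketch (all names illustrative):

```python
import itertools

def final_value(writes, delivery_order):
    """Value a reader ends up with after propagated writes arrive
    in the given per-reader delivery order."""
    value = 0  # initial value of the shared variable S
    for i in delivery_order:
        value = writes[i]
    return value

writes = [10, 20]  # P1 writes 10, then P2 writes 20
# With write propagation alone, a reader may receive the two writes in
# either order, so P3 and P4 can disagree on the final value of S.
outcomes = {final_value(writes, order)
            for order in itertools.permutations(range(len(writes)))}
print(sorted(outcomes))  # [10, 20]
```

Transaction serialization rules out exactly the orderings that produce divergent final values.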
The alternative definition of a coherent system is via the definition of sequential consistency memory model: "the cache coherent system must appear to execute all threads’ loads and stores to a single memory location in a total order that respects the program order of each thread".[4] Thus, the only difference between the cache coherent system and sequentially consistent system is in the number of address locations the definition talks about (single memory location for a cache coherent system, and all memory locations for a sequentially consistent system).
Another definition is: "a multiprocessor is cache consistent if all writes to the same memory location are performed in some sequential order".[7]
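The serialization requirement in these definitions can be checked mechanically: every processor's observation log for a location must agree on the relative order of the writes. A sketch, assuming each log lists the distinct values a processor observed, in order:

```python
def is_coherent(observations):
    """Check a necessary coherence condition: every processor observed
    the writes to a single location in the same relative order."""
    ref = observations[0]
    for obs in observations[1:]:
        pos = {v: i for i, v in enumerate(obs)}
        for i in range(len(ref)):
            for j in range(i + 1, len(ref)):
                a, b = ref[i], ref[j]
                # a pair ordered a-before-b in ref must not be inverted in obs
                if a in pos and b in pos and pos[a] > pos[b]:
                    return False
    return True

# The earlier example: writes of 10 then 20, seen in opposite orders.
print(is_coherent([[10, 20], [10, 20]]))  # True
print(is_coherent([[10, 20], [20, 10]]))  # False
```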
Rarely, but especially in algorithms, coherence can instead refer to the locality of reference. Multiple copies of the same data can exist in different caches simultaneously, and if processors are allowed to update their own copies freely, an inconsistent view of memory can result.
Coherence mechanisms
The two most common mechanisms of ensuring coherency are snooping and directory-based, each with its own benefits and drawbacks.[8] Snooping-based protocols tend to be faster, if enough bandwidth is available, since all transactions are a request/response seen by all processors. The drawback is that snooping is not scalable. Every request must be broadcast to all nodes in a system, meaning that as the system gets larger, the size of the (logical or physical) bus and the bandwidth it provides must grow. Directories, on the other hand, tend to have longer latencies (with a 3-hop request/forward/respond) but use much less bandwidth since messages are point-to-point and not broadcast. For this reason, many of the larger systems (>64 processors) use this type of cache coherence.
Snooping
- First introduced in 1983,[9] snooping is a process where the individual caches monitor address lines for accesses to memory locations that they have cached.[5] The write-invalidate protocols and write-update protocols make use of this mechanism.
- For the snooping mechanism, a snoop filter reduces the snooping traffic by maintaining a set of entries, each representing a cache line that may be owned by one or more nodes. When one of the entries must be replaced, the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries. A temporal or other type of algorithm is used to refine the selection if more than one cache line is owned by the fewest nodes.[10]
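The replacement policy described above can be sketched as follows; the entry fields and the 4-node presence vectors are hypothetical, chosen only to illustrate the fewest-owners selection with an LRU tie-break:

```python
def victim(entries):
    """Pick the snoop-filter entry to evict: fewest owning nodes first
    (popcount of the presence vector), least recently used as tie-break."""
    return min(entries, key=lambda e: (bin(e["presence"]).count("1"),
                                       e["last_used"]))

# Hypothetical filter state for a 4-node system (one presence bit per node).
entries = [
    {"line": 0xA0, "presence": 0b0111, "last_used": 5},  # owned by 3 nodes
    {"line": 0xB0, "presence": 0b0001, "last_used": 9},  # owned by 1 node
    {"line": 0xC0, "presence": 0b0010, "last_used": 2},  # owned by 1 node, older
]
print(hex(victim(entries)["line"]))  # 0xc0
```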
Directory-based
- In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed, the directory either updates or invalidates the other caches with that entry.
Distributed shared memory systems mimic these mechanisms in an attempt to maintain consistency between blocks of memory in loosely coupled systems.[11]
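The directory's bookkeeping can be sketched as a table mapping each block to its current sharers: reads add a sharer, and a write returns the set of caches whose copies must be invalidated. A minimal sketch (class and method names are illustrative, not any real protocol's API):

```python
class Directory:
    """Minimal sketch of a directory tracking which caches hold each block."""
    def __init__(self):
        self.sharers = {}  # block address -> set of cache ids with a copy

    def read(self, cache_id, block):
        # Grant a shared copy and record the requester as a sharer.
        self.sharers.setdefault(block, set()).add(cache_id)

    def write(self, cache_id, block):
        # Invalidate all other copies before granting exclusive access.
        stale = self.sharers.get(block, set()) - {cache_id}
        self.sharers[block] = {cache_id}
        return stale  # caches that must discard their copies

d = Directory()
d.read("P1", 0x40)
d.read("P2", 0x40)
print(d.write("P1", 0x40))  # {'P2'}
```

Because the directory knows the exact sharer set, invalidations go point-to-point to those caches only, rather than being broadcast.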
Coherence protocols
Coherence protocols apply cache coherence in multiprocessor systems. The intention is that two clients must never see different values for the same shared data.
The protocol must implement the basic requirements for coherence. It can be tailor-made for the target system or application.
Protocols can also be classified as snoopy or directory-based. Typically, early systems used directory-based protocols, where a directory would keep track of the data being shared and the sharers. In snoopy protocols, the transaction requests (to read, write, or upgrade) are sent out to all processors. All processors snoop the request and respond appropriately.
Write propagation in snoopy protocols can be implemented by either of the following methods:
- Write-invalidate
- When a write operation is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location, which forces a read from main memory of the new value on its next access.[5]
- Write-update
- When a write operation is observed to a location that a cache has a copy of, the cache controller updates its own copy of the snooped memory location with the new data.
If the protocol design states that whenever any copy of the shared data is changed, all the other copies must be "updated" to reflect the change, then it is a write-update protocol. If the design states that a write to a cached copy by any processor requires other processors to discard or invalidate their cached copies, then it is a write-invalidate protocol.
However, scalability is one shortcoming of broadcast protocols.
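The two propagation policies can be contrasted with a small simulation in which each processor's cache is a simple address-to-value map (the function and processor names are illustrative):

```python
def snoop_write(caches, writer, addr, value, policy):
    """Apply one write under a snoopy protocol; caches maps each
    processor to its local {address: value} store."""
    caches[writer][addr] = value
    for pid, cache in caches.items():
        if pid != writer and addr in cache:
            if policy == "invalidate":
                del cache[addr]      # peer must refetch on its next access
            else:                    # "update"
                cache[addr] = value  # peer's copy refreshed in place

caches = {"P1": {0x40: 1}, "P2": {0x40: 1}}
snoop_write(caches, "P1", 0x40, 2, "invalidate")
print(caches)  # {'P1': {64: 2}, 'P2': {}}

caches = {"P1": {0x40: 1}, "P2": {0x40: 1}}
snoop_write(caches, "P1", 0x40, 2, "update")
print(caches)  # {'P1': {64: 2}, 'P2': {64: 2}}
```

Invalidation trades a refetch on the peer's next access for cheaper writes; update trades extra bus traffic on every write for cheap subsequent reads.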
Various models and protocols have been devised for maintaining coherence, such as MSI, MESI (aka Illinois), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocol.[2] In 2011, ARM Ltd proposed the AMBA 4 ACE[12] for handling coherency in SoCs. The AMBA CHI (Coherent Hub Interface) specification[13] from ARM Ltd, which belongs to the AMBA 5 group of specifications, defines the interfaces for the connection of fully coherent processors.
References
[edit]- ^ Marowka, Ami (2010-01-01). "Chapter 2 - Pitfalls and Issues of Manycore Programming". Advances in Computers. Vol. 79. Elsevier. pp. 71–117. doi:10.1016/s0065-2458(10)79002-1. ISBN 978-0-12-381027-4.
- ^ a b Thomadakis, Michael E. (2011). The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms (PDF). Texas A&M University. p. 30. Archived from the original (PDF) on 2014-08-11.
- ^ a b Solihin, Yan. Fundamentals of Parallel Multicore Architecture. OCLC 884540034.
- ^ a b Sorin, Daniel J.; Hill, Mark D.; Wood, David Allen (2011-01-01). A primer on memory consistency and cache coherence. Morgan & Claypool Publishers. OCLC 726930429.
- ^ a b c Patterson and Hennessy. Computer Organization and Design - 4th Edition. ISBN 978-0-12-374493-7.
- ^ Neupane, Mahesh (April 16, 2004). "Cache Coherence" (PDF). Archived from the original (PDF) on 20 June 2010.
- ^ Steinke, Robert C.; Nutt, Gary J. (2004-09-01). "A Unified Theory of Shared Memory Consistency". J. ACM. 51 (5): 800–849. arXiv:cs/0208027. doi:10.1145/1017460.1017464. ISSN 0004-5411. S2CID 3206071.
- ^ Patterson, David A.; Hennessy, John L. (1990). Computer Architecture A Quantitative Approach. Morgan Kaufmann Publishers. pp. 467–468. ISBN 1-55860-069-8.
- ^ "Ravishankar, Chinya; Goodman, James (February 28, 1983). "Cache Implementation for Multiple Microprocessors"" (PDF). Proceedings of IEEE COMPCON: 346–350.
- ^ Rasmus Ulfsnes (June 2013). "Design of a Snoop Filter for Snoop-Based Cache Coherency Protocols" Archived 2014-02-01 at the Wayback Machine (PDF). diva-portal.org. Norwegian University of Science and Technology. Retrieved 2014-01-20.
- ^ "Lecture 18: Snooping vs. Directory Based Coherency" (PDF). Berkeley.edu. Retrieved 14 May 2023.
- ^ Kriouile (16 September 2013). Formal Analysis of the ACE Specification for Cache Coherent Systems-on-Chip. In Formal Methods for Industrial Critical Systems. Springer Berlin Heidelberg. ISBN 978-3-642-41010-9.
- ^ Arm Ltd. "AMBA | AMBA 5". Arm Developer. Retrieved 2021-04-27.
Further reading
- Patterson, David; Hennessy, John (2009). Computer Organization and Design (4th ed.). Morgan Kaufmann. ISBN 978-0-12-374493-7.
- Handy, Jim (1998). The Cache Memory Book (2nd ed.). Morgan Kaufmann. ISBN 9780123229809.
- Sorin, Daniel; Hill, Mark; Wood, David (2011). A Primer on Memory Consistency and Cache Coherence (PDF). Morgan and Claypool. ISBN 978-1608455645. Retrieved 20 October 2017.
- Steinke, Robert C.; Nutt, Gary J. (1 September 2004). "A unified theory of shared memory consistency". Journal of the ACM. 51 (5): 800–849. arXiv:cs/0208027. doi:10.1145/1017460.1017464. S2CID 3206071.
Cache coherence
Introduction
Overview
Cache coherence is the discipline that ensures shared data stored in multiple local caches remains consistent across processors in a shared-memory multiprocessor system.[2] This uniformity prevents discrepancies that could arise when processors independently cache copies of the same data from main memory.[2] In symmetric multiprocessing (SMP) systems, where each processor maintains a private cache to exploit data locality, cache coherence plays a critical role in preventing data inconsistencies by synchronizing updates across all relevant cache copies.[2] Without it, processors might operate on stale or divergent versions of shared data, leading to incorrect computations in parallel applications.[2] Coherence is typically maintained at the granularity of a cache block, or line, which serves as the atomic unit for data transfers between caches and main memory.[2] This approach enables fast local access to data while upholding global consistency, often handling over 95% of memory requests entirely within caches to reduce average access latency and bus traffic.[2] A key challenge in cache coherence arises from bandwidth overhead in bus-based systems, where maintaining consistency may necessitate broadcasting updates or invalidations to all processors, potentially limiting scalability.[2]
Historical Development
The cache coherence problem emerged in the late 1970s as multiprocessor systems began incorporating private caches per processor to improve performance, leading to inconsistencies where different processors could hold stale copies of shared data. Early parallel machines like the ILLIAC IV, operational in 1972, operated as SIMD array processors without private caches, avoiding coherence issues but highlighting the need for scalable memory access in parallel computing. By the late 1970s, the first formal recognition of multicache coherence challenges appeared, with proposals for directory-based tracking of cache states to ensure consistency without broadcasts. Initial symmetric multiprocessors (SMPs) in the early 1980s, such as those from Encore and early commercial systems, often lacked hardware coherence mechanisms, relying instead on software flushes or write-through policies that sacrificed performance for correctness. Snooping protocols were introduced in 1983 through the IEEE Futurebus specification, which defined a standard bus supporting cache consistency via broadcast snooping, where caches monitor bus transactions to invalidate or update local copies. This approach enabled efficient hardware enforcement of coherence in bus-based SMPs. 
Commercial adoption followed with systems like the Sequent Symmetry in 1987, a 20-processor i386-based SMP that implemented snooping for copy-back caches, demonstrating practical scalability for shared-memory multiprocessing.[2] The late 1980s saw the development of the MSI (Modified, Shared, Invalid) protocol, a foundational snooping-based invalidate scheme that reduced bus traffic by distinguishing modified blocks from shared ones, as evaluated in early multiprocessor simulations.[2] An optimization, the MESI (Modified, Exclusive, Shared, Invalid) protocol, emerged in the early 1990s and was integrated into Intel processors like the Pentium, adding an exclusive state to avoid unnecessary invalidations and improve bandwidth efficiency in multi-core designs. As bus-based snooping struggled with scalability beyond dozens of processors, the 1990s marked a shift to directory-based protocols, exemplified by the Stanford DASH project in 1991, which used distributed directories to track cache block ownership via point-to-point messages, enabling coherent shared memory for up to 64 processors with latencies around 120 cycles for remote accesses.[3] Post-2020 developments have focused on heterogeneous and multi-socket systems, with ARM's AMBA CHI specification reaching Issue E in 2021 to support scalable coherent interconnects across diverse accelerators and CPUs, incorporating features like atomic operations and flow control for chiplet-based designs.[4] Multi-socket AMD EPYC processors, such as the 7003 series (Milan, 2021), employ directory-based coherence over Infinity Fabric links to maintain consistency across up to 64 cores per socket.[5] Similarly, Intel Xeon processors like Sapphire Rapids (2023) integrate MESIF extensions with directory protocols for multi-socket scalability, supporting up to 8 sockets while minimizing coherence overhead in cloud and HPC environments.[6]
Fundamentals
Cache Systems in Multiprocessors
In multiprocessor systems, particularly multi-core CPUs, each processing core typically features private L1 and L2 caches dedicated to its local computations, while higher-level caches such as L3 are shared among multiple cores to improve overall access efficiency.[7] The private L1 cache, split into instruction and data subsets, provides the fastest access for frequently used data, whereas the L2 cache offers larger capacity for slightly less urgent locality.[8] Shared L3 caches, often serving as the last-level cache (LLC), aggregate resources across cores to reduce latency for inter-core data sharing and minimize trips to main memory.[9] The cache hierarchy in these systems is organized into multiple levels, with L1 being the smallest and fastest (typically 32-64 KB per core), L2 intermediate (256 KB to 2 MB per core), and L3 the largest (8-128 MB shared or more).[10] Caches employ set-associative mapping, where associativity determines the number of cache lines per set to balance hit rates and hardware complexity. Cache blocks, or lines, are standardized at 64 bytes to align with memory bus widths and typical data access patterns.[11] Multiprocessors operate under shared memory models, primarily uniform memory access (UMA) or non-uniform memory access (NUMA). 
In UMA systems, all processors access the shared memory with equal latency via a common interconnect, simplifying design but limiting scalability.[12] NUMA architectures, common in larger systems, assign memory modules locally to processor nodes, resulting in faster access to local memory but higher latency for remote accesses, which influences cache design for locality optimization.[13] Cores interact with memory hierarchically: a core first checks its private L1 cache for reads or writes; misses propagate to L2, then to shared L3, and finally to main memory if needed.[14] This process enables data replication, where the same memory block can reside in multiple caches simultaneously to exploit spatial and temporal locality and reduce bus contention.[14] Such replication enhances performance but introduces the potential for inconsistencies across caches. Inclusion policies dictate how data propagates between levels: inclusive policies require all data in lower-level caches (e.g., L1/L2) to also exist in the higher-level cache (e.g., L3), as seen in many Intel architectures like Nehalem.[15] Exclusive policies, conversely, prohibit overlap, allowing higher utilization of total cache space, which AMD processors often employ in their LLC designs.[15]
The Coherence Problem
In multiprocessor systems equipped with private caches, the cache coherence problem emerges when multiple copies of the same memory block are cached across processors, allowing one processor's modification to go unreflected in others' caches, thereby producing inconsistent data views. This issue stems from the replication of data blocks to exploit locality, but without coordination, it violates the expectation that shared memory provides a uniform value for any location at any time.[1] A fundamental scenario illustrating this problem involves two processors accessing a shared variable A initialized to 42 in main memory. Processor P1 reads A into its cache, obtaining 42. Subsequently, P1 writes 43 to A, updating its local cache copy while leaving main memory unchanged under a write-back policy. Meanwhile, processor P2 reads A into its cache before P1's write, also obtaining 42, and later rereads A from its cache, retrieving the stale 42 instead of the updated 43. This demonstrates how P2 observes an outdated value despite P1's intervening write.[1] To highlight the progression of cache states in this two-processor example without coherence mechanisms, consider the operations on a shared block containing A (initially 42 in memory), assuming a basic state model where blocks can be invalid (I), shared (S), or modified (M/dirty):

| Operation | P1 Cache State for Block | P2 Cache State for Block | Main Memory |
|---|---|---|---|
| Initial | I | I | 42 |
| P1 reads A | S, 42 | I | 42 |
| P2 reads A | S, 42 | S, 42 | 42 |
| P1 writes 43 to A | M, 43 (dirty) | S, 42 | 42 |
| P2 rereads A | M, 43 (dirty) | S, 42 (hit, stale) | 42 |
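The stale-read progression in the table can be reproduced with a small model of write-back private caches that lack any coherence mechanism (class and variable names are illustrative):

```python
class PrivateCache:
    """Write-back private cache with no coherence mechanism."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> locally cached value

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]             # hit: value may be stale

    def write(self, addr, value):
        self.lines[addr] = value            # write-back: memory not updated

memory = {"A": 42}
p1, p2 = PrivateCache(memory), PrivateCache(memory)
p1.read("A")          # P1 caches 42
p2.read("A")          # P2 caches 42
p1.write("A", 43)     # dirty in P1's cache only
print(p2.read("A"))   # 42
print(memory["A"])    # 42
```

P2's reread hits its own cache and never observes P1's write, exactly the incoherent final row of the table.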
Requirements and Models
Formal Definition
Cache coherence is formally defined as a protocol-level property in shared-memory multiprocessor systems that maintains a consistent view of shared data across multiple private caches, ensuring that the effects of memory operations are uniformly observable by all processors. This definition encompasses two fundamental invariants: the single-writer-multiple-reader (SWMR) invariant and the data-value invariant. The SWMR invariant stipulates that, for any shared memory location, at any logical time, either exactly one processor has permission to both read from and write to it (acting as the sole writer and reader), or multiple processors have read-only permission with no writer allowed.[1] The data-value invariant requires that the value stored in a memory location at the conclusion of a read-only epoch matches the value established at the end of the preceding read-write epoch for that location.[1] These invariants are upheld through three essential requirements: write propagation, serialization, and read reflection. 
Write propagation ensures that every write operation to a shared location eventually becomes visible to all other processors in the system, preventing stale data from persisting indefinitely in remote caches.[1] Serialization imposes a total order on all writes to the same memory location, such that every processor perceives these writes in the identical sequence, thereby avoiding conflicting interpretations of update history.[1] Read reflection guarantees that any read operation following a write to the same location will retrieve either the value from that write or from a later write in the serialized order, ensuring reads accurately reflect the current state.[1] A precise formal statement of cache coherence is as follows: for any memory address $ A $, if a processor executes a write $ W $ to $ A $ followed later by a read $ R $ to $ A $ (possibly on a different processor), then $ R $ must return the value produced by $ W $ or by some write $ W' $ to $ A $ that follows $ W $ in the total serialization order of writes to $ A $; moreover, all reads across processors must observe the writes to $ A $ in this same total order.[1] Cache coherence must be distinguished from cache consistency in a single-processor environment. In uniprocessor systems, cache consistency pertains solely to synchronizing a single cache with main memory, often via mechanisms like write-through or write-back policies to ensure the processor sees its own updates promptly. 
By contrast, cache coherence in multiprocessor systems extends this to coordinate multiple caches, ensuring inter-cache agreement on shared data values and preventing inconsistencies arising from concurrent access by multiple processors.[1] Coherence is maintained at the granularity of cache lines (also called cache blocks), treating these fixed-size units (typically 64 bytes) as the atomic entities for consistency management, rather than finer-grained bytes or words, to optimize hardware efficiency and reduce protocol overhead.[1] Edge cases include uncached memory accesses, where data bypasses caches and interacts directly with main memory; coherence protocols must ensure these do not disrupt invariants, often by routing such accesses through the same serialization points as cached operations. Additionally, I/O coherence addresses direct memory access (DMA) by peripherals, requiring extensions like flush or snoop mechanisms to propagate device writes to processor caches and invalidate stale copies, maintaining system-wide consistency.[1]
Consistency Models
Memory consistency models specify the permissible orderings of memory operations—reads and writes—as observed across multiple processors in a shared-memory multiprocessor system, defining a contract between hardware and software on how operations interleave and become visible.[1] These models ensure that parallel programs execute predictably, but they differ fundamentally from cache coherence, which focuses solely on ensuring a unique value for each memory location at any time, without prescribing inter-operation ordering.[1] Cache coherence provides the foundational visibility of updates (e.g., via invalidation or update protocols), enabling the implementation of various consistency models by guaranteeing that writes propagate correctly, while consistency models build atop this to enforce broader ordering guarantees for correctness in parallel programming.[1] The strongest such model is sequential consistency (SC), introduced by Lamport, which mandates that the results of any execution appear as if all memory operations from all processors occurred atomically in some global total order that respects the program order on each individual processor.[16] Under SC, operations on distinct memory locations must also appear sequentially ordered, preventing reordering that could alter program semantics, such as a processor seeing its own write before another processor's intervening write.[16] This model simplifies programming by mimicking the behavior of a uniprocessor but imposes strict hardware constraints, often requiring full serialization of operations to maintain the global order.[1] Weaker models relax these constraints to enhance performance through reordering and buffering, while still leveraging cache coherence for value visibility. 
Processor consistency, proposed by Goodman, preserves per-processor program order for reads following reads (R→R) and writes following writes (W→W), but allows a read to see an old value even after a write to the same location if the write is from another processor, as long as writes establish a global total order visible to future reads. This permits store buffering, where writes are delayed before becoming globally visible, reducing contention but requiring programmers to use synchronization for cross-processor ordering. Release consistency, developed by Gharachorloo et al., further weakens guarantees by tying ordering to explicit synchronization events like acquires and releases; ordinary memory operations can be reordered freely between synchronization points, while acquire and release operations enforce ordering at those points, allowing aggressive hardware optimizations such as delayed updates. Hardware implements these models via cache coherence protocols combined with mechanisms like store buffers, load bypassing, and memory fences (barriers). For SC, coherence protocols must enforce immediate global visibility and strict serialization, often at high cost; weaker models like processor or release consistency tolerate relaxed propagation, using fences to insert ordering points that flush buffers or stall reordering, ensuring coherence-maintained updates respect synchronization.[1] For example, the x86 architecture's Total Store Order (TSO) model, a variant of processor consistency, allows stores to buffer before becoming globally visible (permitting a load to execute before earlier stores are visible to other processors) but otherwise preserves program order among loads and among stores within a processor, relying on cache coherence for write propagation and on serializing instructions (e.g., MFENCE) for stronger guarantees.
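The effect of TSO store buffering can be shown with the classic store-buffering litmus test, modeled here deterministically (explicit buffers rather than real threads; all names illustrative). Without a fence-like drain, both loads can miss both stores, an outcome sequential consistency forbids:

```python
def store_buffer_litmus(drain_before_loads):
    """Store-buffering litmus test: T0 stores x=1 then loads y,
    T1 stores y=1 then loads x. Under TSO each store may still sit
    in its thread's private buffer when the loads execute."""
    mem = {"x": 0, "y": 0}
    buf0, buf1 = [("x", 1)], [("y", 1)]  # buffered stores, not yet visible

    def drain(buf):                       # models an MFENCE-like flush
        for var, val in buf:
            mem[var] = val
        buf.clear()

    if drain_before_loads:                # fences inserted: SC-like behavior
        drain(buf0)
        drain(buf1)
    r1 = mem["y"]                         # T0's load (its buffer holds only x)
    r2 = mem["x"]                         # T1's load (its buffer holds only y)
    drain(buf0)
    drain(buf1)
    return r1, r2

print(store_buffer_litmus(False))  # (0, 0): allowed by TSO, forbidden by SC
print(store_buffer_litmus(True))   # (1, 1)
```

Note that coherence is never violated here: each location individually has a well-defined write order; only the cross-location ordering differs between the two runs.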
In contrast, ARM's weakly ordered memory model permits extensive reordering of loads and stores across processors unless constrained by explicit barriers (e.g., DMB): cache coherence still ensures update visibility, but programmers must insert barriers to achieve sequential-like behavior, which highlights that coherence alone does not suffice without model-specific synchronization.
Core Mechanisms
Snooping Protocols
Snooping protocols maintain cache coherence in bus-based shared-memory multiprocessor systems by having each cache controller continuously monitor, or "snoop," all transactions on the shared bus for memory addresses that match lines in its cache.[17] This monitoring allows caches to detect relevant read or write requests from other processors and respond accordingly to ensure data consistency across the system.[17] The approach relies on a broadcast medium like a bus, where every transaction is visible to all caches, enabling simple hardware implementation without centralized coordination.[17] Cache controllers in snooping systems typically track line states such as modified (data has been altered and is the only valid copy), shared (data is unchanged and may exist in multiple caches), and invalid (data is no longer valid and must be fetched from memory or another cache).[18] State transitions occur in response to bus events: for example, a processor's read miss triggers a bus read request, which other caches snoop and supply data if in the modified state, transitioning to shared; a write request may invalidate shared copies in other caches to prevent stale data.[18] These transitions ensure that no processor reads outdated data while allowing efficient local access to unmodified lines.[18] Specific protocols like the MSI (Modified-Shared-Invalid) protocol exemplify this by defining precise rules for state changes on bus requests.[18] Snooping protocols operate under two primary bus strategies: write-invalidate and write-update. 
In write-invalidate protocols, a writing processor broadcasts an invalidate signal on the bus before updating its copy, forcing other caches to mark matching lines invalid and refetch on future reads.[18] This minimizes bus traffic for repeated writes by a single processor but can increase misses if sharing is frequent.[18] In contrast, write-update protocols broadcast the new data value to all caches on a write, allowing shared copies to update in place without invalidation, which reduces latency for subsequent reads by others but consumes more bandwidth due to data broadcasts.[18] Write-invalidate is more common in practice due to lower overall traffic in typical workloads.[18] These protocols offer low-latency coherence enforcement in small-scale systems, as snooping enables quick interventions without directory lookups or point-to-point messaging.[19] However, they suffer from bus bandwidth contention caused by frequent broadcasts, limiting scalability to modest numbers of processors, typically 16 to 32, beyond which traffic overwhelms the shared medium.[19] Implementations are prevalent in uniform memory access (UMA) architectures, such as early generations of Intel Xeon processors that used the Front Side Bus (FSB) for multi-socket coherence via snooping.[20]
Directory-Based Protocols
Directory-based protocols address the scalability limitations of snooping protocols by maintaining a centralized or distributed directory that tracks the state and location of each cache block across the system. Unlike snooping mechanisms that rely on broadcasting requests to all caches, directory-based approaches use point-to-point communication to notify only the relevant caches, making them suitable for large-scale non-uniform memory access (NUMA) and distributed shared-memory systems. The directory typically resides with the home memory node for each block and records which caches hold copies of the block (sharers) and in what state, such as shared, exclusive, or modified.[21] The core operation involves consulting the directory on a cache miss. For a read miss, the requesting processor sends a request to the home directory; if the block is clean and shared, it is supplied from memory or forwarded from a sharer, and the directory updates to include the new sharer. For a write miss, the directory identifies all current sharers and sends invalidation messages point-to-point to them, ensuring exclusive access before supplying the block to the requester. This selective notification reduces network traffic compared to broadcasts, though it introduces indirection latency as the home node coordinates responses. Distributed directories, where entries are spread across memory controllers, further enhance scalability by localizing access and avoiding a single point of contention.[22][23] Several variants optimize the directory structure to balance storage overhead and performance. The full-bit vector scheme uses one bit per processor to indicate presence, plus state bits (e.g., a dirty bit), allowing precise tracking of all sharers but incurring high storage costs (e.g., up to 50% overhead for 256 processors). 
Coarse vector variants group processors into clusters and use fewer bits per group, approximating sharer locations to reduce space while tolerating some imprecision. Pointer-based schemes store a limited number of pointers (e.g., 4-8) to exact sharer locations, which is efficient when few caches share blocks but requires overflow handling. When the pointer limit is exceeded, protocols may evict a random pointer (potentially causing unnecessary invalidations), fall back to broadcasting, or use chaining to link additional entries, though this adds complexity and latency. These trade-offs were evaluated in early simulations showing full-bit vectors provide the best accuracy but scale poorly in storage, while limited pointers suit systems with low sharing degrees.[21][22]

Directory-based protocols offer significant advantages in scalability, supporting hundreds of processors without the broadcast storms that limit snoopy systems to dozens, but they suffer from higher access latency due to directory lookups and multi-message exchanges. Seminal implementations include the Stanford DASH multiprocessor (1990), which pioneered distributed full-bit vector directories for up to 64 processors, influencing later designs. In the 1990s, SGI Origin servers employed hybrid bit-vector and coarse-vector directories, scaling to 1024 processors in cc-NUMA configurations with low remote access penalties. Modern examples, such as multi-socket AMD EPYC systems, integrate directory-based coherence over Infinity Fabric links, using probe filters in the L3 cache to track inter-socket sharers efficiently in MOESI protocols.[24]

Specific Protocols
Invalidation-Based Protocols
Invalidation-based protocols maintain cache coherence by invalidating copies of a cache line in other caches when a processor writes to it, ensuring that only the writing cache has a valid copy. This approach contrasts with update-based methods by avoiding the broadcast of write data, thereby conserving bandwidth in shared-bus systems. These protocols typically operate in snoopy implementations, where caches monitor bus transactions to update their states accordingly.[1]

The MSI (Modified, Shared, Invalid) protocol is a foundational invalidation-based scheme using three stable states for each cache line: Invalid (I), indicating the line is uninitialized or stale; Shared (S), indicating the line is clean and possibly replicated across multiple caches; and Modified (M), indicating the line is dirty and uniquely held by one cache.

Transitions occur on processor requests (PrRd for read, PrWr for write) or bus actions (BusRd for shared reads, BusRdX for exclusive reads/writes, BusUpgr for upgrades). On a read hit, the state remains unchanged whether in S or M. A read miss from I fetches the line via BusRd, transitioning to S. A write hit in S issues BusUpgr to invalidate other copies, moving to M; in M, it stays M without bus action. A write miss from I issues BusRdX, fetching the line and invalidating other copies to enter M. On eviction or write-back from M, the line transitions to I after supplying its data via BusWr.[1]

The following table summarizes key state transitions in the MSI protocol for a snoopy bus implementation:

| Event | Current State | Action/Bus Request | New State | Notes |
|---|---|---|---|---|
| PrRd (read hit) | S | None | S | Data from cache |
| PrRd (read hit) | M | None | M | Data from cache |
| PrRd (read miss) | I | BusRd | S | Fetch from memory or sharer |
| PrWr (write hit) | S | BusUpgr | M | Invalidate other caches |
| PrWr (write hit) | M | None | M | Update cache only |
| PrWr (write miss) | I | BusRdX | M | Fetch, invalidate others |
| Eviction/Write-back | M | BusWr | I | Supply dirty data to bus |
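The MSI transition table above can be expressed directly as a lookup table. The sketch below is illustrative only: it maps a (state, event) pair to the bus action issued and the resulting state, mirroring the rows of the table.

```python
# The MSI transition table as a Python dictionary. Keys are
# (current_state, event); values are (bus_action, new_state).
# None means no bus transaction is required.

MSI_TRANSITIONS = {
    ("S", "PrRd"):  (None,      "S"),  # read hit, data from cache
    ("M", "PrRd"):  (None,      "M"),  # read hit, data from cache
    ("I", "PrRd"):  ("BusRd",   "S"),  # read miss: fetch from memory or a sharer
    ("S", "PrWr"):  ("BusUpgr", "M"),  # write hit: invalidate other caches
    ("M", "PrWr"):  (None,      "M"),  # write hit: update cache only
    ("I", "PrWr"):  ("BusRdX",  "M"),  # write miss: fetch and invalidate others
    ("M", "Evict"): ("BusWr",   "I"),  # eviction: write dirty data back
}

def step(state, event):
    """Look up the bus action and next state for one processor event."""
    action, new_state = MSI_TRANSITIONS[(state, event)]
    return action, new_state

print(step("I", "PrWr"))  # ('BusRdX', 'M')
```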
The MESI protocol extends MSI with an Exclusive (E) state: a clean line held by exactly one cache. A write hit in E upgrades silently to M with no bus transaction, and a clean line in E can be evicted without a write-back. The following table summarizes the transitions that involve the E state:

| Event | Current State | Action/Bus Request | New State | Notes |
|---|---|---|---|---|
| PrRd (read hit) | E | None | E | Data from cache |
| PrRd (read miss) | I | BusRd | E | If no sharers (clean fetch) |
| PrRd (read miss) | I | BusRd | S | If sharers exist |
| PrWr (write hit) | E | None | M | Silent upgrade |
| PrWr (write hit) | S | BusUpgr | M | Invalidate others |
| Snooped BusRd | E | None | S | Downgrade to shared |
| Eviction | E | None (PutE) | I | No data write-back |
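The benefit of the E state is visible in a small sketch: a read miss installs the line in E when no other cache holds it, and the later write then needs no bus traffic at all. The function name `mesi_read_miss` and the dictionary are illustrative, not part of any real API.

```python
# Sketch of the MESI additions above. The Exclusive state enables a
# silent write upgrade (no bus transaction) and a clean eviction
# (no write-back). All names here are hypothetical.

MESI_E_TRANSITIONS = {
    ("E", "PrRd"):    (None, "E"),  # read hit, data from cache
    ("E", "PrWr"):    (None, "M"),  # silent upgrade: no bus traffic needed
    ("E", "SnoopRd"): (None, "S"),  # another cache reads: downgrade to shared
    ("E", "Evict"):   (None, "I"),  # line is clean: drop without write-back
}

def mesi_read_miss(other_sharers_exist):
    """On a read miss from I, install in S if sharers exist, else in E."""
    return ("BusRd", "S" if other_sharers_exist else "E")

print(mesi_read_miss(False))  # ('BusRd', 'E')
```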
The MOESI protocol adds an Owned (O) state, which lets a cache hold a dirty line while other caches keep shared copies; the owner supplies the data on snooped reads, deferring the write-back to main memory. The following table summarizes the transitions that involve the O state:

| Event | Current State | Action/Bus Request | New State | Notes |
|---|---|---|---|---|
| PrRd (read hit) | O | None | O | Data from cache |
| PrWr (write hit) | O | BusUpgr | M | Invalidate sharers, then modify |
| Snooped BusRd | M | BusWr (supply data) | O | Share dirty data, retain ownership |
| Snooped BusRd | O | BusWr (supply data) | O | Service from owner |
| PrRd (read miss) | I | BusRd | S | Owner supplies dirty data; requester caches a shared copy |
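The snoop-side behaviour in the table can be sketched as a single function: a cache in M or O services a snooped read by supplying its dirty data and retaining ownership, so main memory need not be updated on every sharing event. This is a simplified illustration (the function name is invented, and clean copies are assumed to let memory supply the data).

```python
# Sketch of the Owned-state snoop behaviour described above.
# Returns (supplies_data, new_state) for a snooped BusRd.

def snoop_bus_read(state):
    if state == "M":
        return True, "O"   # share dirty data, become owner (retain ownership)
    if state == "O":
        return True, "O"   # owner keeps servicing reads from its dirty copy
    if state in ("E", "S"):
        return False, "S"  # clean copy: downgrade/stay shared, memory supplies
    return False, "I"      # invalid: nothing to do

print(snoop_bus_read("M"))  # (True, 'O')
```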