CPU cache

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) of accessing data from the main memory.[1] A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations, avoiding the need to always refer to main memory, which may be tens to hundreds of times slower to access.

Cache memory is typically implemented with static random-access memory (SRAM), which requires multiple transistors to store a single bit. This makes it expensive in terms of the area it takes up, and in modern CPUs the cache is typically the largest part by chip area. The size of the cache needs to be balanced with the general desire for smaller chips which cost less. Some modern designs implement some or all of their cache using the physically smaller eDRAM, which is slower to use than SRAM but allows larger amounts of cache for any given amount of chip area.

Most CPUs have a hierarchy of multiple cache levels (L1, L2, often L3, and rarely even L4), with separate instruction-specific (I-cache) and data-specific (D-cache) caches at level 1.[2] The different levels are implemented in different areas of the chip; L1 is located as close to a CPU core as possible and thus offers the highest speed due to short signal paths, but requires careful design. L2 caches are physically separate from the CPU core and operate more slowly, but place fewer demands on the chip designer and can be made much larger without impacting the CPU design. L3 caches are generally shared among multiple CPU cores.

Other types of caches exist (that are not counted towards the "cache size" of the most important caches mentioned above), such as the translation lookaside buffer (TLB) which is part of the memory management unit (MMU) which most CPUs have. Input/output sections also often contain data buffers that serve a similar purpose.

Overview

To access data in main memory, a multi-step process is used and each step introduces a delay. For instance, to read a value from memory in a simple computer system the CPU first selects the address to be accessed by expressing it on the address bus and waiting a fixed time to allow the value to settle. The memory device with that value, normally implemented in DRAM, holds that value in a very low-energy form that is not powerful enough to be read directly by the CPU. Instead, it has to copy that value from storage into a small buffer which is connected to the data bus. The CPU then waits a certain time to allow this value to settle before reading the value from the data bus.

By locating the memory physically closer to the CPU, the time needed for the busses to settle is reduced, and by replacing the DRAM with SRAM, which holds the value in a form that does not require amplification to be read, the delay within the memory itself is eliminated. This makes the cache much faster both to respond and to read or write. SRAM, however, requires anywhere from four to six transistors to hold a single bit, depending on the type, whereas DRAM generally uses one transistor and one capacitor per bit, which makes it able to store much more data for any given chip area.

Implementing some memory in a faster format can lead to large performance improvements. When trying to read from or write to a location in the memory, the processor checks whether the data from that location is already in the cache. If so, the processor will read from or write to the cache instead of the much slower main memory.

Many modern desktop, server, and industrial CPUs have at least three independent levels of caches (L1, L2 and L3) and different types of caches:

  • Translation lookaside buffer (TLB) – used to speed up virtual-to-physical address translation for both executable instructions and data. A single TLB can be provided for access to both instructions and data, or a separate Instruction TLB (ITLB) and data TLB (DTLB) can be provided. However, the TLB cache is part of the memory management unit (MMU) and not directly related to the CPU caches.
  • Instruction cache
      • MicroOp-cache
      • Branch target buffer
      • Instruction cache (I-cache) – used to speed executable instruction fetch
  • Data cache
      • Data cache (D-cache) – used to speed data fetch and store; the data cache is usually organized as a hierarchy of more cache levels (L1, L2, etc.; see also multi-level caches below).

History

Motherboard of a NeXTcube computer (1990). At the lower edge of the image, left of the middle, is the Motorola 68040 CPU, operating at 25 MHz, with two separate on-chip level 1 caches of 4 KiB each, one for instructions and one for data. The board has no external L2 cache.

Early examples of CPU caches include the Atlas 2[3] and the IBM System/360 Model 85[4][5] in the 1960s. The first CPUs that used a cache had only one level of cache; unlike later level 1 cache, it was not split into L1d (for data) and L1i (for instructions). Split L1 cache started in 1976 with the IBM 801 CPU,[6][7] became mainstream in the late 1980s, and in 1997 entered the embedded CPU market with the ARMv5TE. As of 2015, even sub-dollar SoCs split the L1 cache. They also have L2 caches and, for larger processors, L3 caches as well. The L2 cache is usually not split, and acts as a common repository for the already split L1 cache. Every core of a multi-core processor has a dedicated L1 cache, which is usually not shared between the cores. The L2 cache, and lower-level caches, may be shared between the cores. L4 cache is currently uncommon, and is generally dynamic random-access memory (DRAM) on a separate die or chip, rather than static random-access memory (SRAM). An exception to this is when eDRAM is used for all levels of cache, down to L1. Historically, L1 was also on a separate die; however, bigger die sizes have allowed integration of it as well as other cache levels, with the possible exception of the last level. Each extra level of cache tends to be smaller and faster than the lower levels.[8]

Caches (like RAM historically) have generally been sized in powers of two: 2, 4, 8, 16, etc. KiB. At MiB sizes (i.e. for larger non-L1 caches), the pattern broke down very early, to allow for larger caches without being forced into the doubling-in-size paradigm, as with e.g. the Intel Core 2 Duo with 3 MiB L2 cache in April 2008. This happened much later for L1 caches, as their size is generally still a small number of KiB. The IBM zEC12 from 2012 is an exception, however, with an unusually large 96 KiB L1 data cache for its time; the IBM z13 has a 96 KiB L1 instruction cache (and 128 KiB L1 data cache),[9] and Intel Ice Lake-based processors from 2018 have a 48 KiB L1 data cache and 48 KiB L1 instruction cache. In 2020, some Intel Atom CPUs (with up to 24 cores) have (multiples of) 4.5 MiB and 15 MiB cache sizes.[10][11]

Operation

Cache entries

Data is transferred between memory and cache in blocks of fixed size, called cache lines or cache blocks. When a cache line is copied from memory into the cache, a cache entry is created. The cache entry will include the copied data as well as the requested memory location (called a tag).

When the processor needs to read or write a location in memory, it first checks for a corresponding entry in the cache. The cache checks for the contents of the requested memory location in any cache lines that might contain that address. If the processor finds that the memory location is in the cache, a cache hit has occurred. However, if the processor does not find the memory location in the cache, a cache miss has occurred. In the case of a cache hit, the processor immediately reads or writes the data in the cache line. For a cache miss, the cache allocates a new entry and copies data from main memory, then the request is fulfilled from the contents of the cache.
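As a rough illustration of this lookup (a software sketch, not any particular processor's hardware; the structure, field names and sizes below are assumptions), in C:

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS       4      /* illustrative associativity */
    #define LINE_BYTES 64     /* illustrative cache line size */

    /* One cache entry: the tag identifies which memory block is stored. */
    struct cache_line {
        bool     valid;
        uint32_t tag;
        uint8_t  data[LINE_BYTES];
    };

    /* Check one set for the requested tag: returns the hitting way, or -1 on a miss. */
    int lookup(const struct cache_line set[WAYS], uint32_t tag)
    {
        for (int way = 0; way < WAYS; way++) {
            if (set[way].valid && set[way].tag == tag)
                return way;       /* cache hit */
        }
        return -1;                /* cache miss: allocate an entry and fill from memory */
    }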

Policies

Replacement policies

To make room for the new entry on a cache miss, the cache may have to evict one of the existing entries. The heuristic it uses to choose the entry to evict is called the replacement policy. The fundamental problem with any replacement policy is that it must predict which existing cache entry is least likely to be used in the future. Predicting the future is difficult, so there is no perfect method to choose among the variety of replacement policies available. One popular replacement policy, least-recently used (LRU), replaces the least recently accessed entry.
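A minimal sketch of LRU victim selection, assuming each line carries a last-use counter (real hardware typically uses cheaper approximations, such as per-set age bits); the names below are illustrative:

    #include <stdint.h>

    #define WAYS 4

    struct lru_line {
        uint64_t last_used;   /* updated from a global access counter on every hit */
        /* tag, data and flag bits omitted for brevity */
    };

    /* Choose the way whose line was used least recently as the eviction victim. */
    int choose_victim(const struct lru_line set[WAYS])
    {
        int victim = 0;
        for (int way = 1; way < WAYS; way++) {
            if (set[way].last_used < set[victim].last_used)
                victim = way;
        }
        return victim;
    }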

Marking some memory ranges as non-cacheable can improve performance, by avoiding caching of memory regions that are rarely re-accessed. This avoids the overhead of loading something into the cache without having any reuse. Cache entries may also be disabled or locked depending on the context.

Write policies

If data are written to the cache, at some point they must also be written to main memory; the timing of this write is known as the write policy. In a write-through cache, every write to the cache causes a write to main memory. Alternatively, in a write-back or copy-back cache, writes are not immediately mirrored to the main memory; instead, locations that have been written over are marked as dirty and are written back to the main memory only when they are evicted from the cache. For this reason, a read miss in a write-back cache may sometimes require two memory accesses to service: one to first write the dirty location to main memory, and then another to read the new location from memory. Also, a write to a main memory location that is not yet mapped in a write-back cache may evict an already dirty location, thereby freeing that cache space for the new memory location.
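The two policies can be contrasted in a simplified software model; the helper write_to_memory() below is a hypothetical stand-in for the memory interface, not a real API:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64

    struct line { bool valid, dirty; uint32_t tag; uint8_t data[LINE_BYTES]; };

    /* Stub standing in for the bus/memory interface (hypothetical helper). */
    static void write_to_memory(uint32_t addr, const uint8_t *data, size_t n)
    {
        (void)addr; (void)data; (void)n;
    }

    /* Write-through: every write updates both the cached copy and main memory. */
    void store_write_through(struct line *l, uint32_t addr, const uint8_t *src, size_t n)
    {
        memcpy(l->data, src, n);
        write_to_memory(addr, src, n);
    }

    /* Write-back: only the cached copy is updated and marked dirty;
     * main memory is updated later, when the line is evicted. */
    void store_write_back(struct line *l, const uint8_t *src, size_t n)
    {
        memcpy(l->data, src, n);
        l->dirty = true;
    }

    void evict(struct line *l, uint32_t addr)
    {
        if (l->dirty)
            write_to_memory(addr, l->data, LINE_BYTES);   /* flush the dirty line */
        l->valid = false;
        l->dirty = false;
    }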

There are intermediate policies as well. The cache may be write-through, but the writes may be held in a store data queue temporarily, usually so multiple stores can be processed together (which can reduce bus turnarounds and improve bus utilization).

Cached data from the main memory may be changed by other entities (e.g., peripherals using direct memory access (DMA) or another core in a multi-core processor), in which case the copy in the cache may become out-of-date or stale. Alternatively, when a CPU in a multiprocessor system updates data in the cache, copies of data in caches associated with other CPUs become stale. Communication protocols between the cache managers that keep the data consistent are known as cache coherence protocols.

Cache performance

Cache performance measurement has become important in recent times, as the gap between memory speed and processor speed keeps widening. The cache was introduced to reduce this speed gap. Thus, knowing how well the cache is able to bridge the gap between processor and memory speed becomes important, especially in high-performance systems. The cache hit rate and the cache miss rate play an important role in determining this performance. To improve cache performance, reducing the miss rate is one of the necessary steps, among others. Decreasing the access time to the cache also boosts its performance and helps with optimization.

CPU stalls

The time taken to fetch one cache line from memory (read latency due to a cache miss) matters because the CPU will run out of work while waiting for the cache line. When a CPU reaches this state, it is called a stall. As CPUs become faster compared to main memory, stalls due to cache misses displace more potential computation; modern CPUs can execute hundreds of instructions in the time taken to fetch a single cache line from main memory.

Various techniques have been employed to keep the CPU busy during this time, including out-of-order execution in which the CPU attempts to execute independent instructions after the instruction that is waiting for the cache miss data. Another technology, used by many processors, is simultaneous multithreading (SMT), which allows an alternate thread to use the CPU core while the first thread waits for required CPU resources to become available.

Associativity

An illustration of different ways in which memory locations can be cached by particular cache locations

The placement policy decides where in the cache a copy of a particular entry of main memory will go. If the placement policy is free to choose any entry in the cache to hold the copy, the cache is called fully associative. At the other extreme, if each entry in the main memory can go in just one place in the cache, the cache is direct-mapped. Many caches implement a compromise in which each entry in the main memory can go to any one of N places in the cache, and are described as N-way set associative.[12] For example, the level-1 data cache in an AMD Athlon is two-way set associative, which means that any particular location in main memory can be cached in either of two locations in the level-1 data cache.

Choosing the right value of associativity involves a trade-off. If there are ten places to which the placement policy could have mapped a memory location, then to check if that location is in the cache, ten cache entries must be searched. Checking more places takes more power and chip area, and potentially more time. On the other hand, caches with more associativity suffer fewer misses (see conflict misses), so that the CPU wastes less time reading from the slow main memory. The general guideline is that doubling the associativity, from direct mapped to two-way, or from two-way to four-way, has about the same effect on raising the hit rate as doubling the cache size. However, increasing associativity beyond four does not improve hit rate as much,[13] and is generally done for other reasons (see virtual aliasing). Some CPUs can dynamically reduce the associativity of their caches in low-power states, which acts as a power-saving measure.[14]

In order from simpler but worse to more complex but better:

  • Direct mapped cache – good best-case time, but unpredictable in the worst case
  • Two-way set associative cache
  • Two-way skewed associative cache[15]
  • Four-way set-associative cache
  • Eight-way set-associative cache, a common choice for later implementations
  • 12-way set associative cache, similar to eight-way
  • Fully associative cache – the best miss rates, but practical only for a small number of entries

Direct-mapped cache

In this cache organization, each location in the main memory can go in only one entry in the cache. Therefore, a direct-mapped cache can also be called a "one-way set associative" cache. It does not have a placement policy as such, since there is no choice of which cache entry's contents to evict. This means that if two locations map to the same entry, they may continually knock each other out. Although simpler, a direct-mapped cache needs to be much larger than an associative one to give comparable performance, and it is more unpredictable. Let x be the block number in the cache, y be the block number of memory, and n be the number of blocks in the cache; then the mapping is given by the equation x = y mod n.
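In code, this mapping is a simple modulo; the function name below is illustrative:

    #include <stdint.h>

    /* Direct-mapped placement: memory block y can only live in cache block y mod n.
     * When the number of cache blocks is a power of two, this reduces to a bit mask:
     * memory_block & (cache_blocks - 1). */
    uint32_t cache_block_for(uint32_t memory_block, uint32_t cache_blocks)
    {
        return memory_block % cache_blocks;   /* x = y mod n */
    }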

Two-way set associative cache

If each location in the main memory can be cached in either of two locations in the cache, one logical question is: which one of the two? The simplest and most commonly used scheme, shown in the right-hand diagram above, is to use the least significant bits of the memory location's index as the index for the cache memory, and to have two entries for each index. One benefit of this scheme is that the tags stored in the cache do not have to include that part of the main memory address which is implied by the cache memory's index. Since the cache tags have fewer bits, they require fewer transistors, take less space on the processor circuit board or on the microprocessor chip, and can be read and compared faster. Also, the LRU replacement algorithm is especially simple, since only one bit needs to be stored for each pair.
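A minimal sketch of such a two-way set with its single LRU bit (the structure and field names are assumptions, not a real design):

    #include <stdbool.h>
    #include <stdint.h>

    /* One two-way set: a single LRU bit records which of the two ways was used
     * least recently, so eviction needs no further bookkeeping. */
    struct two_way_set {
        uint32_t tag[2];
        bool     valid[2];
        uint8_t  lru;          /* index (0 or 1) of the least recently used way */
    };

    void touch(struct two_way_set *s, int way_hit)
    {
        s->lru = (uint8_t)(way_hit ^ 1);   /* the other way is now least recently used */
    }

    int victim(const struct two_way_set *s)
    {
        return s->lru;                     /* evict the least recently used of the pair */
    }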

Speculative execution

One of the advantages of a direct-mapped cache is that it allows simple and fast speculation. Once the address has been computed, the one cache index which might have a copy of that location in memory is known. That cache entry can be read, and the processor can continue to work with that data before it finishes checking that the tag actually matches the requested address.

The idea of having the processor use the cached data before the tag match completes can be applied to associative caches as well. A subset of the tag, called a hint, can be used to pick just one of the possible cache entries mapping to the requested address. The entry selected by the hint can then be used in parallel with checking the full tag. The hint technique works best when used in the context of address translation, as explained below.

Two-way skewed associative cache

Other schemes have been suggested, such as the skewed cache,[15] where the index for way 0 is direct, as above, but the index for way 1 is formed with a hash function. A good hash function has the property that addresses which conflict with the direct mapping tend not to conflict when mapped with the hash function, and so it is less likely that a program will suffer from an unexpectedly large number of conflict misses due to a pathological access pattern. The downside is extra latency from computing the hash function.[16] Additionally, when it comes time to load a new line and evict an old line, it may be difficult to determine which existing line was least recently used, because the new line conflicts with data at different indexes in each way; LRU tracking for non-skewed caches is usually done on a per-set basis. Nevertheless, skewed-associative caches have major advantages over conventional set-associative ones.[17]
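As a rough illustration, the two ways of a skewed cache could be indexed as below; the XOR-based mixing is only an assumed stand-in for the hash functions proposed in the literature:

    #include <stdint.h>

    #define INDEX_BITS 6
    #define SETS (1u << INDEX_BITS)

    /* Way 0 is indexed directly by the low index bits, as in a conventional cache. */
    uint32_t index_way0(uint32_t block_addr)
    {
        return block_addr & (SETS - 1);
    }

    /* Way 1 is indexed through a hash, here a simple XOR of two bit fields, so that
     * blocks which conflict in way 0 tend not to conflict in way 1. */
    uint32_t index_way1(uint32_t block_addr)
    {
        return (block_addr ^ (block_addr >> INDEX_BITS)) & (SETS - 1);
    }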

Pseudo-associative cache

A true set-associative cache tests all the possible ways simultaneously, using something like a content-addressable memory. A pseudo-associative cache tests each possible way one at a time. A hash-rehash cache and a column-associative cache are examples of a pseudo-associative cache.

In the common case of finding a hit in the first way tested, a pseudo-associative cache is as fast as a direct-mapped cache, but it has a much lower conflict miss rate than a direct-mapped cache, closer to the miss rate of a fully associative cache.[16]

Multicolumn cache

Compared with a direct-mapped cache, a set associative cache has a reduced number of bits for its cache set index, which maps to a cache set in which multiple ways or blocks stay, such as 2 blocks for a 2-way set associative cache and 4 blocks for a 4-way set associative cache. Compared with a direct-mapped cache, the unused cache index bits become a part of the tag bits. For example, a 2-way set associative cache contributes 1 bit to the tag and a 4-way set associative cache contributes 2 bits to the tag. The basic idea of the multicolumn cache[18] is to use the set index to map to a cache set as a conventional set associative cache does, and to use the added tag bits to index a way in the set. For example, in a 4-way set associative cache, the two bits are used to index way 00, way 01, way 10, and way 11, respectively. This double cache indexing is called a "major location mapping", and its latency is equivalent to a direct-mapped access. Extensive experiments in multicolumn cache design[18] show that the hit ratio to major locations is as high as 90%. If a cache mapping conflicts with a cache block in the major location, the existing cache block will be moved to another cache way in the same set, which is called the "selected location". Because the newly indexed cache block is the most recently used (MRU) block, it is placed in the major location in the multicolumn cache in consideration of temporal locality. Since the multicolumn cache is designed for a cache with high associativity, the number of ways in each set is high; thus, it is easy to find a selected location in the set. A selected location index is maintained by additional hardware for the major location in a cache block.[citation needed]

A multicolumn cache retains a high hit ratio due to its high associativity, and has a low latency comparable to that of a direct-mapped cache due to its high percentage of hits in major locations. The concepts of major locations and selected locations in multicolumn caches have been used in several cache designs, such as the ARM Cortex-R chip,[19] Intel's way-predicting cache memory,[20] IBM's reconfigurable multi-way associative cache memory[21] and Oracle's dynamic cache replacement way selection based on address tag bits.[22]

Cache entry structure

Cache row entries usually have the following structure:

tag | data block | flag bits

The data block (cache line) contains the actual data fetched from the main memory. The tag contains (part of) the address of the actual data fetched from the main memory. The flag bits are discussed below.

The "size" of the cache is the amount of main memory data it can hold. This size can be calculated as the number of bytes stored in each data block times the number of blocks stored in the cache. (The tag, flag and error correction code bits are not included in the size,[23] although they do affect the physical area of a cache.)

An effective memory address which goes along with the cache line (memory block) is split (MSB to LSB) into the tag, the index and the block offset.[8][24]

tag | index | block offset

The index describes which cache set the data has been put in. The index length is ⌈log2(s)⌉ bits for s cache sets.

The block offset specifies the desired data within the stored data block within the cache row. Typically the effective address is in bytes, so the block offset length is ⌈log2(b)⌉ bits, where b is the number of bytes per data block. The tag contains the most significant bits of the address, which are checked against all rows in the current set (the set has been retrieved by index) to see if this set contains the requested address. If it does, a cache hit occurs. The tag length in bits is as follows:

tag_length = address_length - index_length - block_offset_length

Some authors refer to the block offset as simply the "offset"[25] or the "displacement".[26][27]
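A short sketch of this split, assuming the bit widths are already known and that the set count and block size are powers of two (so shift-and-mask works); all names are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    struct fields { uint32_t tag, index, offset; };

    /* Split an address into tag | index | block offset (MSB to LSB). */
    struct fields split_address(uint32_t addr, unsigned index_bits, unsigned offset_bits)
    {
        struct fields f;
        f.offset = addr & ((1u << offset_bits) - 1);
        f.index  = (addr >> offset_bits) & ((1u << index_bits) - 1);
        f.tag    = addr >> (offset_bits + index_bits);
        return f;
    }

    int main(void)
    {
        /* 32 sets (5 index bits) and 64-byte blocks (6 offset bits),
         * as in the Pentium 4 L1 example below. */
        struct fields f = split_address(0x12345678u, 5, 6);
        printf("tag=0x%x index=%u offset=%u\n",
               (unsigned)f.tag, (unsigned)f.index, (unsigned)f.offset);
        return 0;
    }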

Example

The original Pentium 4 processor had a four-way set associative L1 data cache of 8 KiB in size, with 64-byte cache blocks. Hence, there are 8 KiB / 64 = 128 cache blocks. The number of sets is equal to the number of cache blocks divided by the number of ways of associativity, which leads to 128 / 4 = 32 sets, and hence 2^5 = 32 different indices. There are 2^6 = 64 possible offsets. Since the CPU address is 32 bits wide, this implies 32 − 5 − 6 = 21 bits for the tag field.

The original Pentium 4 processor also had an eight-way set associative integrated L2 cache, 256 KiB in size, with 128-byte cache blocks. This implies 32 − 8 − 7 = 17 bits for the tag field.[25]
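The arithmetic of this example can be checked mechanically; the following sketch (with assumed helper names) derives the field widths from the cache geometry, assuming power-of-two sizes:

    #include <stdio.h>

    /* Integer log2 for power-of-two values. */
    static unsigned log2u(unsigned x) { unsigned b = 0; while (x >>= 1) b++; return b; }

    static void field_widths(const char *name, unsigned size_bytes,
                             unsigned ways, unsigned block_bytes, unsigned addr_bits)
    {
        unsigned sets        = size_bytes / block_bytes / ways;
        unsigned index_bits  = log2u(sets);
        unsigned offset_bits = log2u(block_bytes);
        unsigned tag_bits    = addr_bits - index_bits - offset_bits;
        printf("%s: %u sets, index=%u, offset=%u, tag=%u bits\n",
               name, sets, index_bits, offset_bits, tag_bits);
    }

    int main(void)
    {
        field_widths("P4 L1 data", 8 * 1024, 4, 64, 32);    /* 32 sets, 5 + 6, tag 21 */
        field_widths("P4 L2",      256 * 1024, 8, 128, 32); /* 256 sets, 8 + 7, tag 17 */
        return 0;
    }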

Flag bits

An instruction cache requires only one flag bit per cache row entry: a valid bit. The valid bit indicates whether or not a cache block has been loaded with valid data.

On power-up, the hardware sets all the valid bits in all the caches to "invalid". Some systems also set a valid bit to "invalid" at other times, such as when multi-master bus snooping hardware in the cache of one processor hears an address broadcast from some other processor, and realizes that certain data blocks in the local cache are now stale and should be marked invalid.

A data cache typically requires two flag bits per cache line – a valid bit and a dirty bit. Having a dirty bit set indicates that the associated cache line has been changed since it was read from main memory ("dirty"), meaning that the processor has written data to that line and the new value has not propagated all the way to main memory.
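As a minimal sketch (the layout below is illustrative, not how real tag arrays are organized), the flag bits can be pictured as one or two bits attached to each line, cleared at power-up:

    #include <stdint.h>

    #define LINES 512   /* illustrative number of lines in the cache */

    struct d_cache_line {
        uint32_t tag;
        unsigned valid : 1;   /* line holds meaningful data */
        unsigned dirty : 1;   /* line was written and not yet copied back to memory */
        uint8_t  data[64];
    };

    /* On power-up, hardware marks every line invalid so stale contents are ignored. */
    void invalidate_all(struct d_cache_line cache[LINES])
    {
        for (int i = 0; i < LINES; i++) {
            cache[i].valid = 0;
            cache[i].dirty = 0;
        }
    }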

Cache miss

A cache miss is a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss.

Cache read misses from an instruction cache generally cause the largest delay, because the processor, or at least the thread of execution, has to wait (stall) until the instruction is fetched from main memory. Cache read misses from a data cache usually cause a smaller delay, because instructions not dependent on the cache read can be issued and continue execution until the data are returned from main memory, and the dependent instructions can resume execution. Cache write misses to a data cache generally cause the shortest delay, because the write can be queued and there are few limitations on the execution of subsequent instructions; the processor can continue until the queue is full. For a detailed introduction to the types of misses, see cache performance measurement and metric.

Address translation

Most general purpose CPUs implement some form of virtual memory. To summarize, either each program running on the machine sees its own simplified address space, which contains code and data for that program only, or all programs run in a common virtual address space. A program executes by calculating, comparing, reading and writing to addresses of its virtual address space, rather than addresses of physical address space, making programs simpler and thus easier to write.

Virtual memory requires the processor to translate virtual addresses generated by the program into physical addresses in main memory. The portion of the processor that does this translation is known as the memory management unit (MMU). The fast path through the MMU can perform those translations stored in the translation lookaside buffer (TLB), which is a cache of mappings from the operating system's page table, segment table, or both.

For the purposes of the present discussion, there are three important features of address translation:

  • Latency: The physical address is available from the MMU some time, perhaps a few cycles, after the virtual address is available from the address generator.
  • Aliasing: Multiple virtual addresses can map to a single physical address. Most processors guarantee that all updates to that single physical address will happen in program order. To deliver on that guarantee, the processor must ensure that only one copy of a physical address resides in the cache at any given time.
  • Granularity: The virtual address space is broken up into pages. For instance, a 4 GiB virtual address space might be cut up into 1,048,576 pages of 4 KiB size, each of which can be independently mapped. There may be multiple page sizes supported; see virtual memory for elaboration.

One early virtual memory system, the IBM M44/44X, required an access to a mapping table held in core memory before every programmed access to main memory.[28][NB 1] With no caches, and with the mapping table memory running at the same speed as main memory this effectively cut the speed of memory access in half. Two early machines that used a page table in main memory for mapping, the IBM System/360 Model 67 and the GE 645, both had a small associative memory as a cache for accesses to the in-memory page table. Both machines predated the first machine with a cache for main memory, the IBM System/360 Model 85, so the first hardware cache used in a computer system was not a data or instruction cache, but rather a TLB.

Caches can be divided into four types, based on whether the index or tag correspond to physical or virtual addresses:

  • Physically indexed, physically tagged (PIPT) caches use the physical address for both the index and the tag. While this is simple and avoids problems with aliasing, it is also slow, as the physical address must be looked up (which could involve a TLB miss and access to main memory) before that address can be looked up in the cache.
  • Virtually indexed, virtually tagged (VIVT) caches use the virtual address for both the index and the tag. This caching scheme can result in much faster lookups, since the MMU does not need to be consulted first to determine the physical address for a given virtual address. However, VIVT suffers from aliasing problems, where several different virtual addresses may refer to the same physical address. The result is that such addresses would be cached separately despite referring to the same memory, causing coherency problems. Although solutions to this problem exist[31] they do not work for standard coherence protocols. Another problem is homonyms, where the same virtual address maps to several different physical addresses. It is not possible to distinguish these mappings merely by looking at the virtual index itself, though potential solutions include: flushing the cache after a context switch, forcing address spaces to be non-overlapping, tagging the virtual address with an address space ID (ASID). Additionally, there is a problem that virtual-to-physical mappings can change, which would require flushing cache lines, as the VAs would no longer be valid. All these issues are absent if tags use physical addresses (VIPT).
  • Virtually indexed, physically tagged (VIPT) caches use the virtual address for the index and the physical address in the tag. The advantage over PIPT is lower latency, as the cache line can be looked up in parallel with the TLB translation; however, the tag cannot be compared until the physical address is available. The advantage over VIVT is that since the tag has the physical address, the cache can detect homonyms. Theoretically, VIPT requires more tag bits because some of the index bits could differ between the virtual and physical addresses (for example bit 12 and above for 4 KiB pages) and would have to be included both in the virtual index and in the physical tag. In practice this is not an issue because, in order to avoid coherency problems, VIPT caches are designed to have no such index bits (e.g., by limiting the total number of bits for the index and the block offset to 12 for 4 KiB pages); this limits the size of VIPT caches to the page size times the associativity of the cache (see the sketch after this list).
  • Physically indexed, virtually tagged (PIVT) caches are often claimed in literature to be useless and non-existing.[32] However, the MIPS R6000 uses this cache type as the sole known implementation.[33] The R6000 is implemented in emitter-coupled logic, which is an extremely fast technology not suitable for large memories such as a TLB. The R6000 solves the issue by putting the TLB memory into a reserved part of the second-level cache having a tiny, high-speed TLB "slice" on chip. The cache is indexed by the physical address obtained from the TLB slice. However, since the TLB slice only translates those virtual address bits that are necessary to index the cache and does not use any tags, false cache hits may occur, which is solved by tagging with the virtual address.
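The VIPT size bound mentioned above is a simple product; a minimal sketch, assuming 4 KiB pages and power-of-two geometry:

    #include <stdio.h>

    /* Maximum VIPT cache size with no index bits above the page offset:
     * page size times associativity. */
    unsigned max_vipt_size(unsigned page_bytes, unsigned ways)
    {
        return page_bytes * ways;
    }

    int main(void)
    {
        /* With 4 KiB pages, an 8-way VIPT L1 can be at most 32 KiB. */
        printf("%u bytes\n", max_vipt_size(4096, 8));
        return 0;
    }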

The speed of this recurrence (the load latency) is crucial to CPU performance, and so most modern level-1 caches are virtually indexed, which at least allows the MMU's TLB lookup to proceed in parallel with fetching the data from the cache RAM.

But virtual indexing is not the best choice for all cache levels. The cost of dealing with virtual aliases grows with cache size, and as a result most level-2 and larger caches are physically indexed.

Caches have historically used both virtual and physical addresses for the cache tags, although virtual tagging is now uncommon. If the TLB lookup can finish before the cache RAM lookup, then the physical address is available in time for tag compare, and there is no need for virtual tagging. Large caches, then, tend to be physically tagged, and only small, very low latency caches are virtually tagged. In recent general-purpose CPUs, virtual tagging has been superseded by virtual hints, as described below.

Homonym and synonym problems

A cache that relies on virtual indexing and tagging becomes inconsistent after the same virtual address is mapped into different physical addresses (homonym), which can be solved by using physical address for tagging, or by storing the address space identifier in the cache line. However, the latter approach does not help against the synonym problem, in which several cache lines end up storing data for the same physical address. Writing to such locations may update only one location in the cache, leaving the others with inconsistent data. This issue may be solved by using non-overlapping memory layouts for different address spaces, or otherwise the cache (or a part of it) must be flushed when the mapping changes.[34]

Virtual tags and hints

The great advantage of virtual tags is that, for associative caches, they allow the tag match to proceed before the virtual to physical translation is done. However, coherence probes and evictions present a physical address for action. The hardware must have some means of converting the physical addresses into a cache index, generally by storing physical tags as well as virtual tags. For comparison, a physically tagged cache does not need to keep virtual tags, which is simpler. When a virtual to physical mapping is deleted from the TLB, cache entries with those virtual addresses will have to be flushed somehow. Alternatively, if cache entries are allowed on pages not mapped by the TLB, then those entries will have to be flushed when the access rights on those pages are changed in the page table.

It is also possible for the operating system to ensure that no virtual aliases are simultaneously resident in the cache. The operating system makes this guarantee by enforcing page coloring, which is described below. Some early RISC processors (SPARC, RS/6000) took this approach. It has not been used recently, as the hardware cost of detecting and evicting virtual aliases has fallen and the software complexity and performance penalty of perfect page coloring has risen.

It can be useful to distinguish the two functions of tags in an associative cache: they are used to determine which way of the entry set to select, and they are used to determine if the cache hit or missed. The second function must always be correct, but it is permissible for the first function to guess, and get the wrong answer occasionally.

Some processors (e.g. early SPARCs) have caches with both virtual and physical tags. The virtual tags are used for way selection, and the physical tags are used for determining hit or miss. This kind of cache enjoys the latency advantage of a virtually tagged cache, and the simple software interface of a physically tagged cache. It bears the added cost of duplicated tags, however. Also, during miss processing, the alternate ways of the cache line indexed have to be probed for virtual aliases and any matches evicted.

The extra area (and some latency) can be mitigated by keeping virtual hints with each cache entry instead of virtual tags. These hints are a subset or hash of the virtual tag, and are used for selecting the way of the cache from which to get data and a physical tag. Like a virtually tagged cache, there may be a virtual hint match but physical tag mismatch, in which case the cache entry with the matching hint must be evicted so that cache accesses after the cache fill at this address will have just one hint match. Since virtual hints have fewer bits than virtual tags distinguishing them from one another, a virtually hinted cache suffers more conflict misses than a virtually tagged cache.

Perhaps the ultimate reduction of virtual hints can be found in the Pentium 4 (Willamette and Northwood cores). In these processors the virtual hint is effectively two bits, and the cache is four-way set associative. Effectively, the hardware maintains a simple permutation from virtual address to cache index, so that no content-addressable memory (CAM) is necessary to select the right one of the four ways fetched.

Page coloring

Large physically indexed caches (usually secondary caches) run into a problem: the operating system rather than the application controls which pages collide with one another in the cache. Differences in page allocation from one program run to the next lead to differences in the cache collision patterns, which can lead to very large differences in program performance. These differences can make it very difficult to get a consistent and repeatable timing for a benchmark run.

To understand the problem, consider a CPU with a 1 MiB physically indexed direct-mapped level-2 cache and 4 KiB virtual memory pages. Sequential physical pages map to sequential locations in the cache until after 256 pages the pattern wraps around. We can label each physical page with a color of 0–255 to denote where in the cache it can go. Locations within physical pages with different colors cannot conflict in the cache.
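For the geometry just described (a 1 MiB direct-mapped cache and 4 KiB pages, hence 256 colors), a page's color is simply the low bits of its physical page number; a minimal sketch:

    #include <stdint.h>

    #define PAGE_BYTES  4096u
    #define CACHE_BYTES (1024u * 1024u)
    #define COLORS      (CACHE_BYTES / PAGE_BYTES)   /* 256 colors */

    /* The color of a physical page: which slice of the direct-mapped cache it maps to. */
    unsigned page_color(uint64_t physical_address)
    {
        return (unsigned)((physical_address / PAGE_BYTES) % COLORS);
    }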

Programmers attempting to make maximum use of the cache may arrange their programs' access patterns so that only 1 MiB of data need be cached at any given time, thus avoiding capacity misses. But they should also ensure that the access patterns do not have conflict misses. One way to think about this problem is to divide up the virtual pages the program uses and assign them virtual colors in the same way as physical colors were assigned to physical pages before. Programmers can then arrange the access patterns of their code so that no two pages with the same virtual color are in use at the same time. There is a wide literature on such optimizations (e.g. loop nest optimization), largely coming from the High Performance Computing (HPC) community.

The snag is that while all the pages in use at any given moment may have different virtual colors, some may have the same physical colors. In fact, if the operating system assigns physical pages to virtual pages randomly and uniformly, it is extremely likely that some pages will have the same physical color, and then locations from those pages will collide in the cache (this is the birthday paradox).

The solution is to have the operating system attempt to assign different physical color pages to different virtual colors, a technique called page coloring. Although the actual mapping from virtual to physical color is irrelevant to system performance, odd mappings are difficult to keep track of and have little benefit, so most approaches to page coloring simply try to keep physical and virtual page colors the same.

If the operating system can guarantee that each physical page maps to only one virtual color, then there are no virtual aliases, and the processor can use virtually indexed caches with no need for extra virtual alias probes during miss handling. Alternatively, the OS can flush a page from the cache whenever it changes from one virtual color to another. As mentioned above, this approach was used for some early SPARC and RS/6000 designs.

The software page coloring technique has been used to effectively partition the shared Last level Cache (LLC) in multicore processors.[35] This operating system-based LLC management in multicore processors has been adopted by Intel.[36]

Cache hierarchy in a modern processor

Memory hierarchy of an AMD Bulldozer server

Modern processors have multiple interacting on-chip caches. The operation of a particular cache can be completely specified by the cache size, the cache block size, the number of blocks in a set, the cache set replacement policy, and the cache write policy (write-through or write-back).[25]

While all of the cache blocks in a particular cache are the same size and have the same associativity, typically the "higher-level" caches (called Level 1 cache) have a smaller number of blocks, smaller block size, and fewer blocks in a set, but have very short access times. "Lower-level" caches (i.e. Level 2 and below) have progressively larger numbers of blocks, larger block size, more blocks in a set, and relatively longer access times, but are still much faster than main memory.[8]

Cache entry replacement policy is determined by a cache algorithm selected to be implemented by the processor designers. In some cases, multiple algorithms are provided for different kinds of work loads.

Specialized caches

Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-physical address translation, and data fetch (see classic RISC pipeline). The natural design is to use different physical caches for each of these points, so that no one physical resource has to be scheduled to service two points in the pipeline. Thus the pipeline naturally ends up with at least three separate caches (instruction, TLB, and data), each specialized to its particular role.

Victim cache

A victim cache is a cache used to hold blocks evicted from a CPU cache upon replacement. The victim cache lies between the main cache and its refill path, and holds only those blocks of data that were evicted from the main cache. The victim cache is usually fully associative, and is intended to reduce the number of conflict misses. Many commonly used programs do not require an associative mapping for all the accesses. In fact, only a small fraction of the memory accesses of the program require high associativity. The victim cache exploits this property by providing high associativity to only these accesses. It was introduced by Norman Jouppi from DEC in 1990.[37]

Intel's Crystalwell[38] variant of its Haswell processors introduced an on-package 128 MiB eDRAM Level 4 cache which serves as a victim cache to the processors' Level 3 cache.[39] In the Skylake microarchitecture the Level 4 cache no longer works as a victim cache.[40]

Trace cache

One of the more extreme examples of cache specialization is the trace cache (also known as execution trace cache) found in the Intel Pentium 4 microprocessors. A trace cache is a mechanism for increasing the instruction fetch bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces of instructions that have already been fetched and decoded.[41]

A trace cache stores instructions either after they have been decoded, or as they are retired. Generally, instructions are added to trace caches in groups representing either individual basic blocks or dynamic instruction traces. The Pentium 4's trace cache stores micro-operations resulting from decoding x86 instructions, providing also the functionality of a micro-operation cache. Having this, the next time an instruction is needed, it does not have to be decoded into micro-ops again.[42]: 63–68 

Write Coalescing Cache (WCC)

The write coalescing cache[43] is a special cache that is part of the L2 cache in AMD's Bulldozer microarchitecture. Stores from both L1D caches in the module go through the WCC, where they are buffered and coalesced. The WCC's task is to reduce the number of writes to the L2 cache.

Micro-operation (μop or uop) cache

A micro-operation cache (μop cache, uop cache or UC)[44] is a specialized cache that stores micro-operations of decoded instructions, as received directly from the instruction decoders or from the instruction cache. When an instruction needs to be decoded, the μop cache is checked for its decoded form which is re-used if cached; if it is not available, the instruction is decoded and then cached.

One of the early works describing μop cache as an alternative frontend for the Intel P6 processor family is the 2001 paper "Micro-Operation Cache: A Power Aware Frontend for Variable Instruction Length ISA".[45] Later, Intel included μop caches in its Sandy Bridge processors and in successive microarchitectures like Ivy Bridge and Haswell.[42]: 121–123 [46] AMD implemented a μop cache in their Zen microarchitecture.[47]

Fetching complete pre-decoded instructions eliminates the need to repeatedly decode variable length complex instructions into simpler fixed-length micro-operations, and simplifies the process of predicting, fetching, rotating and aligning fetched instructions. A μop cache effectively offloads the fetch and decode hardware, thus decreasing power consumption and improving the frontend supply of decoded micro-operations. The μop cache also increases performance by more consistently delivering decoded micro-operations to the backend and eliminating various bottlenecks in the CPU's fetch and decode logic.[45][46]

A μop cache has many similarities with a trace cache, although a μop cache is much simpler thus providing better power efficiency; this makes it better suited for implementations on battery-powered devices. The main disadvantage of the trace cache, leading to its power inefficiency, is the hardware complexity required for its heuristic deciding on caching and reusing dynamically created instruction traces.[48]

Branch target instruction cache

A branch target cache or branch target instruction cache, the name used on ARM microprocessors,[49] is a specialized cache which holds the first few instructions at the destination of a taken branch. This is used by low-powered processors which do not need a normal instruction cache because the memory system is capable of delivering instructions fast enough to satisfy the CPU without one. However, this only applies to consecutive instructions in sequence; it still takes several cycles of latency to restart instruction fetch at a new address, causing a few cycles of pipeline bubble after a control transfer. A branch target cache provides instructions for those few cycles, avoiding a delay after most taken branches.

This allows full-speed operation with a much smaller cache than a traditional full-time instruction cache.

Smart cache

Smart cache is a level 2 or level 3 caching method for multiple execution cores, developed by Intel.

Smart Cache shares the actual cache memory between the cores of a multi-core processor. In comparison to a dedicated per-core cache, the overall cache miss rate decreases when cores do not require equal parts of the cache space. Consequently, a single core can use the full level 2 or level 3 cache while the other cores are inactive.[50] Furthermore, the shared cache makes it faster to share memory among different execution cores.[51]

Multi-level caches

Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger, slower caches. Multi-level caches generally operate by checking the fastest but smallest cache, level 1 (L1), first; if it hits, the processor proceeds at high speed. If that cache misses, the slower but larger next level cache, level 2 (L2), is checked, and so on, before accessing external memory.

As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache. Price-sensitive designs used this to pull the entire cache hierarchy on-chip, but by the 2010s some of the highest-performance designs returned to having large off-chip caches, which are often implemented in eDRAM and mounted on a multi-chip module, as a fourth cache level. In rare cases, such as in the mainframe CPU IBM z15 (2019), all levels down to L1 are implemented by eDRAM, replacing SRAM entirely (for cache; SRAM is still used for registers[citation needed]). Apple's ARM-based Apple silicon series, starting with the A14 and M1, have a 192 KiB L1i cache for each of the high-performance cores, an unusually large amount; however, the high-efficiency cores only have 128 KiB. Since then other processors such as Intel's Lunar Lake and Qualcomm's Oryon have also implemented similar L1i cache sizes.

The benefits of L3 and L4 caches depend on the application's access patterns. Examples of products incorporating L3 and L4 caches include the following:

  • Alpha 21164 (1995) had 1 to 64 MiB off-chip L3 cache.
  • AMD K6-III (1999) had motherboard-based L3 cache.
  • IBM POWER4 (2001) had off-chip L3 caches of 32 MiB per processor, shared among several processors.
  • Itanium 2 (2003) had a 6 MiB unified level 3 (L3) cache on-die; the Itanium 2 (2003) MX 2 module incorporated two Itanium 2 processors along with a shared 64 MiB L4 cache on a multi-chip module that was pin compatible with a Madison processor.
  • Intel's Xeon MP product codenamed "Tulsa" (2006) features 16 MiB of on-die L3 cache shared between two processor cores.
  • AMD Phenom (2007) with 2 MiB of L3 cache.
  • AMD Phenom II (2008) has up to 6 MiB on-die unified L3 cache.
  • Intel Core i7 (2008) has an 8 MiB on-die unified L3 cache that is inclusive, shared by all cores.
  • Intel Haswell CPUs with integrated Intel Iris Pro Graphics have 128 MiB of eDRAM acting essentially as an L4 cache.[52]

Finally, at the other end of the memory hierarchy, the CPU register file itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software—typically by a compiler, as it allocates registers to hold values retrieved from main memory for, as an example, loop nest optimization. However, with register renaming most compiler register assignments are reallocated dynamically by hardware at runtime into a register bank, allowing the CPU to break false data dependencies and thus easing pipeline hazards.

Register files sometimes also have hierarchy: The Cray-1 (circa 1976) had eight address "A" and eight scalar data "S" registers that were generally usable. There was also a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a data cache. (The Cray-1 did, however, have an instruction cache.)

Multi-core chips

When considering a chip with multiple cores, there is a question of whether the caches should be shared or local to each core. Implementing shared cache inevitably introduces more wiring and complexity. But then, having one cache per chip, rather than core, greatly reduces the amount of space needed, and thus one can include a larger cache.

Typically, sharing the L1 cache is undesirable because the resulting increase in latency would make each core run considerably slower than a single-core chip. However, for the highest-level cache (usually L3, the last one called before accessing memory), having a global cache is desirable for several reasons, such as allowing a single core to use the whole cache, reducing data redundancy by making it possible for different processes or threads to share cached data, and reducing the complexity of utilized cache coherency protocols.[53] For example, an eight-core chip with three levels may include an L1 cache for each core, one intermediate L2 cache for each pair of cores, and one L3 cache shared between all cores.

A shared highest-level cache, which is called before accessing memory (usually L3), is usually referred to as a last-level cache (LLC).[54] Additional techniques are used for increasing the level of parallelism when the LLC is shared between multiple cores, including slicing it into multiple pieces, each addressing certain ranges of memory addresses, which can be accessed independently.[8][55]

Separate versus unified

In a separate cache structure, instructions and data are cached separately, meaning that a cache line is used to cache either instructions or data, but not both; various benefits have been demonstrated with separate data and instruction translation lookaside buffers.[56] In a unified structure, this constraint is not present, and cache lines can be used to cache both instructions and data.

Exclusive versus inclusive

Multi-level caches introduce new design decisions. For instance, in some processors, all data in the L1 cache must also be somewhere in the L2 cache. These caches are called strictly inclusive. Other processors (like the AMD Athlon) have exclusive caches: data are guaranteed to be in at most one of the L1 and L2 caches, never in both. Still other processors (like the Intel Pentium II, III, and 4) do not require that data in the L1 cache also reside in the L2 cache, although it may often do so. There is no universally accepted name for this intermediate policy;[57][58] two common names are "non-exclusive" and "partially-inclusive".

The advantage of exclusive caches is that they store more data. This advantage is larger when the exclusive L1 cache is comparable to the L2 cache, and diminishes if the L2 cache is many times larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line in the L2 is exchanged with a line in the L1. This exchange is quite a bit more work than just copying a line from L2 to L1, which is what an inclusive cache does.[58]

One advantage of strictly inclusive caches is that when external devices or other processors in a multiprocessor system wish to remove a cache line from the processor, they need only have the processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1 cache must be checked as well. As a drawback, there is a correlation between the associativities of L1 and L2 caches: if the L2 cache does not have at least as many ways as all L1 caches together, the effective associativity of the L1 caches is restricted. Another disadvantage of inclusive cache is that whenever there is an eviction in L2 cache, the (possibly) corresponding lines in L1 also have to get evicted in order to maintain inclusiveness. This is quite a bit of work, and would result in a higher L1 miss rate.[58]

Another advantage of inclusive caches is that the larger cache can use larger cache lines, which reduces the size of the secondary cache tags. (Exclusive caches require both caches to have the same size cache lines, so that cache lines can be swapped on a L1 miss, L2 hit.) If the secondary cache is an order of magnitude larger than the primary, and the cache data are an order of magnitude larger than the cache tags, this tag area saved can be comparable to the incremental area needed to store the L1 cache data in the L2.[59]

Scratchpad memory

Scratchpad memory (SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is a high-speed internal memory used for temporary storage of calculations, data, and other work in progress.

Example: the K8

To illustrate both specialization and multi-level caching, here is the cache hierarchy of the K8 core in the AMD Athlon 64 CPU.[60]

Cache hierarchy of the K8 core in the AMD Athlon 64 CPU

The K8 has four specialized caches: an instruction cache, an instruction TLB, a data TLB, and a data cache. Each of these caches is specialized:

  • The instruction cache keeps copies of 64-byte lines of memory, and fetches 16 bytes each cycle. Each byte in this cache is stored in ten bits rather than eight, with the extra bits marking the boundaries of instructions (this is an example of predecoding). The cache has only parity protection rather than ECC, because parity is smaller and any damaged data can be replaced by fresh data fetched from memory (which always has an up-to-date copy of instructions).
  • The instruction TLB keeps copies of page table entries (PTEs). Each cycle's instruction fetch has its virtual address translated through this TLB into a physical address. Each entry is either four or eight bytes in memory. Because the K8 has a variable page size, each of the TLBs is split into two sections, one to keep PTEs that map 4 KiB pages, and one to keep PTEs that map 4 MiB or 2 MiB pages. The split allows the fully associative match circuitry in each section to be simpler. The operating system maps different sections of the virtual address space with different size PTEs.
  • The data TLB has two copies which keep identical entries. The two copies allow two data accesses per cycle to translate virtual addresses to physical addresses. Like the instruction TLB, this TLB is split into two kinds of entries.
  • The data cache keeps copies of 64-byte lines of memory. It is split into 8 banks (each storing 8 KiB of data), and can fetch two 8-byte data each cycle so long as those data are in different banks. There are two copies of the tags, because each 64-byte line is spread among all eight banks. Each tag copy handles one of the two accesses per cycle.

The K8 also has multiple-level caches. There are second-level instruction and data TLBs, which store only PTEs mapping 4 KiB. Both instruction and data caches, and the various TLBs, can fill from the large unified L2 cache. This cache is exclusive to both the L1 instruction and data caches, which means that any 8-byte line can only be in one of the L1 instruction cache, the L1 data cache, or the L2 cache. It is, however, possible for a line in the data cache to have a PTE which is also in one of the TLBs—the operating system is responsible for keeping the TLBs coherent by flushing portions of them when the page tables in memory are updated.

The K8 also caches information that is never stored in memory—prediction information. These caches are not shown in the above diagram. As is usual for this class of CPU, the K8 has fairly complex branch prediction, with tables that help predict whether branches are taken and other tables which predict the targets of branches and jumps. Some of this information is associated with instructions, in both the level 1 instruction cache and the unified secondary cache.

The K8 uses an interesting trick to store prediction information with instructions in the secondary cache. Lines in the secondary cache are protected from accidental data corruption (e.g. by an alpha particle strike) by either ECC or parity, depending on whether those lines were evicted from the data or instruction primary caches. Since the parity code takes fewer bits than the ECC code, lines from the instruction cache have a few spare bits. These bits are used to cache branch prediction information associated with those instructions. The net result is that the branch predictor has a larger effective history table, and so has better accuracy.

More hierarchies

Other processors have other kinds of predictors (e.g., the store-to-load bypass predictor in the DEC Alpha 21264).

These predictors are caches in that they store information that is costly to compute. Some of the terminology used when discussing predictors is the same as that for caches (one speaks of a hit in a branch predictor), but predictors are not generally thought of as part of the cache hierarchy.

The K8 keeps the instruction and data caches coherent in hardware, which means that a store to an instruction address closely following the store instruction will change that following instruction. Other processors, like those in the Alpha and MIPS families, have relied on software to keep the instruction cache coherent. Stores are not guaranteed to show up in the instruction stream until a program calls an operating system facility to ensure coherency.

Tag RAM

Tag RAM on the board of an Intel Pentium III

In computer engineering, a tag RAM is used to specify which of the possible memory locations is currently stored in a CPU cache.[61][62] For a simple, direct-mapped design, fast SRAM can be used. Higher-associativity caches usually employ content-addressable memory.

Implementation

Cache reads are the most common CPU operation that takes more than a single cycle. Program execution time tends to be very sensitive to the latency of a level-1 data cache hit. A great deal of design effort, and often power and silicon area, are expended to make the caches as fast as possible.

The simplest cache is a virtually indexed direct-mapped cache. The virtual address is calculated with an adder, the relevant portion of the address extracted and used to index an SRAM, which returns the loaded data. The data are byte aligned in a byte shifter, and from there are bypassed to the next operation. There is no need for any tag checking in the inner loop – in fact, the tags need not even be read. Later in the pipeline, but before the load instruction is retired, the tag for the loaded data must be read, and checked against the virtual address to make sure there was a cache hit. On a miss, the cache is updated with the requested cache line and the pipeline is restarted.

An associative cache is more complicated, because some form of tag must be read to determine which entry of the cache to select. An N-way set-associative level-1 cache usually reads all N possible tags and N data in parallel, and then chooses the data associated with the matching tag. Level-2 caches sometimes save power by reading the tags first, so that only one data element is read from the data SRAM.

Read path for a 2-way associative cache

The adjacent diagram is intended to clarify the manner in which the various fields of the address are used. Address bit 31 is most significant, bit 0 is least significant. The diagram shows the SRAMs, indexing, and multiplexing for a 4 KiB, 2-way set-associative, virtually indexed and virtually tagged cache with 64 byte (B) lines, a 32-bit read width and 32-bit virtual address.

Because the cache is 4 KiB and has 64 B lines, there are just 64 lines in the cache, and we read two at a time from a Tag SRAM which has 32 rows, each with a pair of 21 bit tags. Although any function of virtual address bits 31 through 6 could be used to index the tag and data SRAMs, it is simplest to use the least significant bits.

Similarly, because the cache is 4 KiB and has a 4 B read path, and reads two ways for each access, the Data SRAM is 512 rows by 8 bytes wide.
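Putting the numbers above together, the read path can be sketched in C as follows; the flat arrays standing in for the tag and data SRAMs, and the function name, are illustrative assumptions rather than a model of any real implementation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Geometry of the example cache above: 4 KiB, 2-way set-associative,
 * 64-byte lines, 32-bit virtual addresses -> 32 sets and 21-bit tags. */
#define LINE_BYTES 64u
#define NUM_SETS   32u
#define NUM_WAYS   2u

typedef struct {
    uint32_t tag;                 /* virtual address bits 31..11 (21 bits) */
    bool     valid;
    uint8_t  data[LINE_BYTES];
} line_t;

static line_t cache[NUM_SETS][NUM_WAYS];   /* stands in for the tag and data SRAMs */

/* Read one aligned 32-bit word; returns true on a hit. */
static bool cache_read32(uint32_t vaddr, uint32_t *out)
{
    uint32_t offset = vaddr & (LINE_BYTES - 1);        /* bits 5..0   */
    uint32_t index  = (vaddr / LINE_BYTES) % NUM_SETS; /* bits 10..6  */
    uint32_t tag    = vaddr >> 11;                     /* bits 31..11 */

    for (unsigned way = 0; way < NUM_WAYS; way++) {    /* both ways are read in parallel in hardware */
        line_t *l = &cache[index][way];
        if (l->valid && l->tag == tag) {
            memcpy(out, &l->data[offset & ~3u], sizeof *out);  /* byte-align / mux the word */
            return true;                               /* hit */
        }
    }
    return false;   /* miss: the line would be refilled and the pipeline restarted */
}

int main(void)
{
    uint32_t value;
    printf("%s\n", cache_read32(0x00001234u, &value) ? "hit" : "miss");  /* cold cache: miss */
    return 0;
}
```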

A more modern cache might be 16 KiB, 4-way set-associative, virtually indexed, virtually hinted, and physically tagged, with 32 B lines, 32-bit read width and 36-bit physical addresses. The read path recurrence for such a cache looks very similar to the path above. Instead of tags, virtual hints are read, and matched against a subset of the virtual address. Later on in the pipeline, the virtual address is translated into a physical address by the TLB, and the physical tag is read (just one, as the virtual hint supplies which way of the cache to read). Finally the physical address is compared to the physical tag to determine if a hit has occurred.

Some SPARC designs have improved the speed of their L1 caches by a few gate delays by collapsing the virtual address adder into the SRAM decoders. (See sum-addressed decoder.)

History

The early history of cache technology is closely tied to the invention and use of virtual memory. Because of the scarcity and cost of semiconductor memories, early mainframe computers in the 1960s used a complex hierarchy of physical memory, mapped onto a flat virtual memory space used by programs. The memory technologies spanned semiconductor, magnetic core, drum and disc. The virtual memory seen and used by programs was flat, and caching was used to fetch data and instructions into the fastest memory ahead of processor access. Extensive studies were done to optimize cache sizes. Optimal values were found to depend greatly on the programming language used, with Algol needing the smallest and Fortran and Cobol needing the largest cache sizes.

In the early days of microcomputer technology, memory access was only slightly slower than register access. But since the 1980s[63] the performance gap between processor and memory has been growing. Microprocessors have advanced much faster than memory, especially in terms of their operating frequency, so memory became a performance bottleneck. While it was technically possible to have all the main memory as fast as the CPU, a more economically viable path has been taken: use plenty of low-speed memory, but also introduce a small high-speed cache memory to alleviate the performance gap. This provided an order of magnitude more capacity—for the same price—with only a slightly reduced combined performance.

First TLB implementations

The first documented uses of a TLB were on the GE 645[64] and the IBM 360/67,[65] both of which used an associative memory as a TLB.

First instruction cache

The first documented use of an instruction cache was on the CDC 6600.[66]

First data cache

The first documented use of a data cache was on the IBM System/360 Model 85.[67]

In 68k microprocessors

The 68010, released in 1982, has a "loop mode" which can be considered a tiny and special-case instruction cache that accelerates loops that consist of only two instructions. The 68020, released in 1984, replaced that with a typical instruction cache of 256 bytes, being the first 68k series processor to feature true on-chip cache memory.

The 68030, released in 1987, is basically a 68020 core with an additional 256-byte data cache, an on-chip memory management unit (MMU), a process shrink, and added burst mode for the caches.

The 68040, released in 1990, has split instruction and data caches of four kilobytes each.

The 68060, released in 1994, has the following: 8 KiB data cache (four-way associative), 8 KiB instruction cache (four-way associative), 96-byte FIFO instruction buffer, 256-entry branch cache, and 64-entry address translation cache MMU buffer (four-way associative).

In x86 microprocessors

Example of a motherboard with an i386 microprocessor (33 MHz), 64 KiB cache (25 ns; 8 chips in the bottom left corner), 2 MiB DRAM (70 ns; 8 SIMMs to the right of the cache), and a cache controller (Austek A38202; to the right of the processor)

As the x86 microprocessors reached clock rates of 20 MHz and above in the 386, small amounts of fast cache memory began to be featured in systems to improve performance. This was because the DRAM used for main memory had significant latency, up to 120 ns, as well as refresh cycles. The cache was constructed from more expensive, but significantly faster, SRAM memory cells, which at the time had latencies around 10–25 ns. The early caches were external to the processor and typically located on the motherboard in the form of eight or nine DIP devices placed in sockets to enable the cache as an optional extra or upgrade feature.

Some versions of the Intel 386 processor could support 16 to 256 KiB of external cache.

With the 486 processor, an 8 KiB cache was integrated directly into the CPU die. This cache was termed Level 1 or L1 cache to differentiate it from the slower on-motherboard, or Level 2 (L2) cache. These on-motherboard caches were much larger, with the most common size being 256 KiB. There were some system boards that contained sockets for the Intel 485Turbocache daughtercard which had either 64 or 128 Kbyte of cache memory.[68][69] The popularity of on-motherboard cache continued through the Pentium MMX era but was made obsolete by the introduction of SDRAM and the growing disparity between bus clock rates and CPU clock rates, which caused on-motherboard cache to be only slightly faster than main memory.

The next development in cache implementation in the x86 microprocessors began with the Pentium Pro, which brought the secondary cache onto the same package as the microprocessor, clocked at the same frequency as the microprocessor.

On-motherboard caches enjoyed prolonged popularity thanks to the AMD K6-2 and AMD K6-III processors that still used Socket 7, which was previously used by Intel with on-motherboard caches. K6-III included 256 KiB on-die L2 cache and took advantage of the on-board cache as a third level cache, named L3 (motherboards with up to 2 MiB of on-board cache were produced). After the Socket 7 became obsolete, on-motherboard cache disappeared from the x86 systems.

Three-level caches were used again with the introduction of the Intel Xeon MP ("Foster" core),[70] where the L3 cache was added to the CPU die. It became common for total cache sizes to grow in newer processor generations, and recently (as of 2011) it is not uncommon to find Level 3 cache sizes of tens of megabytes.[71]

Intel introduced a Level 4 on-package cache with the Haswell microarchitecture. Crystalwell[38] Haswell CPUs, equipped with the GT3e variant of Intel's integrated Iris Pro graphics, effectively feature 128 MiB of embedded DRAM (eDRAM) on the same package. This L4 cache is shared dynamically between the on-die GPU and CPU, and serves as a victim cache to the CPU's L3 cache.[39]

In ARM microprocessors

The Apple M1 CPU has 128 or 192 KiB of L1 instruction cache per core, depending on the core type. This is an unusually large L1 cache for any kind of CPU, not just for a laptop part, and L1 size matters chiefly for latency and single-thread performance. The total cache size, which matters more for throughput, is not unusually large for a laptop, and much larger totals (e.g. L3 or L4) are available in IBM's mainframes.

Current research

Early cache designs focused entirely on the direct cost of cache and RAM and average execution speed. More recent cache designs also consider energy efficiency, fault tolerance, and other goals.[72][73]

There are several tools available to computer architects to help explore tradeoffs between the cache cycle time, energy, and area; the CACTI cache simulator[74] and the SimpleScalar instruction set simulator are two open-source options.

Multi-ported cache

A multi-ported cache is a cache which can serve more than one request at a time. When accessing a traditional cache we normally use a single memory address, whereas in a multi-ported cache we may request N addresses at a time, where N is the number of ports connecting the processor and the cache. The benefit of this is that a pipelined processor may access memory from different phases of its pipeline. Another benefit is that it supports superscalar processors, which can issue multiple memory accesses per cycle through the different cache levels.

from Grokipedia
A CPU cache is a small, high-speed memory integrated into the central processing unit (CPU) that temporarily stores copies of data or instructions frequently accessed from the main memory, thereby reducing the average latency of memory operations and improving overall processor performance. The primary purpose of the CPU cache is to bridge the significant speed gap between the fast CPU core and the comparatively slower main memory (such as DRAM), by leveraging two key principles of program behavior: temporal locality, where recently accessed data is likely to be used again soon, and spatial locality, where data near a recently accessed location is also likely to be referenced. This approach allows the processor to fulfill most requests directly from the cache—a "hit"—in a fraction of the time required for a main memory access, often achieving effective access times close to the cache's own latency of just a few processor cycles.

Modern CPU caches are structured as a multi-level hierarchy to balance speed, size, and cost, typically including Level 1 (L1), Level 2 (L2), and Level 3 (L3) caches, with each successive level being larger in capacity but slower in access time and farther from the CPU core. The L1 cache, the closest to the core, is usually the smallest (e.g., 32–64 KB per core) and divided into separate instruction (L1i) and data (L1d) caches to optimize for different access patterns; L2 caches are moderately larger (e.g., 256 KB–1 MB per core) and often private to each core; while L3 caches, which can span several megabytes to tens of megabytes, are typically shared across multiple cores in multi-core processors to facilitate data sharing and reduce inter-core communication overhead. Data is transferred between levels and main memory in fixed-size units called cache lines (usually 64 bytes), and cache management involves strategies for mapping addresses to cache locations—such as direct-mapped, set-associative, or fully associative organizations—to handle placement, identification via tags, and replacement policies (e.g., least recently used) when space is full.

In multi-core systems, cache hierarchies also address coherence challenges to ensure consistent data views across cores, often using protocols like MESI (Modified, Exclusive, Shared, Invalid) to maintain synchronization without excessive overhead. Advances in cache design continue to focus on increasing hit rates through larger inclusive or non-inclusive structures, prefetching mechanisms, and optimizations for specific workloads, making the cache a critical component of CPU efficiency in everything from embedded devices to servers and supercomputers.

Fundamentals

Definition and Purpose

A CPU cache is a small, high-speed memory integrated near the processor core, designed to store copies of frequently accessed data and instructions from the larger but slower main memory (DRAM). This proximity allows the cache to deliver data in a fraction of the time required for main memory access, exploiting the principle of locality—where programs tend to reuse recently accessed data or nearby addresses—to minimize latency. The fundamental purpose of the CPU cache emerged from the historical disparity in speeds between processors and main memory, a gap that has widened dramatically since the 1980s. Early motivations, as seen in the IBM System/360 Model 85—the first commercial computer with a cache, announced in 1968—stemmed from the need to accelerate effective memory performance, where main memory cycles took several processor cycles (e.g., about 13 times longer), limiting overall throughput. Today, this gap exceeds 100-fold, with modern CPU clock cycles in the sub-nanosecond range contrasting against DRAM latencies of 50-100 nanoseconds, compelling multilevel cache hierarchies to sustain processor utilization. By reducing average memory access time through fast cache hits, the CPU cache enhances instruction execution throughput, enabling processors to operate closer to their peak speeds while also improving energy efficiency by avoiding costly main memory fetches. Cache performance is quantified by the hit rate—the fraction of memory requests satisfied directly from the cache—and the miss rate, defined as 1 minus the hit rate, which determines the frequency of slower main memory interventions. Typical hit rates in well-designed systems range from 90-99%, underscoring the cache's critical role in modern computing.

Basic Components

The fundamental building blocks of a CPU cache consist of tag storage, a data array, valid and dirty bits per cache entry, and control logic to manage these elements. These components work together to enable efficient temporary storage of data from main memory, leveraging the principle of locality to bridge the speed gap between the processor and slower memory systems.

Tag storage captures the upper bits of the physical address for each cached block, allowing the cache to distinguish which specific memory region a given block represents within its set or slot. This tag field is essential for verifying whether a requested address matches a stored block during access attempts. The data array provides the primary storage for the actual contents of the cached blocks and is typically implemented using static random-access memory (SRAM) for its high speed and low latency, suitable for on-chip integration. Each entry in the data array holds a fixed-size block of data, commonly 64 bytes in modern designs, though sizes such as 32 or 128 bytes appear in various architectures depending on the processor generation and optimization goals.

Associated with each cache entry are control bits, including a valid bit that signals whether the block contains usable data (set to 1 upon loading a block from memory) and a dirty bit that indicates if the block has been modified since loading (set to 1 on writes in write-back caches). The valid bit ensures that only initialized entries are considered during lookups, while the dirty bit tracks changes to facilitate efficient write-back to lower levels only when necessary. Control logic encompasses hardware elements such as comparators that match incoming address tags against stored tags and finite state machines that orchestrate updates to the cache state, including bit flips and data movements within the array. These circuits ensure reliable operation by handling edge cases like initial cache states where all valid bits are unset.

Cache designers must balance block sizes, as larger blocks amortize tag storage overhead across more data bytes—reducing the relative cost of tags and control bits—but they also amplify miss penalties by requiring more data transfer time from main memory on misses. For instance, doubling the block size from 32 to 64 bytes can halve tag overhead per byte stored but roughly doubles the latency impact of compulsory or capacity misses.
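As a rough illustration of these building blocks, a single cache entry can be modelled as a data block plus its metadata; the 20-bit tag width and 64-byte block below are assumptions chosen for the sketch, not parameters of any particular processor.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 64u

/* One cache entry: the cached block plus the metadata described above. */
typedef struct {
    uint32_t tag   : 20;   /* upper address bits identifying the cached block      */
    uint32_t valid : 1;    /* entry holds usable data                              */
    uint32_t dirty : 1;    /* block modified since it was loaded (write-back only) */
    uint8_t  data[BLOCK_BYTES];
} cache_entry_t;

int main(void)
{
    /* Doubling BLOCK_BYTES spreads the tag and control bits over more data
     * (less metadata overhead) but doubles the refill traffic on each miss. */
    printf("bytes per entry (data + metadata): %zu\n", sizeof(cache_entry_t));
    return 0;
}
```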

Operation Principles

Cache Entries and Addressing

In CPU caches, data is stored and retrieved in fixed-size units known as cache lines or blocks, which typically range from 32 to 64 bytes in modern processors to balance transfer efficiency and overhead. Each cache line holds a contiguous block of memory along with metadata, such as a tag for identification and a valid bit to indicate usability. The block offset, comprising the lowest log2(block size) bits of the address, specifies the particular byte or word within that line.

To map a memory address to a cache entry, the address is decomposed into three primary fields: the block offset (lowest bits), the index (middle bits), and the tag (highest bits). The block offset, as noted, selects the byte within the line. The index bits determine the specific cache line or set where the data resides, with the number of index bits equal to log2(number of lines or sets). The tag bits, consisting of the remaining upper address bits, uniquely identify the originating memory block to ensure correct matching during access. This decomposition enables efficient hardware lookup by first using the index to locate candidate entries, then comparing the tag for validation.

For example, in a direct-mapped cache of 4 KB capacity with 32-byte lines and a 32-bit address, the offset requires 5 bits (2^5 = 32). This results in 128 lines (4096 / 32 = 128) and thus 7 index bits (2^7 = 128). The tag then occupies the remaining 20 bits (32 − 5 − 7 = 20). The tag is calculated by extracting the high-order bits of the full address, excluding those allocated to the index and offset fields, allowing the cache to distinguish between different blocks that might map to the same index. In set-associative caches, the index selects a set of lines, with tag comparisons performed across ways within that set.
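A short C sketch of this field decomposition, using the worked example above (4 KB direct-mapped cache, 32-byte lines, 32-bit addresses, so 5 offset, 7 index and 20 tag bits); the helper names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

enum { OFFSET_BITS = 5, INDEX_BITS = 7 };   /* 32-byte lines, 128 lines */

static uint32_t block_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
static uint32_t set_index(uint32_t addr)    { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
static uint32_t tag_bits(uint32_t addr)     { return addr >> (OFFSET_BITS + INDEX_BITS); }

int main(void)
{
    uint32_t addr = 0x12345678u;
    printf("offset=%u  index=%u  tag=0x%05x\n",
           block_offset(addr), set_index(addr), tag_bits(addr));
    return 0;
}
```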

Access Process

The read access process in a CPU cache initiates when the processor issues a memory read request accompanied by a memory address. The cache controller first decodes this address, partitioning it into tag bits, index bits, and block offset bits to locate the potential data. The index bits select a specific set of cache lines, and the tag bits are simultaneously compared against the stored tags in all ways (associative entries) within that set using content-addressable memory (CAM) or equivalent parallel comparators. If a tag match occurs and the associated valid bit is asserted—confirming the cache line holds current, usable data—a hit is detected, and the requested bytes are fetched from the data storage array using the offset bits before being delivered to the processor. This valid bit check serves as a basic coherency validation to prevent serving stale or uninitialized data. The entire hit resolution for an L1 cache typically requires 1-4 clock cycles, accounting for parallel tag lookup, valid bit inspection, and data array access. On a miss, where no matching valid tag is found, the request propagates to the next memory level (e.g., L2 cache or main memory), potentially triggering a block fetch and replacement.

Write access mirrors the read process in address decoding and tag comparison but diverges in data handling upon a hit. Following tag match and valid bit confirmation, the incoming data updates the targeted bytes in the cache line's data array via the offset. The specific write policy then governs further actions: in write-through mode, the update is simultaneously written to the next level, while in write-back mode, the line is marked dirty for deferred propagation. Misses during writes may allocate a new line (write-allocate) or bypass the cache (no-allocate), forwarding the operation directly to lower levels without altering the cache state. Hit latency remains comparable to reads at 1-4 cycles for L1, as the core lookup and update occur swiftly.

Design Policies

Replacement Policies

When a cache cannot accommodate a new cache line due to being full, a replacement policy determines which existing line to evict to make space. These policies aim to minimize cache misses by exploiting patterns in memory access, such as temporal locality, where recently accessed data is likely to be accessed again soon. Common policies balance prediction accuracy with implementation complexity in hardware.

The Least Recently Used (LRU) policy selects for eviction the cache line that has gone the longest without being accessed. It maintains an ordering of lines based on recency, often using a stack or counters to track access times, making it well-suited for workloads exhibiting strong temporal locality. Implementing true LRU requires significant hardware overhead, such as O(n) updates per access for an n-way associative cache, which becomes prohibitive for large associativities. The First-In-First-Out (FIFO) policy treats the cache as a queue, evicting the oldest line inserted regardless of its usage history. This approach is simple and incurs minimal hardware cost, typically just a counter or pointer per set, but it performs poorly on accesses with temporal locality since it ignores recency and can evict frequently used lines prematurely. Random replacement selects a line for eviction using a pseudo-random mechanism, such as a linear-feedback shift register. It requires negligible storage and update overhead compared to LRU or FIFO, avoiding complex bookkeeping, but its performance is suboptimal as it does not leverage access patterns, leading to higher miss rates in locality-heavy workloads.

To address LRU's complexity, approximations like pseudo-LRU (PLRU) are commonly used in modern CPU caches, as sketched in the example below. PLRU employs a tree-based structure or bit fields to approximate recency ordering with reduced hardware cost; for example, a tree-PLRU for an 8-way cache uses only 7 bits per set instead of full LRU counters. Evaluations show PLRU achieves miss rates close to LRU (often within 1-5% higher) while using far less area and power, making it preferable for high-associativity designs. A theoretical benchmark is Belady's optimal policy, which evicts the line whose next access is farthest in the future. This requires complete knowledge of future accesses, rendering it impractical for real hardware but invaluable for analyzing performance bounds. Replacement policies like LRU and PLRU can reduce miss rates by 20-50% compared to random or FIFO in typical workloads, though exact gains depend on access patterns.
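A minimal sketch of tree-PLRU for a single 4-way set, assuming a three-bit tree as described above; the bit conventions and helper names are illustrative rather than taken from any specific design.

```c
#include <stdint.h>
#include <stdio.h>

/* Three bits approximate the recency order of four ways:
 * b0 picks the older half, b1/b2 pick the older way within each half. */
typedef struct { uint8_t b0, b1, b2; } plru4_t;

/* Follow the bits toward the less recently used side to pick a victim. */
static unsigned plru_victim(const plru4_t *s)
{
    if (s->b0 == 0) return s->b1 == 0 ? 0u : 1u;   /* left half: ways 0,1  */
    else            return s->b2 == 0 ? 2u : 3u;   /* right half: ways 2,3 */
}

/* On a hit or fill, flip the bits on the path to point away from the way just used. */
static void plru_touch(plru4_t *s, unsigned way)
{
    if (way < 2) { s->b0 = 1; s->b1 = (way == 0); }
    else         { s->b0 = 0; s->b2 = (way == 2); }
}

int main(void)
{
    plru4_t s = {0, 0, 0};
    unsigned trace[] = {0, 1, 2, 0};
    for (unsigned i = 0; i < 4; i++) plru_touch(&s, trace[i]);
    printf("next victim: way %u\n", plru_victim(&s));   /* prints way 3, never touched */
    return 0;
}
```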

Write Policies

Write policies in CPU caches dictate how write operations from the processor are managed, balancing factors such as data consistency, latency, and bandwidth usage between the cache and lower levels of the memory hierarchy. These policies primarily address whether and when updates are propagated to main memory, as well as how cache lines are handled on write misses. Common strategies include write-through, write-back, and write-around approaches, often combined with allocation decisions.

The write-through policy updates both the cache and the backing main memory synchronously for every write operation. This ensures immediate consistency between the cache and main memory, simplifying recovery in case of failures and avoiding the need for complex tracking mechanisms. However, it incurs higher latency and increased bandwidth demands on the memory bus due to frequent writes, making it suitable for systems where consistency is paramount over speed. In contrast, the write-back policy (also known as copy-back or write-deferred) updates only the cache line immediately, deferring the write to main memory until the line is evicted or explicitly flushed. A dirty bit is set in the cache entry to flag modifications, indicating that the data differs from main memory. This approach reduces bus traffic by batching writes and avoiding unnecessary updates to unchanged portions of a block, thereby improving overall performance in write-intensive workloads. Drawbacks include the risk of data loss during power failures or crashes before eviction, as well as increased complexity in handling consistency across multiple cache levels. On eviction, the dirty bit triggers the write to main memory as part of the replacement process.

Write-around, or write-through with no-write-allocation, bypasses the cache entirely for write misses by updating only main memory directly. This prevents cache pollution from temporary or streaming write data that is unlikely to be read soon, conserving cache space for read-heavy accesses. It is particularly effective for workloads with large sequential writes, though it offers no caching benefits for subsequent reads to the written data unless a separate read miss populates the cache. Allocation policies complement write strategies by determining whether to fetch a missing block into the cache on a write miss. Write-allocate brings the entire block from main memory into the cache before performing the update, commonly paired with write-back to support future accesses to the same line. No-write-allocate, often used with write-through or write-around, skips this fetch and writes directly to main memory, avoiding the overhead of loading irrelevant data and reducing initial latency for one-time writes.

The dirty bit is a critical one-bit flag per cache line, primarily used in write-back policies to denote whether the line's contents have been modified since being loaded from main memory. When a write occurs, the bit is set to true, ensuring that the updated data is propagated to main memory only when necessary, such as during eviction or cache flushes. This mechanism optimizes bandwidth by distinguishing modified lines from clean ones but adds hardware overhead for bit storage and coherence checks.
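The behavioural difference between the policies can be sketched as follows, with a hypothetical next_level_write function standing in for the L2 cache or memory controller; real hardware implements this in control logic rather than software.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint32_t tag;
    bool     valid, dirty;
    uint8_t  data[64];
} line_t;

typedef enum { WRITE_THROUGH, WRITE_BACK } write_policy_t;

/* Stand-in for the next memory level (L2 or DRAM). */
static void next_level_write(uint32_t addr, const void *buf, size_t n)
{
    printf("write %zu bytes to next level at 0x%08x\n", n, addr);
    (void)buf;
}

/* Write hit: the cached copy is always updated; the policy decides whether
 * the next level is updated now or only when the line is evicted.          */
static void write_hit(line_t *l, uint32_t addr, uint32_t offset,
                      const void *buf, size_t n, write_policy_t pol)
{
    memcpy(&l->data[offset], buf, n);
    if (pol == WRITE_THROUGH)
        next_level_write(addr, buf, n);   /* propagate immediately        */
    else
        l->dirty = true;                  /* mark for deferred write-back */
}

/* Eviction under write-back: only dirty lines generate memory traffic. */
static void evict(line_t *l, uint32_t line_addr)
{
    if (l->valid && l->dirty)
        next_level_write(line_addr, l->data, sizeof l->data);
    l->valid = l->dirty = false;
}

int main(void)
{
    line_t l = { .tag = 1, .valid = true };
    uint32_t word = 0xdeadbeefu;
    write_hit(&l, 0x1000u, 0, &word, sizeof word, WRITE_BACK);  /* sets the dirty bit  */
    evict(&l, 0x1000u);                                         /* one write-back here */
    return 0;
}
```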

Associativity Mechanisms

Direct-Mapped Caches

In a direct-mapped cache, each memory block from the main memory is mapped to exactly one specific cache line using a portion of the address known as the index bits. This mapping is determined by dividing the physical address into three fields: the tag, the index, and the block offset. The index bits select the cache line, while the tag bits are compared against the stored tag in that line to verify if it matches the requested block; a full tag match is required for a hit. If the tags match, the data from the selected line is output via a simple multiplexer; otherwise, it is a miss. The hardware implementation of a direct-mapped cache is straightforward, requiring only one tag store entry per cache line and a single comparator for tag matching during access. This design uses a multiplexer to route the data from the indexed line to the output, minimizing the complexity of the cache controller. As a result, direct-mapped caches occupy less area compared to higher-associativity designs.

Key advantages of direct-mapped caches include fast lookup times, as only one tag comparison is needed, leading to lower access latency and reduced power consumption due to fewer parallel comparisons and simpler wiring. The minimal hardware overhead also makes them suitable for high-speed, area-constrained environments like primary caches in microprocessors. However, direct-mapped caches suffer from high conflict miss rates, where multiple memory blocks map to the same cache line and thrash each other out, particularly in workloads with sequential accesses that stride across the cache size, such as matrix operations or linked lists. This can degrade overall performance more than in set-associative caches, which mitigate such conflicts by allowing multiple lines per index. For example, in a 64 KB direct-mapped L1 cache with 32-byte lines, the cache holds 2048 lines, requiring 11 index bits to select a line (since 2^11 = 2048); the remaining upper bits form the tag for comparison, assuming a typical 32-bit or larger address where the low 5 bits are the block offset.

Set-Associative Caches

Set-associative caches organize the cache memory into multiple sets, where each set contains a fixed number of cache lines, known as ways. For an n-way set-associative cache, the lower bits of the memory address serve as an index to select one specific set from the total number of sets, while the upper tag bits are compared simultaneously against the tags stored in all n lines within that set to identify a potential hit. This parallel tag comparison allows for efficient lookup, as only the lines in the indexed set need to be examined, rather than the entire cache.

The primary benefit of set-associative caches lies in their ability to mitigate conflict misses that plague direct-mapped caches, where multiple memory blocks map to the same cache line and thrash each other out. By permitting a block to reside in any of the n ways within its designated set, these caches provide greater flexibility in block placement, leading to lower overall miss rates for a given cache size. This design strikes a balance between the simplicity and speed of direct-mapped caches and the high hit rates of fully associative caches, offering improved performance without the prohibitive hardware costs of searching the entire cache on every access.

In terms of hardware implementation, tag matching in set-associative caches typically employs parallel comparators for each way in the selected set, often combined with multiplexer trees to route the matching data to the processor. Alternatively, content-addressable memory (CAM) can be used for the tag array within each set to perform simultaneous comparisons, though this increases power and area overhead compared to standard SRAM-based tags with dedicated comparators. To further optimize access speed and reduce energy, way prediction techniques forecast the most likely way containing the requested data, allowing the cache to initially access only that way's data array while verifying the tag in parallel; a misprediction incurs a small penalty but avoids probing all ways upfront.

Variations on the standard set-associative design include skewed-associative caches, which apply different hash functions or bit permutations to the index for each way, dispersing conflicting blocks across sets to further reduce misses without increasing associativity (see the sketch below). Another approach is the pseudo-associative cache, which begins with a direct-mapped lookup and, upon a miss, sequentially probes one or more alternative locations using a secondary hash or fixed offset, approximating higher associativity with minimal additional hardware. However, increasing the degree of associativity in set-associative caches introduces scalability challenges, as higher n requires more parallel tag comparisons and wider multiplexers, which elevate access latency, silicon area, and dynamic power dissipation. For instance, transitioning from 2-way to 8-way associativity can double the hit latency in some designs due to the expanded critical path in the tag matching logic. Victim caches, small fully associative buffers that store recently evicted lines, serve as a hardware extension to capture many of these residual conflict misses with low overhead.
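A toy illustration of the skewed-associative idea mentioned above: each way gets its own index function, so two blocks that collide in one way usually land in different sets of the other. Both hash functions below are illustrative assumptions, not the hashes used in any real design.

```c
#include <stdint.h>
#include <stdio.h>

#define SETS_PER_WAY 256u   /* 8 index bits per way */
#define OFFSET_BITS  6u     /* 64-byte lines        */

static uint32_t index_way0(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & (SETS_PER_WAY - 1);
}

static uint32_t index_way1(uint32_t addr)
{
    uint32_t block = addr >> OFFSET_BITS;
    return (block ^ (block >> 8)) & (SETS_PER_WAY - 1);  /* mix in higher address bits */
}

int main(void)
{
    uint32_t a = 0x0000u, b = 0x4000u;   /* same set in way 0 ...            */
    printf("way0: %u vs %u,  way1: %u vs %u\n",
           index_way0(a), index_way0(b),  /* ... but different sets in way 1 */
           index_way1(a), index_way1(b));
    return 0;
}
```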

Performance Aspects

Cache Misses and Hits

A cache hit occurs when the requested data or instruction is found in the cache, allowing the processor to access it directly without needing to retrieve it from a lower level of the memory hierarchy. Hits are categorized based on the type of access: instruction hits, where fetched instructions are present in the instruction cache; read hits, where data for a load operation is available in the cache; and write hits, where data for a store operation is located in the cache, enabling immediate update according to the cache's write policy. These hits typically take 1-4 clock cycles, depending on the cache level, significantly faster than accessing main memory.

In contrast, a cache miss happens when the requested item is not present in the cache, requiring the processor to fetch it from a slower memory level, such as another cache or DRAM. Misses are classified into three main types: compulsory misses, which occur on the first access to a memory block that has never been referenced before; capacity misses, resulting from the cache being too small to hold all actively used blocks during program execution; and conflict misses, arising in set-associative or direct-mapped caches when multiple blocks compete for the same cache set, leading to evictions even if space is available elsewhere. Conflict misses can be influenced by replacement policies, such as least recently used (LRU), which determine which block is evicted during contention. The miss penalty represents the additional clock cycles incurred to resolve a miss, often ranging from 10 to 100 cycles for an L1 cache miss escalating to DRAM access, though this can exceed 200 cycles in modern systems due to increasing processor speeds relative to memory latency.

Upon a miss, traditional blocking caches cause the processor to stall, halting instruction execution until the data is fetched. Non-blocking caches, however, permit the processor to continue executing independent instructions during the miss resolution, a technique known as hit-under-miss, which overlaps computation with memory operations. Modern processors mitigate miss impacts through techniques like hardware prefetching, which anticipates and loads likely future data into the cache ahead of time, and out-of-order execution, which rearranges instructions to proceed past dependent loads while misses are serviced. These approaches reduce effective stall times without altering the fundamental hit/miss classification.

Performance Metrics

The performance of a CPU cache is evaluated using several key metrics that quantify its efficiency in reducing memory access latency and improving overall system throughput. The hit time represents the latency incurred for a successful cache access, typically measured in processor clock cycles. For instance, the level-1 (L1) cache often achieves a hit time of 1 cycle in idealized models or simple processor designs, allowing rapid data retrieval without stalling the pipeline. In contrast, the miss penalty is the additional time required to service a cache miss, which involves fetching data from a lower level of the memory hierarchy, such as main memory, and can range from tens to hundreds of cycles depending on the system configuration.

A fundamental metric integrating these factors is the Average Memory Access Time (AMAT), which provides a comprehensive measure of effective memory latency. The formula for AMAT is given by:

AMAT = hit time + (miss rate × miss penalty)

where the miss rate is the fraction of memory accesses that result in a cache miss. This equation captures how improvements in hit rate or reductions in miss penalty directly lower the average access time, thereby enhancing processor performance. Another important metric is the effective bandwidth, which assesses the cache's data throughput by accounting for both hits and misses. It is calculated as the total data transferred divided by the total time, incorporating the high-speed delivery on hits (limited by cache port bandwidth) and the slower fetches on misses (constrained by lower-level memory bandwidth). In bandwidth-bound workloads, effective bandwidth can be significantly lower than peak cache bandwidth due to miss-induced stalls, emphasizing the need to minimize miss rates for sustained performance.

Cache performance metrics are typically measured using hardware performance monitoring counters (PMCs), which track events such as cache misses and hits at runtime. For example, many processors provide PMCs to count L1 and L2 cache miss events, enabling precise calculation of miss rates without simulation overhead. Simulation tools, such as those based on trace-driven models, complement hardware measurements by evaluating design variations under diverse workloads. A key trade-off in cache design involves balancing size against hit time: larger caches reduce capacity-related misses and improve overall hit rates, but they increase hit time due to longer access paths and higher power consumption. Studies indicate that cache sizes between 32 KB and 128 KB often optimize this trade-off, as further increases yield diminishing returns in miss rate reduction while proportionally raising cycle times.
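Applying the AMAT formula with assumed example numbers (4-cycle L1 hit, 5% L1 miss rate, 12-cycle L2 hit, 20% L2 miss rate, 200-cycle DRAM penalty) shows how a second cache level pulls the average access time down; the figures are illustrative only.

```c
#include <stdio.h>

int main(void)
{
    double l1_hit = 4.0,  l1_miss_rate = 0.05;    /* cycles, fraction of accesses */
    double l2_hit = 12.0, l2_miss_rate = 0.20;
    double mem_penalty = 200.0;                   /* DRAM access, cycles          */

    /* Single level: AMAT = hit time + miss rate * miss penalty */
    double amat_l1_only = l1_hit + l1_miss_rate * mem_penalty;

    /* Two levels: the L1 miss penalty is itself an AMAT for the L2 */
    double l2_amat        = l2_hit + l2_miss_rate * mem_penalty;
    double amat_two_level = l1_hit + l1_miss_rate * l2_amat;

    printf("AMAT, L1 only : %.1f cycles\n", amat_l1_only);    /* 14.0 */
    printf("AMAT, L1 + L2 : %.1f cycles\n", amat_two_level);  /*  6.6 */
    return 0;
}
```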

Advanced Features

Cache Hierarchy

Modern CPUs typically employ a multi-level cache hierarchy to balance speed, size, and cost, consisting of at least three levels: L1, L2, and L3 caches. The L1 cache is the smallest and fastest, usually split into separate instruction (L1i) and data (L1d) caches, each around 32 KB per core, and is private to each core for minimal latency in accessing frequently used data and instructions. The L2 cache is larger, typically 256 KB to 1 MB per core, and can be either private to each core or shared among a few cores, serving as an intermediate buffer with slightly higher latency than L1. The L3 cache, often called the last-level cache (LLC), is the largest, ranging from several MB to tens of MB, and is shared among all cores on a chip or socket to provide a unified pool for less frequently accessed data.

Cache hierarchies adopt inclusion policies to manage data duplication across levels, with inclusive and exclusive being the primary variants. In an inclusive hierarchy, all data in higher levels (L1 and L2) is also present in the lower level (L3), ensuring that the LLC contains a superset of upper-level contents, which simplifies coherence but may waste space due to duplication. Conversely, an exclusive hierarchy prohibits overlap, where data evicted from L1 or L2 is not stored in L3, maximizing effective cache capacity but complicating the management of cache states. Many processors use non-inclusive policies for the L3 to balance these trade-offs, allowing some but not all upper-level data in the LLC.

At the L1 level, caches are often separate for instructions and data to optimize fetch and load/store operations, while L2 and L3 are typically unified, handling both types of accesses in a single structure to improve utilization in mixed workloads. In multi-core processors, L1 and L2 remain private to each core for low-latency access, whereas the shared L3 facilitates inter-core data sharing and reduces off-chip traffic, enhancing overall system coherency. To maintain consistency across the hierarchy and cores, snoop protocols are employed, where caches monitor (or "snoop") bus or interconnect transactions to detect modifications and invalidate or update local copies accordingly. In multi-level setups, snoop filters track cache line locations in private caches, filtering unnecessary probes to the shared L3 and minimizing coherence overhead in multi-socket systems.

Specialized Cache Types

Specialized caches address specific performance bottlenecks in processors by tailoring storage and access mechanisms to particular types of data or operations, often complementing standard cache hierarchies. The victim cache, introduced by Norman Jouppi, is a small, fully associative buffer that captures cache lines recently evicted from a direct-mapped primary cache, thereby reducing conflict misses by providing quick access to these "victims" without fetching from lower memory levels. Simulations in the original work showed that a victim cache of just 1 to 5 entries could eliminate up to 85% of conflict misses in direct-mapped caches for benchmark workloads. This design is particularly effective in environments where associativity is limited to minimize hardware cost and access latency.

Trace caches store sequences of decoded instructions, known as traces, rather than raw instruction bytes, enabling faster fetch in out-of-order processors by delivering contiguous micro-operations (μops) across branch boundaries. Pioneered by Rotenberg et al., this approach improves instruction bandwidth by making non-contiguous instruction streams appear sequential, with evaluations indicating up to 30% higher fetch rates in wide-issue superscalar designs. Intel implemented trace caching in the Pentium 4's NetBurst microarchitecture, where the execution trace cache served as the primary L1 instruction cache, holding up to 12K μops and achieving hit rates comparable to 8-16 KB conventional caches. Micro-op caches, or μop caches, hold pre-decoded sequences of micro-operations to bypass the instruction decoder in subsequent executions, reducing front-end latency in complex instruction set architectures like x86. First deployed in Intel's Sandy Bridge microarchitecture, this cache stores 1.5K μops per core, with subsequent architectures expanding capacity to 4K–6K μops, allowing hot code regions to avoid repeated decoding and improving overall throughput by 5-10% in integer workloads. The structure is typically set-associative and integrated near the decoder, prioritizing frequently executed paths to minimize power and delay in the front-end.

The branch target buffer (BTB) is a specialized cache that stores the target addresses of branch instructions alongside prediction information, enabling rapid resolution of branch targets by prefetching instructions from predicted targets without full address decoding. As described by Lee and Smith, BTBs function as a small cache indexed by the branch instruction's address, with early designs using 512-2048 entries to achieve misprediction penalties under 10 cycles in pipelined processors. Modern implementations often employ multi-level BTBs with varying associativities to balance accuracy and latency, significantly boosting instruction fetch efficiency in branch-heavy workloads.

Write coalescing caches merge multiple small or non-contiguous writes into larger, contiguous blocks before propagating them to lower memory levels, thereby improving bandwidth utilization and reducing I/O overhead in write-intensive scenarios. This technique, explored in last-level cache designs, maximizes in-cache displacement to coalesce overwrites, with studies showing up to 40% reduction in writeback traffic for applications with fragmented stores. Commonly applied in embedded systems or storage accelerators, it avoids unnecessary line evictions by buffering writes until a full cache line is accumulated.

Smart caches enable dynamic reconfiguration of cache resources, such as reallocating ways or sets between instruction and data storage based on workload demands, to optimize hit rates without fixed partitioning. Proposed in reconfigurable cache architectures, this adaptability allows processors to shift capacity—e.g., increasing instruction cache space for compute-bound tasks—yielding savings of 20-50% in media processing benchmarks through runtime partitioning. Intel's Smart Cache extends this concept to multi-core shared L2/L3 levels, dynamically assigning space to active cores, though core-specific instruction/data reconfiguration remains prominent in research for single-thread efficiency.

Implementation Techniques

Address Translation in Caches

In modern processors, CPU caches must interface with virtual memory systems, where programs operate using virtual addresses (VAs) that are translated to physical addresses (PAs) by the memory management unit (MMU). This translation introduces complexities in cache design, as caches need to efficiently handle both virtual and physical addressing to minimize latency while ensuring consistency. The primary approaches to address translation in caches are virtually indexed, physically tagged (VIPT) and physically indexed, physically tagged (PIPT) schemes, each balancing speed, size, and correctness differently.

VIPT caches, commonly used for L1 instruction and data caches, generate the cache index from bits of the virtual address while storing the physical address in the tag field for comparison. This design allows the cache lookup to proceed in parallel with the translation lookaside buffer (TLB) lookup, as the virtual index bits are available immediately from the processor, potentially overlapping the translation latency and reducing overall access time. However, VIPT requires that the index bits do not overlap with the virtual page offset in a way that exceeds the physical page size, limiting cache size to avoid aliasing issues. In contrast, PIPT caches use the full physical address for both indexing and tagging, necessitating a complete VA-to-PA translation before any cache access can occur. This approach eliminates aliasing problems inherent in virtual addressing but incurs higher latency, as the TLB lookup and any page table walk must complete serially before the cache probe, making it more suitable for larger, lower-level caches like L2 or L3 where consistency is prioritized over speed.

A key challenge in VIPT caches is the synonym problem, where multiple virtual addresses map to the same physical location, potentially leading to multiple cache entries holding the same data and causing inconsistencies during updates or coherence operations. This arises because different processes or threads may alias the same physical page via distinct virtual mappings, and solutions typically involve flushing the cache on context switches or using hardware/software mechanisms for invalidation to ensure only one valid entry remains. A related issue occurs when different physical pages have virtual addresses sharing the same low-order bits used for indexing, resulting in false cache misses or conflicts within the same set. This is mitigated through page coloring techniques, where the operating system allocates physical pages such that their low-order bits (colors) match the corresponding virtual page bits, preventing unrelated pages from colliding in the cache index.

The TLB plays a crucial role in facilitating efficient address translation for caches by caching recent VA-to-PA mappings from the page tables, allowing quick resolution of translations without frequent main memory accesses. In VIPT designs, the TLB provides the physical tag bits concurrently with the virtual index, enabling hit detection shortly after translation; a TLB miss, however, triggers a page table walk that can stall the cache access. Some advanced designs incorporate virtual address hints to further optimize translation paths in hybrid caching schemes.
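The VIPT size constraint can be made concrete with a small calculation: if the index and offset bits must lie entirely within the page offset, the cache can grow only to the page size times the associativity. The sketch below assumes 4 KiB pages; the figures are illustrative.

```c
#include <stdio.h>

int main(void)
{
    unsigned page_size = 4096;                /* 4 KiB pages */
    unsigned ways[] = { 1, 2, 4, 8 };

    for (int i = 0; i < 4; i++)
        printf("%u-way VIPT cache without aliasing: up to %u KiB\n",
               ways[i], page_size * ways[i] / 1024);   /* 4, 8, 16, 32 KiB */
    return 0;
}
```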

Multi-Port and On-Chip Designs

Multi-ported caches enable simultaneous access to cached data from multiple pipeline stages or execution units in modern processors, supporting higher instruction throughput in pipelined and out-of-order designs. Dual-ported caches, for instance, allow one port for instruction fetches and another for data operations, while triple-ported variants accommodate additional simultaneous reads from load/store units, though they increase area by 20-50% compared to single-ported equivalents. These designs mitigate bandwidth bottlenecks in wide-issue processors, where issue widths exceed four instructions per cycle, but require careful banking to avoid port conflicts.

On-chip caches are predominantly implemented using static random-access memory (SRAM) cells, with the conventional 6-transistor (6T) cell providing stable single-port access suitable for basic read/write operations in L1 and L2 caches. For multi-ported needs, 8-transistor (8T) cells are employed, incorporating separate read and write ports to enhance stability and speed under contention, albeit at the cost of about 20% higher area per bit than 6T cells. Compared to off-chip dynamic RAM (DRAM), on-chip SRAM offers 5-10x lower latency and no refresh overhead, but exhibits 10-100x lower density and higher static power consumption per bit due to always-on leakage.

To optimize access times and area, cache implementations often separate the tag RAM from the data array, using dedicated high-speed SRAM for tags that store address identifiers while the larger data array holds the actual cache lines. This decoupling allows parallel tag lookups without activating the full data array until a hit is confirmed, reducing access energy by up to 25% in set-associative caches and enabling smaller, faster tag storage optimized for comparison logic.

In contemporary multi-core processors, caches are tightly integrated on-chip within system-on-chip (SoC) designs, with shared L3 caches scaling to 256–1,152 MB in server chips like the AMD EPYC 9005 series (as of 2025) to support inter-core data sharing and reduce off-chip traffic. Advanced process nodes, such as 3 nm, enable denser SRAM integration, as seen in chips with up to 96 MB of stacked L3 cache per core complex die using technologies like AMD's 3D V-Cache (introduced in 2022), balancing capacity gains with fabrication feasibility. Fabricating large on-chip caches poses challenges in yield and thermal management, as increasing array sizes amplifies defect probabilities, potentially dropping yields below 80% for multi-megabyte structures without redundancy techniques. Heat dissipation intensifies with cache size, given SRAM's leakage-dominated power profile, leading to hotspots that can elevate die temperatures by 20-30°C and necessitate advanced cooling like microchannel liquid systems in high-performance servers.

Historical Development

Early Innovations

The concept of cache memory originated in the mid-1960s with Maurice Wilkes's proposal for "slave memories," a small, fast auxiliary storage acting as a buffer between the processor and a larger, slower main memory to exploit locality and reduce access times. In his seminal paper, Wilkes described a slave memory of around 32,000 words operating as a high-speed cache for frequently accessed data, emphasizing dynamic allocation to minimize latency while keeping costs low compared to fully fast main storage. This idea laid the groundwork for modern caching by introducing the notion of a hierarchical system where the slave memory holds the most active portions of the program, fetched on demand from the master memory.

The first commercial implementation of a CPU cache appeared in the IBM System/360 Model 85, announced in 1968 and shipped starting in 1969. This mainframe featured a high-speed buffer storage, or cache, of 16 KB standard capacity (expandable to 32 KB), implemented using fast monolithic integrated circuits to bridge the speed gap between the processor and core main memory. The cache employed a set-associative design with dynamic sector assignment, where 16 sectors of 1 KB each were mapped to main storage blocks, using an activity list for replacement to optimize hit rates and performance in scientific workloads. Studies leading to its adoption showed hit rates exceeding 90% for typical programs, significantly boosting effective performance without requiring fully fast main storage.

In the 1970s, IBM extended caching principles to support virtual addressing with the introduction of translation lookaside buffers (TLBs) in the System/370 architecture, announced in 1970. The TLB served as a dedicated cache for page table entries, storing recent virtual-to-real address mappings to accelerate dynamic address translation (DAT) and reduce overhead in virtual memory systems. Models like the 155, 158, 165, and 168 integrated the TLB hardware to handle paging without excessive table walks, enabling efficient multiprogramming on mainframes with up to 8 MB of memory. This innovation was crucial for the System/370's virtual storage capabilities, as it minimized translation latency to a few cycles per access.

By the early 1980s, caches began appearing in microprocessors, with the Motorola 68020—announced in 1982 and released in 1984—introducing the first on-chip instruction cache in the 68k family. This 256-byte cache improved performance by prefetching instructions ahead of execution, achieving hit rates that reduced average fetch times in pipelined operations. The design addressed the growing processor-memory speed gap in 32-bit systems, using a simple direct-mapped organization for cost-effective integration on a single chip.

Early cache implementations faced significant challenges, primarily the high cost of fast memory technologies like magnetic core in the 1960s and early static RAM (SRAM) in the 1970s, which limited cache sizes to kilobytes and restricted adoption to high-end systems. Core memory, while reliable, was expensive at $1–$5 per bit and slow (around 1–2 μs access), making larger caches uneconomical; early SRAM offered sub-μs speeds but at 10–100 times the cost per bit of dynamic RAM. Additionally, the lack of standardized design methodologies complicated optimization, as architects balanced associativity, replacement policies, and coherence without established benchmarks, often relying on simulations for specific workloads.

Evolution in Processor Architectures

The evolution of CPU caches within major instruction set architectures (ISAs) since the late 1980s has been driven by the need to balance increasing core counts, power efficiency, and performance demands in processors. In the x86 architecture, Intel's 80486 microprocessor, introduced in 1989, marked the first integration of an on-chip cache, featuring an 8 KB unified instruction and data cache to reduce latency compared to off-chip designs. This unified approach stored both instructions and data in a single structure, leveraging a write-through policy for simplicity. By 1993, the Pentium processor advanced this design with a split level-1 (L1) cache, comprising separate 8 KB instruction and 8 KB data caches, which allowed simultaneous access to instructions and data, mitigating bottlenecks in superscalar execution. In modern x86 implementations, such as Intel's Xeon Scalable processors in the 2020s, multi-level hierarchies have scaled dramatically, with 4th Gen Xeon Scalable parts (2023) offering up to 112.5 MB of shared L3 cache per socket to support up to 60 cores (as in the Platinum 8490H), enhancing data sharing in server workloads. Parallel developments in RISC architectures included the MIPS R4000 (1991) with separate 8 KB instruction and data caches, and the PowerPC 601 (1993) integrating on-chip caches, influencing subsequent designs.

Parallel advancements in the ARM ISA have emphasized mobile and embedded efficiency, evolving from single-level to sophisticated multi-level caches. The StrongARM SA-110, released in 1996 by Digital Equipment Corporation (whose StrongARM line was later acquired by Intel), introduced a split L1 cache with 16 KB for instructions and 16 KB for data, enabling better pipelining in low-power devices like PDAs. Subsequent Cortex-A series processors, starting with the Cortex-A8 in 2005, adopted multi-level hierarchies, including private L1 caches per core and a shared L2 cache configurable up to 1 MB, with later models like the Cortex-A76 incorporating L3 support for improved coherence in multi-core setups. In recent developments as of 2025, Apple's M-series chips, such as the M5, integrate AI-optimized cache architectures within their unified memory system, featuring a large shared system-level cache and unified memory bandwidth exceeding 500 GB/s, tailored for AI acceleration via the Neural Engine.

The RISC-V ISA, as an open-standard alternative, has seen rapid cache integration in commercial multi-core designs. SiFive's Performance P870 series (2023 onward) employs coherent multi-core clusters with private L1 and L2 caches per core, alongside a shared L3 cache up to several MB, supporting up to 32 cores for scalable embedded and high-performance applications.

Key milestones in cache evolution across ISAs include the adoption of private L2 caches per core in early multi-core processors around 2005, as seen in Intel's Pentium D and AMD's Athlon 64 X2, with later designs adopting shared L2 caches around 2006 (e.g., the Core 2 Duo) to minimize data replication and improve inter-core communication. In 2011, Intel's Sandy Bridge microarchitecture introduced the micro-operation (μop) cache, a specialized L1-level structure holding up to about 1.5K decoded μops to bypass the decode stage for hot code paths, reducing power and latency in out-of-order execution. Cache coherency protocols like MESI (Modified, Exclusive, Shared, Invalid), standard in x86 since the Pentium era, have been essential for these multi-core designs, ensuring consistent data visibility across cores via snooping mechanisms.

Current research explores innovative cache paradigms to address emerging workloads. Approximate caching techniques, such as those in Proximity (2025), leverage similarity in queries to reuse cache entries with controlled error rates, reducing database hits by up to 50% in retrieval-augmented generation systems. Experimental optical caches, like Pho$ (2022), propose hybrid opto-electronic hierarchies using photonic interconnects for shared last-level caches, potentially slashing latency and energy in multi-core setups by integrating optical RAM cells.
