Processor (computing)
In computing and computer science, a processor or processing unit is an electrical component (digital circuit) that performs operations on an external data source, usually memory or some other data stream.[1] It typically takes the form of a microprocessor, which can be implemented on a single or a few tightly integrated metal–oxide–semiconductor integrated circuit chips.[2][3] In the past, processors were constructed using multiple individual vacuum tubes,[4][5] multiple individual transistors,[6] or multiple integrated circuits.
The term is frequently used to refer to the central processing unit (CPU), the main processor in a system.[7] However, it can also refer to other coprocessors, such as a graphics processing unit (GPU).[8]
Traditional processors are typically based on silicon; however, researchers have developed experimental processors based on alternative materials such as carbon nanotubes,[9] graphene,[10] diamond,[11] and alloys made of elements from groups three and five of the periodic table.[12] Transistors made from a single sheet of silicon atoms one atom thick, along with other 2D materials, have been researched for use in processors.[13] Quantum processors have been created; they use quantum superposition to represent bits (called qubits) instead of only an on or off state.[14][15]
Moore's law
Moore's law, named after Gordon Moore, is the observation, based on historical trend, that the number of transistors in integrated circuits, and by extension in processors, doubles approximately every two years.[16] The progress of processors has followed Moore's law closely.[17]
Types
Central processing units (CPUs) are the primary processors in most computers. They are designed to handle a wide variety of general computing tasks rather than only a few domain-specific tasks. If based on the von Neumann architecture, they contain at least a control unit (CU), an arithmetic logic unit (ALU), and processor registers. In practice, CPUs in personal computers are usually also connected, through the motherboard, to a main memory bank, hard drive or other permanent storage, and peripherals, such as a keyboard and mouse.
Graphics processing units (GPUs) are present in many computers and designed to efficiently perform computer graphics operations, including linear algebra. They are highly parallel, whereas CPUs usually perform better on tasks requiring serial processing. Although GPUs were originally intended for use in graphics, over time their application domains have expanded, and they have become an important piece of hardware for machine learning.[18]
There are several forms of processors specialized for machine learning. These fall under the category of AI accelerators (also known as neural processing units, or NPUs) and include vision processing units (VPUs) and Google's Tensor Processing Unit (TPU).
Sound chips and sound cards are used for generating and processing audio. Digital signal processors (DSPs) are designed for processing digital signals. Image signal processors are DSPs specialized for processing images in particular.
Deep learning processors, such as neural processing units, are designed for efficient deep learning computation.
Physics processing units (PPUs) are built to efficiently make physics-related calculations, particularly in video games.[19]
Field-programmable gate arrays (FPGAs) are specialized circuits that can be reconfigured for different purposes, rather than being locked into a particular application domain during manufacturing.
The Synergistic Processing Element or Unit (SPE or SPU) is a component in the Cell microprocessor.
Processors based on alternative circuit technologies have also been developed. One example is quantum processors, which exploit quantum physics to run algorithms that are infeasible on classical computers (those using traditional circuitry). Another example is photonic processors, which use light rather than semiconducting electronics to perform computations.[20] Processing is done by photodetectors sensing light produced by lasers inside the processor.[21]
See also
- Carbon nanotube computer
- Logic gate
- Processor design
- Microprocessor
- Multiprocessing
- Multiprocessor system architecture
- Multi-core processor
- Processor power dissipation
- Central processing unit
- Graphics processing unit
- Superscalar processor
- Hardware acceleration
- Von Neumann architecture
- All pages with titles containing processing unit
References
[edit]- ^ "Oxford English Dictionary". Lexico. Archived from the original on March 25, 2020. Retrieved 25 March 2020.
- ^ "Reading: The Central Processing Unit | Introduction to Computer Applications and Concepts". courses.lumenlearning.com. Retrieved 2022-01-28.
- ^ "The Silicon Engine".
- ^ Garner, Robert; Dill, Frederick (Rick) (Winter 2010). "The Legendary IBM 1401 Data Processing System" (PDF). IEEE Solid-State Circuits Magazine. 2 (1): 28–39. doi:10.1109/MSSC.2009.935295. S2CID 31608817.
- ^ "IBM100 - The IBM 700 Series". www-03.ibm.com. 2012-03-07. Archived from the original on April 3, 2012. Retrieved 2022-01-28.
- ^ "Megaprocessor". www.megaprocessor.com. Retrieved 2022-01-28.
- ^ "Oxford English Dictionary". Lexico. Archived from the original on March 25, 2020. Retrieved 25 March 2020.
- ^ Sakdhnagool, Putt (4 September 2018). "Comparative analysis of coprocessors". Concurrency and Computation: Practice and Experience. 31 (1). doi:10.1002/cpe.4756. S2CID 54473111.
- ^ Hills, Gage; Lau, Christian; Wright, Andrew; Fuller, Samuel; Bishop, Mindy D.; Srimani, Tathagata; Kanhaiya, Pritpal; Ho, Rebecca; Amer, Aya; Stein, Yosi; Murphy, Denis (2019-08-29). "Modern microprocessor built from complementary carbon nanotube transistors". Nature. 572 (7771): 595–602. Bibcode:2019Natur.572..595H. doi:10.1038/s41586-019-1493-8. ISSN 0028-0836. PMID 31462796. S2CID 201658375.
- ^ Akinwande, Deji; Huyghebaert, Cedric; Wang, Ching-Hua; Serna, Martha I.; Goossens, Stijn; Li, Lain-Jong; Wong, H.-S. Philip; Koppens, Frank H. L. (2019-09-26). "Graphene and two-dimensional materials for silicon technology". Nature. 573 (7775): 507–518. Bibcode:2019Natur.573..507A. doi:10.1038/s41586-019-1573-9. ISSN 0028-0836. PMID 31554977. S2CID 202762945.
- ^ "Using artificial intelligence to engineer materials' properties". 11 February 2019.
- ^ Riel, Heike; Wernersson, Lars-Erik; Hong, Minghwei; del Alamo, Jesús A. (August 2014). "III–V compound semiconductor transistors—from planar to nanowire structures". MRS Bulletin. 39 (8): 668–677. Bibcode:2014MRSBu..39..668R. doi:10.1557/mrs.2014.137. hdl:1721.1/99977. ISSN 0883-7694. S2CID 138353703.
- ^ Li, Ming-Yang; Su, Sheng-Kai; Wong, H.-S. Philip; Li, Lain-Jong (March 2019). "How 2D semiconductors could extend Moore's law". Nature. 567 (7747): 169–170. Bibcode:2019Natur.567..169L. doi:10.1038/d41586-019-00793-8. ISSN 0028-0836. PMID 30862924. S2CID 75136648.
- ^ "quantum computer | Description & Facts | Britannica". www.britannica.com. Retrieved 2022-01-28.
- ^ "Experimental Implementation of Fast Quantum Searching" (PDF).
- ^ "Moore's law: computer science". Britannica.com. Retrieved 2022-01-28.
- ^ "Moore's Law". www.umsl.edu. Retrieved 2022-01-28.
- ^ "CPU vs. GPU: What's the Difference?". Intel. Retrieved 2022-02-27.
- ^ "Revolution in Gaming: Physics Processing Units (PPUs) Elevate Realism with Efficient Physics-Related Calculations -PCMasters.de". PCMasters (in German). Retrieved 2023-08-10.
- ^ Sun, Chen; Wade, Mark T.; Lee, Yunsup; Orcutt, Jason S.; Alloatti, Luca; Georgas, Michael S.; Waterman, Andrew S.; Shainline, Jeffrey M.; Avizienis, Rimas R.; Lin, Sen; Moss, Benjamin R. (December 2015). "Single-chip microprocessor that communicates directly using light". Nature. 528 (7583): 534–538. Bibcode:2015Natur.528..534S. doi:10.1038/nature16454. ISSN 0028-0836. PMID 26701054. S2CID 205247044.
- ^ Yang, Sarah (2015-12-23). "Engineers demo first processor that uses light for ultrafast communications". Berkeley News. Retrieved 2022-01-28.
Processor (computing)
Fundamentals
Definition
A processor, commonly known as the central processing unit (CPU), is the primary hardware component in a computer system that executes program instructions by performing the fetch, decode, and execute operations on binary data stored in memory. This core function enables the processor to interpret and carry out sequences of commands, transforming input data into meaningful output through systematic processing.[13] As a programmable electronic circuit, the processor relies on Boolean logic gates to manipulate binary signals, facilitating arithmetic operations such as addition and subtraction, logical operations like comparisons, and control flow management to direct program execution.[14] These gates form the foundational building blocks for all computational tasks, allowing the processor to respond to instructions in a deterministic manner.[13]
The processor differs fundamentally from memory, which provides long-term storage for data and instructions, and from input/output (I/O) devices, which handle interactions with external peripherals; instead, it focuses exclusively on the manipulation and transient processing of data during execution.[13] The terminology has evolved from the broader "processor" to the specific "central processing unit" within the von Neumann architecture, where it designates the dedicated unit for arithmetic and control functions separate from memory and I/O.[15]
Role in Computer Architecture
In the von Neumann architecture, the processor serves as the central execution core, responsible for fetching instructions and data from a unified memory system and performing computations on them. This design, outlined in John von Neumann's 1945 First Draft of a Report on the EDVAC, integrates the processor with memory and input/output (I/O) devices through a shared system bus, allowing sequential access to both program instructions and data stored in the same address space. The processor communicates with memory via address and data buses to load operands and store results, while I/O operations are facilitated through dedicated interfaces or memory-mapped regions, enabling the system to process external inputs and outputs efficiently.[16][17]
A key limitation in this architecture is the processor-memory bottleneck, where the processor's high-speed computation is constrained by the slower rate of data transfer between the processor and memory hierarchies. This bottleneck arises because the shared bus creates contention for bandwidth, as the processor must repeatedly fetch instructions and data from memory, leading to idle cycles despite advances in processor clock speeds. As described in foundational analyses, this disparity has persisted, with memory access latencies growing relative to processor performance, impacting overall system throughput in data-intensive applications.[18]
In multi-processor systems, processors extend their role through configurations like symmetric multiprocessing (SMP), where multiple identical processors share a common memory space to enable parallel execution of tasks. In SMP, all processors access the same physical memory and I/O resources symmetrically, allowing the operating system to distribute workloads across cores for improved scalability and fault tolerance. This shared-memory model facilitates parallelism by synchronizing access via hardware mechanisms like cache coherence protocols, though it introduces challenges such as bus contention in highly loaded scenarios.[19]
The processor exerts control over the broader system by managing interrupts, scheduling tasks, and coordinating with the operating system to maintain efficient resource utilization. Upon receiving an interrupt signal from hardware or software, the processor saves the current process state, switches to kernel mode, and executes an interrupt handler before restoring the state to resume execution, ensuring timely responses to events like device completions or errors. In task scheduling, the processor collaborates with the operating system to select and switch between processes using queues and timers, typically every few milliseconds, to optimize CPU utilization across ready, running, and waiting states. This coordination is essential for the operating system to oversee process lifecycle management, including creation, termination, and inter-process communication through system calls.[20]
Historical Development
Early Mechanical and Electronic Processors
The development of processors began with mechanical precursors in the 19th century, most notably Charles Babbage's Analytical Engine, conceived around 1834 and detailed in plans by 1837. This conceptual design represented an early vision of a programmable processor, featuring a "mill" that functioned as an arithmetic logic unit equivalent for performing operations like addition, subtraction, multiplication, and division, and a "store" that served as memory to hold numbers and intermediate results.[21] The engine was intended to be programmed using punched cards inspired by the Jacquard loom, allowing for conditional branching and loops, though it was never fully built due to funding and technological constraints of the era.[21]
The transition to electronic processors occurred during World War II with the advent of vacuum tube technology. The Colossus, developed by British engineer Tommy Flowers and operational by December 1943, was the first programmable electronic digital computer, designed specifically for cryptanalysis of German Lorenz ciphers.[22] It employed approximately 2,500 vacuum tubes to perform logic operations on punched paper tape inputs, enabling rapid statistical analysis and pattern recognition in encrypted messages, with ten such machines ultimately built at Bletchley Park.[23] Following this, the ENIAC (Electronic Numerical Integrator and Computer), completed in 1945 by John Presper Eckert and John Mauchly at the University of Pennsylvania, became the first general-purpose electronic computer.[24] ENIAC used over 17,000 vacuum tubes for programmable arithmetic and logic functions, reconfigured via switches and patch cables for tasks like ballistic calculations, marking a shift toward versatile electronic processing.[25]
A pivotal advancement came with the stored-program concept, formalized in John von Neumann's 1945 "First Draft of a Report on the EDVAC," which proposed that instructions and data be stored interchangeably in the same electronic memory, enabling processors to modify their own programs.[26] This architecture influenced subsequent designs, culminating in the EDSAC (Electronic Delay Storage Automatic Calculator), completed in 1949 by Maurice Wilkes at the University of Cambridge as the first practical full-scale stored-program computer.[27] EDSAC utilized vacuum tubes and mercury delay lines for memory, running its initial program on May 6, 1949, and providing reliable computing services for scientific calculations.[28]
Early electronic processors faced significant limitations inherent to vacuum tube technology, including bulky designs that occupied entire rooms—ENIAC alone weighed over 27 metric tons and spanned 1,800 square feet.[29] High power consumption was another challenge, with ENIAC requiring about 150 kilowatts to operate its filaments and circuits, generating substantial heat that necessitated extensive cooling systems.[29] Reliability issues were paramount, as tubes frequently failed due to filament burnout or overload, leading to frequent downtime; for instance, ENIAC experienced about one tube failure every two days on average, demanding constant maintenance by teams of technicians.[30] These constraints highlighted the need for more stable components in future processor evolution.
Transition to Integrated Circuits
The invention of the transistor at Bell Laboratories in 1947 represented a fundamental advancement in computing hardware, replacing the fragile and power-hungry vacuum tubes used in early electronic processors with more compact and reliable solid-state devices.[31] John Bardeen and Walter Brattain developed the point-contact transistor, a germanium-based device capable of amplifying signals and performing basic logic functions, which dramatically reduced size, heat generation, and failure rates in electronic circuits.[32] This innovation, demonstrated on December 23, 1947, under the leadership of William Shockley, paved the way for semiconductor-based computing by enabling the miniaturization of logic gates and amplifiers essential to processors.[33]
The next major leap came with the development of the integrated circuit (IC), which integrated multiple transistors, resistors, and other components onto a single chip, further accelerating the transition from discrete wiring to compact semiconductor designs. In September 1958, Jack Kilby at Texas Instruments created the first IC prototype—a hybrid device on a germanium wafer that proved all circuit elements could be fabricated monolithically, eliminating the need for individual component assembly.[34] Building on this, Robert Noyce at Fairchild Semiconductor patented the planar IC in 1959, using silicon and photolithographic techniques to interconnect components on a flat surface, which allowed for scalable manufacturing and higher reliability in production.[35] These ICs reduced circuit complexity from thousands of discrete parts to dozens or hundreds on a chip, enabling the economic feasibility of complex processors.[36]
The realization of fully IC-based processors arrived with the microprocessor era, exemplified by the Intel 4004, released in November 1971 as the world's first complete central processing unit on a single chip.[37] This 4-bit device, containing about 2,300 transistors fabricated on a 10-micrometer process, was originally commissioned for Busicom's programmable calculators but demonstrated the viability of a standalone CPU for general-purpose computing tasks like arithmetic and control.[38] By integrating the arithmetic logic unit, control unit, and registers onto one die, the 4004 slashed costs and power requirements compared to prior multi-chip systems, operating at 740 kHz and supporting 92 instructions.[39]
This paved the way for the microprocessor revolution, particularly with the advent of more powerful 8-bit designs like the Intel 8080 in 1974, which became a cornerstone for personal computing by enabling affordable, standalone CPU chips in consumer devices.[40] Featuring 6,000 transistors and running at up to 2 MHz with enhanced instruction sets and memory addressing, the 8080 powered influential systems such as the Altair 8800, the first commercially successful personal computer kit that sparked widespread hobbyist interest and the home computer boom.[41] Its design improvements, including better interrupt handling and direct memory access, facilitated the shift toward versatile, user-buildable computers, transforming processors from specialized industrial components into accessible tools for innovation.[42]
Core Components
Arithmetic Logic Unit
The Arithmetic Logic Unit (ALU) serves as the computational core of a processor, executing arithmetic and logical operations on binary data supplied by the processor's registers.[1] It handles fundamental tasks essential to program execution, enabling the manipulation of integers and binary values within the CPU.[43]
The ALU performs a range of arithmetic operations, including addition, subtraction, multiplication, and division, as well as logical operations such as AND, OR, NOT, and XOR.[43] Arithmetic operations process numerical data, while logical operations manipulate bits for comparisons, masking, or bitwise computations.[1] These functions are selected via control signals that route operands through specific circuit paths.[44]
In design, the ALU is constructed from combinational logic circuits, which produce outputs solely based on current inputs without memory elements, using basic logic gates like AND, OR, and XOR.[43] The unit's width matches the processor's word size, such as 32 bits or 64 bits, allowing parallel processing of entire words for efficiency.[5] This parallel structure minimizes latency in operations on multi-bit data.
A key example of ALU arithmetic is binary addition, implemented through a chain of full adder circuits.[45] Each full adder handles one bit position, taking two operand bits (a and b) and a carry-in (cin), to produce a sum bit and a carry-out. The equations for a full adder are sum = a XOR b XOR cin and carry-out = (a AND b) OR (cin AND (a XOR b)). Carry propagation occurs as the carry-out from one full adder becomes the carry-in for the next higher bit, rippling through the word until resolution.[45] Subtraction is typically achieved by adding the two's complement of the subtrahend, leveraging the same adder hardware.[46]
For more complex arithmetic involving non-integer representations, floating-point units (FPUs) extend ALU capabilities as dedicated hardware or coprocessors. FPUs handle operations on binary floating-point numbers, often following the IEEE 754 standard, including addition, multiplication, and division with exponent and mantissa normalization.[47]
In integration, the ALU fetches two or more operands from the processor's registers, performs the selected operation, and stores the result back into a register, all under control signals that dictate the function and data flow.[5] This tight coupling ensures efficient data processing without external intervention.[1]
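The binary addition described above can be modeled in a few lines of software. The following Python sketch is an illustrative model only (the 4-bit width and list-based bit representation are assumptions, not drawn from the cited sources); it chains the full-adder equations so that each stage's carry-out feeds the next stage's carry-in:

```python
# Minimal sketch of a 4-bit ripple-carry adder built from the full-adder
# equations: sum = a XOR b XOR cin, carry-out = (a AND b) OR (cin AND (a XOR b)).

def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One bit position: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a_bits: list[int], b_bits: list[int]) -> tuple[list[int], int]:
    """Add two equal-width bit vectors (least significant bit first).
    The carry-out of each stage becomes the carry-in of the next."""
    carry = 0
    result = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        result.append(s)
    return result, carry

# Example: 0b0111 (7) + 0b0101 (5) = 0b1100 (12), bits listed LSB first.
bits, carry_out = ripple_carry_add([1, 1, 1, 0], [1, 0, 1, 0])
assert bits == [0, 0, 1, 1] and carry_out == 0
```

The same adder can serve subtraction by feeding it the two's complement of the subtrahend, mirroring the hardware reuse described above.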
Control Unit
The control unit (CU) is a critical component of the central processing unit (CPU) responsible for directing the processor's operations by interpreting instructions and orchestrating the flow of data and control signals. It fetches and decodes instructions from memory, then generates a sequence of control signals that enable the arithmetic logic unit (ALU) to perform computations, facilitate data transfers to and from registers, and manage interactions with memory subsystems. This coordination ensures that each instruction is executed correctly and efficiently, serving as the "conductor" of the processor's internal activities.[48]
Control units are implemented in two main types: hardwired and microprogrammed, each suited to different design priorities. A hardwired control unit employs fixed combinational and sequential logic circuits—such as gates, flip-flops, and decoders—directly wired to produce control signals based on the instruction opcode and processor state. This approach delivers high performance due to its speed, as signals are generated without intermediate storage lookups, making it ideal for reduced instruction set computing (RISC) architectures with simpler, uniform instructions. However, its rigidity complicates modifications, requiring hardware redesign for changes in instruction sets or functionality. For instance, in a basic RISC processor, combinational circuits might directly map an ADD opcode to signals activating the ALU's adder while selecting input registers.[49]
In contrast, a microprogrammed control unit uses a stored set of microinstructions, or microcode, held in a read-only memory (ROM) or similar control store, to define the control signals. The CU sequences through these microinstructions to execute an instruction, offering flexibility for complex instruction set computing (CISC) architectures where instructions vary widely in complexity. Microcode allows easier implementation of intricate operations and firmware updates without full hardware redesign, though it incurs overhead from ROM access, resulting in slower signal generation compared to hardwired designs. An example in CISC processors involves a microcode ROM that breaks down a multiply instruction into a loop of microoperations stored as addresses in the control store.[50][51]
At the core of both types are microoperations, the elementary steps that decompose a machine instruction into primitive actions executable by the hardware. These include register transfers (e.g., moving data from a general-purpose register to the ALU input), ALU operations (e.g., performing an add or logical AND), and memory accesses (e.g., reading the instruction at the address held in the program counter). For a typical load instruction, microoperations might sequence as: fetch the effective address into a temporary register, issue a memory read signal, and transfer the retrieved data to the destination register. This breakdown enables precise control over timing and resource usage, with the CU ensuring microoperations occur in the correct order synchronized to the clock cycle. These signals ultimately trigger ALU execution when computational steps are required.[52][53]
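As a rough illustration of how a load instruction decomposes into microoperations, the sketch below models the register transfers in software. All register and memory names (MAR, MDR, R1, the address 0x10) are hypothetical, chosen only to mirror the sequence described above:

```python
# Minimal sketch of the microoperation sequence for a hypothetical
# "LOAD R1, [0x10]" instruction. Names are illustrative, not those of
# any real control unit or instruction set.

memory = {0x10: 42}                      # tiny data memory
regs = {"R1": 0, "MAR": 0, "MDR": 0}     # destination and temporary registers

def load_microops(address: int, dest: str) -> None:
    # t0: transfer the effective address into the memory address register
    regs["MAR"] = address
    # t1: assert a memory read; the returned word lands in the data register
    regs["MDR"] = memory[regs["MAR"]]
    # t2: transfer the fetched word into the destination register
    regs[dest] = regs["MDR"]

load_microops(0x10, "R1")
assert regs["R1"] == 42
```

Each commented step corresponds to one clock-synchronized microoperation that the control unit would trigger via control signals.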
Registers and Cache
Registers in a processor serve as small, high-speed storage locations within the CPU core, designed to hold operands, addresses, and intermediate results for rapid access during instruction execution.[54] These registers are typically implemented using flip-flops or latches and are the fastest form of on-chip memory, with access times on the order of a single clock cycle.[55] Common types include the program counter (PC), which stores the memory address of the next instruction to be fetched; the accumulator, a special register that holds one operand for arithmetic operations in simpler architectures; and general-purpose registers (GPRs), which store data for versatile use in computations and data movement.[56][57] In modern CPUs, the number of architectural registers visible to programmers ranges from 16 to 128, though physical registers may exceed this due to advanced techniques.[54][5]
The register file, which collectively manages these registers, is organized to support multiple simultaneous read and write operations, often featuring 4-8 read ports and 2-4 write ports in superscalar processors to handle parallel instruction dispatch.[58] To mitigate data hazards in out-of-order execution, register renaming dynamically maps architectural registers to a larger pool of physical registers, preventing false dependencies like write-after-read (WAR) and write-after-write (WAW) hazards.[59] This technique, pioneered in designs like the MIPS R10000, uses structures such as reorder buffers or physical register files to track and allocate registers efficiently.[60][61]
Cache memory complements registers by providing a larger, yet still fast, on-chip storage hierarchy using static RAM (SRAM) to temporarily hold frequently accessed data and instructions, thereby reducing the latency penalty of fetching from slower main memory.[62] Modern processors typically feature multi-level caches, with L1 caches (split into instruction and data subsets) closest to the core at 32-64 KB per subset, offering 1-4 cycle access; L2 caches at 256 KB to 2 MB with 10-20 cycle latency; and optional L3 caches shared across cores up to 128 MB or more.[63][64][65]
Cache organization employs associativity to balance speed and hit rates: direct-mapped caches map each block to a single line for simplicity and low latency, while set-associative designs (2- to 16-way common in L1/L2) allow blocks to reside in multiple lines per set, improving flexibility at the cost of comparison logic.[66] Replacement policies, such as least recently used (LRU), determine which line to evict on a miss; LRU approximates temporal locality by prioritizing recently accessed data, though approximations like pseudo-LRU are used in high-associativity caches to reduce hardware overhead.[67][68]
In the overall memory hierarchy, registers function as level 0 storage with capacities of tens to hundreds of bytes, directly feeding data to the arithmetic logic unit for operations, while caches form levels 1 through 3, bridging to off-chip DRAM.[69] Cache performance hinges on hit rates, typically 90-99% for L1 in well-optimized workloads, where a miss can incur 10-100x the latency of a hit, underscoring the hierarchy's role in sustaining processor throughput.[70]
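The set-associative organization and LRU replacement described above can be sketched as a small simulation. The parameters below (four sets, two ways, 16-byte blocks) are arbitrary illustrative choices rather than figures from the cited sources:

```python
# Minimal sketch of a 2-way set-associative cache with LRU replacement.
from collections import OrderedDict

NUM_SETS = 4        # index bits select one of these sets
WAYS = 2            # associativity: lines per set
BLOCK_SIZE = 16     # bytes per cache line

# Each set maps tag -> line; insertion/access order tracks recency for LRU.
sets = [OrderedDict() for _ in range(NUM_SETS)]

def access(address: int) -> str:
    block = address // BLOCK_SIZE
    index = block % NUM_SETS            # which set the block maps to
    tag = block // NUM_SETS             # identifies the block within that set
    lines = sets[index]
    if tag in lines:
        lines.move_to_end(tag)          # hit: mark line most recently used
        return "hit"
    if len(lines) >= WAYS:              # miss with a full set: evict LRU line
        lines.popitem(last=False)
    lines[tag] = None                   # fill the line (data omitted in sketch)
    return "miss"

# Two addresses in the same block hit after the first miss brings it in.
print(access(0x40), access(0x44), access(0x200), access(0x40))
# -> miss hit miss hit
```

A direct-mapped cache corresponds to WAYS = 1, where each block has exactly one candidate line and no recency tracking is needed.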
Operating Principles
Instruction Cycle
The instruction cycle, also known as the fetch-decode-execute cycle, represents the fundamental process by which a processor executes program instructions in a sequential manner. This cycle is central to the von Neumann architecture, where the processor repeatedly retrieves, interprets, and carries out instructions stored in memory alongside data.[71] The cycle ensures orderly program execution by maintaining a program counter (PC) that points to the address of the next instruction to be processed.[71]
The cycle consists of four primary stages: fetch, decode, execute, and an optional write-back (or store). In the fetch stage, the control unit uses the PC value to send an address over the address bus to memory, issues a read command via the control bus, and retrieves the instruction, which is then loaded into the instruction register (IR). The PC is subsequently incremented to point to the next instruction address.[71] During the decode stage, the control unit analyzes the instruction in the IR to identify the opcode (specifying the operation) and any operands (such as register numbers or memory addresses), while also fetching required data from registers or memory as needed.[71] The execute stage involves the arithmetic logic unit (ALU) performing the specified operation, such as an arithmetic computation or a branch decision, using the decoded operands.[71] If the instruction produces a result that must be stored (e.g., in memory rather than just a register), the optional write-back stage occurs, where the control unit writes the output via the data bus to the appropriate destination using a write command.[71]
A key limitation of the instruction cycle in von Neumann architectures is the von Neumann bottleneck, arising from the shared bus for fetching both instructions and data from the same memory unit. This sequential access constrains throughput, as the processor must wait for memory reads and writes in each cycle, potentially slowing overall performance despite advances in internal processing speed.[72]
Interrupt handling integrates into the instruction cycle to manage external or internal events requiring immediate attention, such as hardware signals or errors. After completing the execute stage of the current instruction, the processor checks for pending interrupts; if one is detected, the cycle suspends normal execution, saves the current PC and processor status word (PSW) onto a control stack or in registers, and transfers control to an interrupt service routine (ISR).[73] Upon ISR completion, the saved state is restored, allowing the original cycle to resume from the interrupted point.[73] This mechanism ensures that high-priority events do not permanently disrupt program flow.
To illustrate, consider a simple ADD instruction like ADD R1, R2, R3, which adds the contents of registers R2 and R3 and stores the result in R1.
In the fetch stage, assuming the PC holds address 0x00400000, the processor retrieves the 32-bit instruction (e.g., binary 0000 0000 0100 0011 0000 1000 0010 0000) from memory and loads it into the IR, then increments the PC to 0x00400004.[74] During decode, the control unit parses the opcode to recognize the addition operation and identifies R2, R3 as source registers and R1 as the destination, preparing the ALU accordingly.[74] In execute, the ALU computes the sum—for instance, if R2 holds 7 (0000 0000 0000 0111) and R3 holds 5 (0000 0000 0000 0101), the result is 12 (0000 0000 0000 1100)—which is written to R1 in the write-back stage if required.[74] The cycle then repeats for the next instruction.
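The walkthrough above can be condensed into a toy software model of the cycle. The sketch below is a simplified, hypothetical machine (instructions are stored pre-decoded as tuples rather than as encoded 32-bit words), not an implementation of any real instruction set:

```python
# Minimal sketch of one pass through the fetch-decode-execute-write-back cycle
# for a toy machine. Register names, addresses, and the instruction format
# are illustrative only.

registers = {"R1": 0, "R2": 7, "R3": 5}
# Program memory: address -> instruction tuple (op, dest, src1, src2)
memory = {0x00400000: ("ADD", "R1", "R2", "R3")}
pc = 0x00400000

def step():
    global pc
    # Fetch: read the instruction at the PC, then advance the PC.
    instruction = memory[pc]
    pc += 4
    # Decode: split the instruction into opcode and operands.
    op, dest, src1, src2 = instruction
    # Execute: the ALU performs the operation on the source registers.
    if op == "ADD":
        result = registers[src1] + registers[src2]
    else:
        raise NotImplementedError(op)
    # Write back: store the result in the destination register.
    registers[dest] = result

step()
assert registers["R1"] == 12 and pc == 0x00400004
```

Running step in a loop, with branches allowed to overwrite the PC, would reproduce the repeating cycle described above.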
Pipelining and Superscalar Execution
Pipelining is a fundamental technique in modern processors that enhances throughput by overlapping the execution of multiple instructions, dividing the instruction processing into sequential stages that can operate concurrently on different instructions.[75] In a classic five-stage pipeline, these stages typically include Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB), allowing the processor to potentially complete one instruction per clock cycle once the pipeline is full.[76] The ideal speedup from pipelining equals the number of stages, assuming no interruptions, as each stage handles a portion of the work for successive instructions simultaneously.[75]
However, pipelining introduces hazards that can disrupt this overlap and reduce performance. Structural hazards occur when multiple instructions require the same hardware resource simultaneously, such as competing for the memory unit in the IF and MEM stages.[75] Data hazards arise from dependencies between instructions, including read-after-write (RAW) where a later instruction reads a register before an earlier one writes it, write-after-read (WAR), or write-after-write (WAW).[75] Control hazards stem from branches or jumps that alter the instruction flow, potentially rendering subsequent fetched instructions invalid.[75]
To resolve these hazards, processors employ techniques like forwarding (also known as bypassing), which routes data directly from the output of one stage to the input of an earlier stage to avoid delays from data dependencies.[77] Stalling inserts pipeline bubbles or no-operation cycles to pause dependent instructions until hazards clear, ensuring correctness at the cost of throughput.[77] For control hazards, branch prediction uses static methods (compiler-determined) or dynamic methods (hardware-based, such as two-bit predictors) to anticipate branch outcomes and continue fetching instructions speculatively, minimizing stalls from mispredictions.[77]
Superscalar execution builds on pipelining by incorporating multiple parallel execution units, enabling the processor to issue and execute multiple independent instructions per clock cycle, often termed dual-issue or wider for more units.[78] This approach exploits instruction-level parallelism (ILP) within the basic block of code, requiring instruction scheduling to identify non-dependent operations for concurrent dispatch.[78] To handle dependencies in superscalar designs, out-of-order execution decouples instruction issue from completion order, using mechanisms like reservation stations to buffer instructions awaiting operands and a common data bus for broadcasting results.[79]
The MIPS R4000 exemplifies advanced pipelining, featuring an eight-stage integer pipeline that refines the classic stages for higher clock speeds and reduced latency, while supporting load/store parallelism through separate execution paths.[80] Its design incorporates hazard detection logic for forwarding and stalling, alongside dynamic branch prediction to sustain throughput in pipelined operations.[80]
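A minimal way to see the effect of forwarding on a read-after-write dependence is the comparison below. The two-instruction program and the stall counts assume the classic five-stage pipeline in which the register file is written in the first half of a cycle and read in the second half; the encoding and latencies are purely illustrative:

```python
# Minimal sketch of RAW-hazard detection between two back-to-back ALU
# instructions, comparing stall cycles without and with forwarding.

# Each instruction: (destination_register, [source_registers])
program = [
    ("R1", ["R2", "R3"]),   # ADD R1, R2, R3
    ("R4", ["R1", "R5"]),   # SUB R4, R1, R5  (reads R1 -> RAW dependence)
]

def stall_cycles(producer_dest: str, consumer_srcs: list[str], forwarding: bool) -> int:
    if producer_dest not in consumer_srcs:
        return 0                 # no dependence, no stall
    if forwarding:
        # EX-to-EX bypass feeds the ALU result straight to the next
        # instruction, so no bubbles are needed in this simple model.
        return 0
    # Without forwarding, the consumer waits in ID until the producer's
    # write-back completes: two bubble cycles under the split-cycle
    # register-file assumption stated above.
    return 2

dest, _ = program[0]
_, srcs = program[1]
print("stalls without forwarding:", stall_cycles(dest, srcs, forwarding=False))
print("stalls with forwarding:   ", stall_cycles(dest, srcs, forwarding=True))
```

A load followed immediately by a dependent ALU instruction would still require one stall even with forwarding, since the loaded value is not available until the end of the MEM stage.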
Processor Types
Central Processing Units
A central processing unit (CPU) is the primary component of a computer system responsible for executing instructions from programs by fetching, decoding, and performing operations on data. It serves as a versatile, general-purpose processor capable of running operating systems, applications, and diverse software tasks, adhering to standardized instruction set architectures (ISAs) such as x86, developed by Intel, or ARM, which emphasize efficiency in both high-performance and low-power scenarios.[81][82] This design enables the CPU to handle sequential and parallel computations in von Neumann architectures, where instructions and data share a common memory space, a model that has dominated computing systems since the mid-20th century.[83][84]
The evolution of CPUs has progressed from single-core designs in the early microprocessor era to multi-core architectures to address performance limitations imposed by power and thermal constraints. By integrating multiple processing cores on a single chip, modern CPUs achieve higher throughput through parallelism, as exemplified by Intel's Core i7 processors, which in their 14th generation feature up to 20 cores combining performance and efficiency variants.[85] Technologies like Intel Hyper-Threading further enhance this by allowing each physical core to execute multiple threads simultaneously, effectively simulating additional virtual cores and improving resource utilization in multithreaded workloads.[86][87]
Key features of CPUs include support for virtual memory, facilitated by a memory management unit (MMU) that translates virtual addresses used by software into physical addresses, enabling processes to operate in isolated, expansive address spaces larger than available physical RAM.[88] Additionally, single instruction, multiple data (SIMD) instructions, such as Intel's Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX), allow CPUs to process multiple data elements in parallel within a single instruction, accelerating tasks like multimedia processing and scientific simulations.[89]
CPUs find broad applications across desktops for personal computing, servers for data center workloads, and embedded systems in devices like smartphones and IoT gadgets, where their general-purpose nature supports everything from real-time control to complex simulations.[90][91] While primarily general-purpose, CPU designs occasionally incorporate specialized variants tailored for niche domains, though these remain rooted in core ISA principles.
Specialized Processors
Specialized processors are designed for specific computational tasks, optimizing hardware architecture, power efficiency, and performance for domains such as graphics rendering, signal processing, embedded control, and artificial intelligence, rather than general-purpose computing.[92]
Graphics Processing Units (GPUs) feature thousands of smaller cores organized into streaming multiprocessors, enabling massive parallelism for tasks like 3D rendering and scientific simulations.[93] They employ a Single Instruction, Multiple Threads (SIMT) architecture, where groups of threads execute the same instruction on different data, facilitating efficient handling of parallel workloads.[93] NVIDIA's CUDA platform extends GPU capabilities beyond graphics to general-purpose computing, allowing developers to program these cores for accelerated matrix operations and data-parallel algorithms.[94]
Digital Signal Processors (DSPs) specialize in real-time processing of digital signals, such as audio and video, through hardware optimized for mathematical operations on streams of data.[95] They typically use fixed-point arithmetic to manage precision and speed in resource-constrained environments, avoiding the overhead of floating-point units.[95] The Texas Instruments TMS320 series, for instance, incorporates dedicated multiply-accumulate (MAC) units that perform multiplication followed by addition in a single cycle, ideal for filtering and Fourier transforms in signal processing applications.[95]
Microcontrollers integrate a processor core with memory, peripherals, and interfaces on a single chip, tailored for embedded systems requiring deterministic control and low power consumption.[96] The ARM Cortex-M family employs a Reduced Instruction Set Computing (RISC) design, emphasizing simplicity and efficiency to achieve ultra-low power operation in battery-powered devices like sensors and appliances.[96] For example, the Cortex-M4 supports digital signal processing extensions, including saturating arithmetic, while maintaining compatibility with real-time operating systems for tasks such as motor control and user interfaces.[96]
Application-Specific Integrated Circuits (ASICs) represent custom processors fabricated for particular functions, offering superior efficiency and density compared to programmable alternatives by hardwiring logic for targeted operations.[92] Google's Tensor Processing Units (TPUs) exemplify AI-focused ASICs, with systolic array architectures optimized for matrix multiplications central to neural network training and inference.[97] TPUs integrate high-throughput tensor cores and vector units, delivering specialized performance for deep learning workloads while minimizing energy use relative to general processors.[97]
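The multiply-accumulate pattern that DSP hardware executes at one product per cycle can be sketched in software as a short FIR filter loop; the coefficients and samples below are arbitrary illustrative values, and the code is a model of the operation rather than DSP assembly:

```python
# Minimal sketch of the multiply-accumulate (MAC) pattern used by DSPs:
# each output sample of a FIR filter is a sum of coefficient * sample
# products, which a hardware MAC unit computes one product per cycle.

def fir_filter(samples: list[float], coefficients: list[float]) -> list[float]:
    taps = len(coefficients)
    outputs = []
    for n in range(taps - 1, len(samples)):
        acc = 0.0
        for k in range(taps):
            # One MAC step: multiply, then add into the accumulator.
            acc += coefficients[k] * samples[n - k]
        outputs.append(acc)
    return outputs

# 3-tap moving-average filter over a short signal.
print(fir_filter([1.0, 2.0, 3.0, 4.0], [1/3, 1/3, 1/3]))  # [2.0, 3.0]
```

A fixed-point DSP would perform the same loop with integer arithmetic and saturation rather than Python floats, trading dynamic range for speed and power, as described above.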
Performance and Trends
Key Metrics
Clock speed, also known as clock rate or clock frequency, refers to the rate at which a processor's clock generates pulses to synchronize its operations, measured in gigahertz (GHz), where 1 GHz equals one billion cycles per second.[98] Higher clock speeds enable a processor to potentially execute more instructions per second (IPS), as each cycle typically advances the execution of instructions, but this metric alone does not fully indicate performance due to variations in how efficiently instructions are processed.[99]
Instructions per cycle (IPC), the average number of instructions a processor completes in one clock cycle, serves as a key measure of architectural efficiency, with higher IPC values reflecting better utilization of hardware resources such as pipelines and multiple execution units.[100] IPC is influenced by factors like instruction complexity and parallelism, where modern processors can achieve IPC greater than 1 through techniques like superscalar execution, though actual values vary by workload.[101]
Benchmarks provide standardized ways to compare processor performance across systems. The SPEC CPU 2017 suite, developed by the Standard Performance Evaluation Corporation, includes integer and floating-point workloads to evaluate compute-intensive tasks, measuring metrics like execution time for single tasks (SPECspeed) and throughput (SPECrate) to assess processor, memory, and compiler efficiency.[102] Geekbench, a cross-platform tool, scores single-core and multi-core CPU performance based on real-world tasks such as image processing and machine learning, with results calibrated against a baseline for comparability.[103] For floating-point throughput, floating-point operations per second (FLOPS) quantifies the number of arithmetic calculations on real numbers a processor can perform in one second, often used to gauge suitability for scientific computing.[104]
Power efficiency metrics address the balance between performance and energy use, critical for mobile and data center applications. Performance per watt measures computational output (e.g., instructions or benchmarks) relative to power consumption, highlighting processors that deliver high throughput with lower energy draw.[105] Thermal design power (TDP), specified by manufacturers like Intel, represents the maximum heat output a processor is expected to generate under typical loads, guiding cooling requirements and indirectly indicating power limits for sustained efficiency.[106]
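The relationship between these metrics can be shown with a small back-of-the-envelope calculation; the clock rate, IPC, and power figures below are made-up illustrative values, not measurements of any particular processor:

```python
# Minimal sketch relating the metrics above: instruction throughput is
# roughly clock frequency times IPC, and performance per watt divides
# that throughput by sustained power draw.

clock_hz = 3.5e9     # assumed 3.5 GHz clock
ipc = 2.0            # assumed average instructions completed per cycle
power_watts = 65.0   # assumed sustained package power

instructions_per_second = clock_hz * ipc
performance_per_watt = instructions_per_second / power_watts

print(f"{instructions_per_second:.2e} instructions/s")   # 7.00e+09
print(f"{performance_per_watt:.2e} instructions/s per W")
```

The same arithmetic explains why a lower-clocked design with higher IPC, or the same throughput at lower power, can still come out ahead on the efficiency metrics discussed above.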
Moore's Law and Scaling Limits
Moore's Law, formulated by Intel co-founder Gordon Moore in 1965, observed that the number of components on an integrated circuit was doubling annually, based on trends from 1959 to 1964, and predicted this pace would continue for at least a decade, enabling chips with up to 65,000 components by 1975. In a 1975 update, Moore revised the doubling interval to approximately every two years, reflecting adjustments to manufacturing realities while maintaining the exponential growth trajectory. This empirical observation became a self-fulfilling prophecy, guiding semiconductor research and investment to sustain transistor density increases, which correlated with performance improvements until the 2010s.[107][108]
The law's impact is evident in the evolution of processor transistor counts, from the Intel 4004 microprocessor in 1971, which integrated 2,300 transistors on a 10-micrometer process, to processors in the 2020s featuring tens of billions, such as Apple's M3 Ultra chip with 184 billion transistors fabricated on a 3-nanometer process (as of March 2025).[109][110] This scaling enabled dramatic reductions in size, power consumption, and cost per transistor, transforming computing from room-sized machines to portable devices and fueling advancements in personal electronics, data centers, and artificial intelligence. By the early 2020s, annual transistor density improvements had slowed to about 30-40% in leading-edge nodes, yet the cumulative effect had increased integration by over seven orders of magnitude since the 1970s; in 2025, production of 2 nm nodes began, offering further refinements.[111]
Physical limits to continued scaling emerged prominently in the 2000s, as transistor dimensions approached atomic scales. Quantum tunneling, where electrons leak through insulating barriers, becomes significant below 7 nanometers, increasing leakage currents and degrading switching efficiency in metal-oxide-semiconductor field-effect transistors (MOSFETs). Concurrently, Dennard scaling—originally proposed in 1974 to predict constant power density with linear dimension reductions—broke down around 2005 due to subthreshold leakage and the inability to proportionally reduce supply voltage without compromising performance, leading to rising heat dissipation per unit area. These thermal challenges, exacerbated by non-scaling interconnect delays, shifted focus from single-core clock speed increases to multi-core parallelism for performance gains.
Economic barriers further constrain scaling, as the cost of advanced lithography tools and fabrication facilities escalates exponentially with each process node. For instance, building a leading-edge semiconductor fab cost about $200 million in 1983 but exceeded $10 billion by 2022, driven by the complexity of extreme ultraviolet (EUV) lithography systems that cost hundreds of millions per unit and require yields above 90% to be viable. These rising capital expenditures, combined with diminishing returns on density improvements, have prompted industry consolidation and a reevaluation of traditional planar scaling economics.
To extend processor capabilities beyond conventional Moore's Law, researchers are exploring three-dimensional (3D) stacking, which integrates multiple chip layers vertically to boost density without further shrinking transistors, achieving up to 50% area reduction in logic circuits as demonstrated in experimental complementary metal-oxide-semiconductor (CMOS) stacks.
Neuromorphic designs, inspired by neural architectures, offer energy-efficient alternatives; IBM's TrueNorth chip, for example, simulates 1 million neurons and 256 million synapses using 5.4 billion transistors in a 28-nanometer process, consuming only 70 milliwatts for brain-like pattern recognition tasks. Optical computing paradigms, leveraging photons for interconnects and logic, promise reduced latency and power for data-intensive applications, with integrated photonic platforms enabling non-von Neumann architectures that perform matrix operations at speeds beyond electronic limits.[112][113][114]
References
- https://en.wikichip.org/wiki/intel/mcs-4/4004
