Hardware for artificial intelligence
from Wikipedia

Specialized computer hardware is often used to execute artificial intelligence (AI) programs faster and with less energy; examples include Lisp machines, neuromorphic engineering, event cameras, and physical neural networks. Since 2017, several consumer-grade CPUs and SoCs have included on-die NPUs. As of 2023, the market for AI hardware is dominated by GPUs.[1]

As of the 2020s, AI computation is dominated by graphics processing units (GPUs) and newer domain-specific accelerators such as Google’s Tensor Processing Units (TPUs), AMD’s Instinct MI300 series, and various on-device neural-processing units (NPUs) found in consumer hardware.[2][3]

Scope


For the purposes of this article, AI hardware refers to computing components and systems specifically designed or optimized to accelerate artificial-intelligence workloads such as machine-learning training or inference. This includes general-purpose accelerators used for AI (for example, GPUs) and domain-specific accelerators (for example, TPUs, NPUs, and other AI ASICs).[4]

Event-based cameras are sometimes discussed in the context of neuromorphic computing, but they are input sensors rather than AI compute devices. Conversely, components such as memristors are basic circuit elements rather than specialized AI hardware when considered alone.[5][6]

Lisp machines


Lisp machines were developed in the late 1970s and early 1980s to make artificial intelligence programs written in the programming language Lisp run faster.

Dataflow architecture


Dataflow architecture processors used for AI serve various purposes with varied implementations like the polymorphic dataflow[7] Convolution Engine[8] by Kinara (formerly Deep Vision), structure-driven dataflow by Hailo,[9] and dataflow scheduling by Cerebras.[10]

Component hardware


AI accelerators


Since the 2010s, advances in computer hardware have led to more efficient methods for training deep neural networks that contain many layers of non-linear hidden units and a very large output layer.[11] By 2019, graphics processing units (GPUs), often with AI-specific enhancements, had displaced central processing units (CPUs) as the dominant means of training large-scale commercial cloud AI.[12] OpenAI estimated the hardware compute used in the largest deep-learning projects from AlexNet (2012) to AlphaZero (2017) and found a 300,000-fold increase in the amount of compute needed, with a doubling-time trend of 3.4 months.[13][14]
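The two figures quoted above are mutually consistent, which a short back-of-envelope check makes visible: a 300,000-fold growth at a 3.4-month doubling time implies roughly five years of sustained doubling, matching the 2012–2017 span.

```python
import math

# How many months of 3.4-month doublings does a 300,000x increase imply?
factor = 300_000
doubling_time_months = 3.4
months = math.log2(factor) * doubling_time_months
print(round(months, 1))  # ~61.9 months, i.e. roughly the 2012-2017 span
```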

General-purpose GPUs for AI


Since the 2010s, graphics processing units (GPUs) have been widely used to train and deploy deep learning models because of their highly parallel architecture and high memory bandwidth. Modern data-center GPUs include dedicated tensor or matrix-math units that accelerate neural-network operations.

In 2022, NVIDIA introduced the Hopper-generation H100 GPU, adding FP8 precision support and faster interconnects for large-scale model training.[15] AMD and other vendors have also developed GPUs and accelerators aimed at AI and high-performance computing workloads.[16]

Domain-specific accelerators (ASICs / NPUs)


Beyond general-purpose GPUs, several companies have developed application-specific integrated circuits (ASICs) and neural processing units (NPUs) tailored for AI workloads. Google introduced the Tensor Processing Unit (TPU) in 2016 for deep-learning inference, with later generations supporting large-scale training through dense systolic-array designs and optical interconnects.[17] Other vendors have released similar devices—such as Apple’s Neural Engine and various on-device NPUs—that emphasize energy-efficient inference in mobile or edge computing environments.[18]
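The dense systolic-array designs mentioned above compute matrix products by streaming operands through a grid of multiply-accumulate elements, so each weight is fetched once rather than repeatedly from memory. The following is a toy functional model of that dataflow (a sketch, not any vendor's implementation): each "cycle" propagates one wavefront of partial products.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy model of a weight-stationary systolic array: each processing
    element conceptually holds one weight of B and accumulates partial
    products as rows of A stream past, so C = A @ B emerges from purely
    local multiply-accumulate steps."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for step in range(k):              # one streamed wavefront per "cycle"
        for i in range(n):
            for j in range(m):
                C[i, j] += A[i, step] * B[step, j]   # local MAC at a PE
    return C

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
print(systolic_matmul(A, B))
```

Real arrays pipeline many wavefronts simultaneously; the point of the sketch is that no element of B ever needs to be re-read from external memory.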

Memory and interconnects


AI accelerators rely on fast memory and inter-chip links to manage the large data volumes of training and inference. High-bandwidth memory (HBM) stacks, standardized as HBM3 in 2023, provide terabytes-per-second throughput on modern GPUs and ASICs.[19] These accelerators are often connected through dedicated fabrics such as NVIDIA’s NVLink and NVSwitch or optical interconnects used in TPU systems to scale performance across thousands of chips.[20]
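Why memory bandwidth matters so much can be seen from a roofline-style estimate: a kernel saturates the compute units only if it performs enough arithmetic per byte moved. The numbers below are illustrative placeholders, not vendor specifications.

```python
# Roofline back-of-envelope: compare a kernel's arithmetic intensity
# (FLOPs per byte moved) against the hardware's compute/bandwidth ratio.
peak_tflops = 1000          # hypothetical accelerator: 1000 TFLOPS peak
hbm_tb_per_s = 3            # hypothetical HBM bandwidth: 3 TB/s
ridge = peak_tflops / hbm_tb_per_s   # FLOPs/byte needed to saturate compute

def intensity(n, bytes_per_elem=2):
    """Arithmetic intensity of an n x n FP16 matmul: ~2*n^3 FLOPs over
    ~3*n^2 matrices moved."""
    return (2 * n**3) / (3 * n**2 * bytes_per_elem)

print(f"ridge point:      {ridge:.0f} FLOPs/byte")
print(f"n=256 matmul:  {intensity(256):.0f} FLOPs/byte (memory-bound)")
print(f"n=4096 matmul: {intensity(4096):.0f} FLOPs/byte (compute-bound)")
```

Small matrix multiplications fall below the ridge point and are limited by HBM bandwidth, which is why accelerators pair wide compute arrays with stacked memory.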

Sources

  1. ^ "Nvidia: The chip maker that became an AI superpower". BBC News. 25 May 2023. Retrieved 18 June 2023.
  2. ^ "NVIDIA H100 Tensor Core GPU Architecture Whitepaper". NVIDIA. 2022. Retrieved 4 November 2025.
  3. ^ "Google Cloud TPU v5 Announcement". Google Cloud Blog. 2023. Retrieved 4 November 2025.
  4. ^ Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017). "Efficient Processing of Deep Neural Networks: A Tutorial and Survey". Proceedings of the IEEE. 105 (12): 2295–2329. doi:10.1109/JPROC.2017.2761740. Retrieved 4 November 2025.
  5. ^ Gallego, Guillermo (2022). "Event-based Vision: A Survey" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2020.3008413. Retrieved 4 November 2025.
  6. ^ Strukov, D. B.; Snider, G. S.; Stewart, D. R.; Williams, R. S. (2008). "The Missing Memristor Found". Nature. 453: 80–83. doi:10.1038/nature06932. Retrieved 4 November 2025.
  7. ^ Maxfield, Max (24 December 2020). "Say Hello to Deep Vision's Polymorphic Dataflow Architecture". Electronic Engineering Journal. Techfocus media.
  8. ^ "Kinara (formerly Deep Vision)". Kinara. 2022. Retrieved 2022-12-11.
  9. ^ "Hailo". Hailo. Retrieved 2022-12-11.
  10. ^ Lie, Sean (29 August 2022). Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning. Cerebras (Report). Archived from the original on 15 March 2024. Retrieved 13 December 2022.
  11. ^ Research, AI (23 October 2015). "Deep Neural Networks for Acoustic Modeling in Speech Recognition". AIresearch.com. Retrieved 23 October 2015.
  12. ^ Kobielus, James (27 November 2019). "GPUs Continue to Dominate the AI Accelerator Market for Now". InformationWeek. Retrieved 11 June 2020.
  13. ^ Tiernan, Ray (2019). "AI is changing the entire nature of compute". ZDNet. Retrieved 11 June 2020.
  14. ^ "AI and Compute". OpenAI. 16 May 2018. Retrieved 11 June 2020.
  15. ^ "NVIDIA H100 Tensor Core GPU Architecture". NVIDIA. 2022. Retrieved 4 November 2025.
  16. ^ "AMD Instinct MI300X Accelerator". AMD. 2024. Retrieved 4 November 2025.
  17. ^ "Introducing Cloud TPU v5p and the AI Hypercomputer". Google Cloud Blog. 6 December 2023. Retrieved 4 November 2025.
  18. ^ "Apple Neural Engine". Apple Machine Learning Research. Retrieved 4 November 2025.
  19. ^ "JESD238A: High Bandwidth Memory (HBM3) Standard". JEDEC. January 2023. Retrieved 4 November 2025.
  20. ^ "NVIDIA Hopper Architecture In-Depth". NVIDIA Developer Blog. 22 March 2022. Retrieved 4 November 2025.
from Grokipedia
Hardware for artificial intelligence refers to specialized computing components and architectures designed to accelerate the computationally intensive operations required for developing, training, and deploying AI models, particularly deep neural networks, by leveraging parallelism, optimized memory systems, and energy-efficient processing to outperform general-purpose processors like CPUs. AI computing power encompasses this infrastructure, including hardware optimized for both training and inference phases. Its importance stems from the evolution of large AI models toward multimodal and agent-based systems, which propels explosive demand for computing resources, with optical modules and other hardware emerging as critical bottlenecks. These systems address the limitations of traditional von Neumann architectures, which suffer from bottlenecks in data movement and processing, enabling faster execution of tasks such as the matrix multiplications and convolutions central to machine-learning workloads. Key types of AI hardware include graphics processing units (GPUs), which excel at parallel computations for training models, as exemplified by NVIDIA's Tesla P100 and Blackwell B200 series that support high-throughput floating-point operations; tensor processing units (TPUs), Google's custom ASICs featuring systolic arrays for efficient tensor operations, such as the TPU v7 released in 2025 with 8-bit integer support and 7.37 TB/s bandwidth as of April 2025; and field-programmable gate arrays (FPGAs), reconfigurable devices like the ALINX AX7Z020 suited for real-time, adaptable AI inference. Additional categories encompass application-specific integrated circuits (ASICs), such as the DianNao family optimized for neural-network layers with integrated on-chip memory for reduced latency, and neuromorphic hardware, brain-inspired chips like Intel's Loihi that mimic spiking neural networks for low-power inference at the edge. These accelerators also incorporate advanced memory solutions, including high-bandwidth memory (HBM) and non-volatile options like SSDs, to handle the massive datasets involved in AI.
The evolution of AI hardware traces back to the 2012 AlexNet breakthrough, which popularized GPUs for convolutional neural networks due to their parallel processing capabilities, marking a shift from CPU-dominated computing to specialized accelerators amid the rise of large-scale models like GPT-3. Over the past decade, performance has improved by approximately 2–3 orders of magnitude, driven by trends toward lower-precision formats (e.g., INT8 for inference and emerging FP8 for training) and fabrication advances; power consumption has scaled accordingly, with NVIDIA's H100 drawing up to 700 W at peak throughput. Global AI compute capacity is doubling every 7 months, outpacing Moore's law, with NVIDIA maintaining approximately 85% market share in the AI GPU/accelerator market despite competition from AMD (Instinct series), Intel (Gaudi series and Jaguar Shores), Google/Alphabet (TPUs such as Ironwood), AWS (Trainium and Inferentia), Microsoft (Maia), Meta (MTIA), Qualcomm (edge and cloud AI), Cerebras (wafer-scale engines), Tenstorrent, and Huawei. Revenue capture in the expanding AI hardware market is influenced by a company's baseline revenue size; the evolution of market share (semiconductor firms have historically captured 20–30% of value in technology stacks but could reach 40–50% in AI oligopolies); product diversification across AI-specific and broader portfolios in compute, memory, and networking; and competitive positioning via technical leadership in digital signal processors (DSPs), switches, and co-packaged optics (CPO). External constraints, including geopolitical risks from U.S.-China rivalry, export controls, and tech-sovereignty fragmentation, also play a significant role. Small pure-play firms benefit from higher revenue elasticity owing to their low baselines and concentrated exposure to AI opportunities in niche ASICs and specialized hardware.
This hardware is crucial for applications ranging from autonomous vehicles and generative AI to edge devices, offering benefits like reduced energy use, which is critical as AI's computational demands drive rising global electricity consumption, and enhanced accuracy in specialized domains. Despite these advances, challenges persist, including programming complexity for reconfigurable devices like FPGAs, inflexibility in ASICs that limits adaptability to evolving models, and scaling issues for large language models. Growth in the AI infrastructure sector is driven by sustained demand for memory bandwidth and compute power, advances in software enablement, increasing enterprise adoption, sovereign AI initiatives, and the expansion of edge-inference capabilities. Future directions emphasize heterogeneous integration of accelerators, domain-specific optimizations, in-memory computing with non-volatile memories like ReRAM, and emerging paradigms such as photonic and memristor-based systems to further improve efficiency and accessibility.

Historical Developments

Lisp Machines

Lisp machines were general-purpose computers specifically designed to execute Lisp programs efficiently, emerging as early specialized hardware for artificial intelligence and symbolic computation in the 1970s and 1980s. The concept originated in a 1973 proposal at the MIT Artificial Intelligence Laboratory, where Lisp had been developed in the late 1950s and early 1960s, and the Lisp Machine Project was initiated by Richard Greenblatt in 1974. This project produced prototypes like the CONS machine in 1975, followed by the influential CADR machine around 1977–1980, which offered significant performance improvements over general-purpose hardware like the PDP-10 and served as the foundation for commercial efforts. By the early 1980s, former MIT researchers had founded companies such as Symbolics in 1980 and Lisp Machines Incorporated (LMI) in 1979, commercializing these designs to support AI research and development. Key architectural features of Lisp machines were tailored to Lisp's demands for dynamic memory management and list processing, including tagged memory architectures that embedded type information directly in memory words for rapid type checking and dispatch. Hardware implementations often incorporated microcode to accelerate Lisp primitives such as cons, car, and cdr, while dedicated support for garbage collection, via methods like reference counting or mark-and-sweep, minimized pauses in symbolic computation. Virtual memory systems were optimized for handling the large, fragmented data structures common in AI applications, and many models featured high-resolution bitmapped displays with early graphical user interfaces to aid interactive programming. These optimizations, building on earlier influences like the SECD machine model from the 1960s, enabled Lisp execution speeds significantly faster than on general-purpose hardware of the era, such as the PDP-10.
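The tagged-memory idea above can be sketched in a few lines: the type tag travels with every datum, so primitives like car and cdr can check it before operating, which the Lisp machines did in hardware rather than software. The function and tag names below are purely illustrative.

```python
# Toy model of tagged cons cells: each value carries its type tag, so
# list primitives can type-check on access (Lisp machines did this
# check in hardware, trapping on a tag mismatch).
def cons(a, b):
    return ("CONS", a, b)           # tag stored alongside the datum

def car(cell):
    tag, head, _ = cell
    assert tag == "CONS", "hardware type trap"
    return head

def cdr(cell):
    tag, _, tail = cell
    assert tag == "CONS", "hardware type trap"
    return tail

lst = cons(1, cons(2, ("NIL",)))    # the list (1 2)
print(car(lst), car(cdr(lst)))      # 1 2
```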
Notable models included the MIT CADR, which influenced subsequent commercial machines like Symbolics' LM-2 (1980) and the more advanced Symbolics 3600 series introduced in 1983, capable of handling complex AI workloads with 4 MB of memory expandable to 32 MB. LMI's Lambda machine (1983) and Texas Instruments' Explorer series (starting 1983) also drew from CADR designs, offering similar performance for around $70,000–$125,000 per unit and finding adoption in research labs for tasks like symbolic manipulation. Symbolics machines, in particular, peaked with over $100 million in revenue by 1986, underscoring their role in equipping AI researchers with powerful tools. The decline of Lisp machines began in the late 1980s with the commoditization of high-performance workstations, such as Sun Microsystems' models, which offered comparable or superior capability for Lisp via software implementations at a fraction of the cost: around $14,000 versus $100,000 for a Lisp machine. The AI winter that followed reduced funding, such as the end of DARPA's Strategic Computing Initiative, further eroding demand as expert systems and symbolic AI shifted toward more portable implementations on general-purpose hardware. By the early 1990s, companies like Symbolics faced bankruptcy, marking the end of dedicated Lisp hardware production. Despite their short commercial lifespan, Lisp machines profoundly impacted AI by enabling efficient development of symbolic systems during the 1970s and 1980s, including expert systems and early frame-based reasoning tools. They facilitated rapid prototyping at institutions like MIT, where they supported vision, robotics, and natural-language research, and they influenced the standardization of Common Lisp in 1984. This hardware specialization accelerated advances in the declarative and functional programming paradigms central to early AI, even as numerical approaches later came to dominate.

Dataflow Architectures

Dataflow architectures represent a departure from traditional von Neumann models: execution is driven by the availability of data operands rather than by a sequential instruction stream dictated by a program counter. In this model, computations are represented as directed graphs in which nodes denote operations and edges indicate data dependencies; an operation fires only when all required inputs are present, enabling inherent parallelism without explicit synchronization. The concept originated in the work of Jack Dennis at MIT in the early 1970s, with foundational ideas outlined in a 1975 paper proposing a basic data-flow processor that emphasized demand-driven evaluation to exploit concurrency in scientific and symbolic computations. Key implementations in the 1980s demonstrated both static and dynamic variants of the dataflow model. Static dataflow architectures, as pioneered by Dennis's group, restrict each data arc to a single token at a time, simplifying hardware but limiting concurrency for recursive or iterative tasks. In contrast, dynamic dataflow models, such as MIT's Tagged Token Dataflow Architecture (TTDA) and the Manchester Dataflow Machine, use unique tags on tokens to allow multiple instances per arc, supporting higher parallelism at the cost of increased overhead for tag management. The TTDA, developed by Arvind and colleagues, employed a multiprocessor design with actors executing functional code on tagged tokens, while the Manchester machine featured a pipelined ring design with dynamic tagging for general-purpose parallel processing, operational since 1981. At the hardware level, dataflow machines incorporate specialized components like token-matching units, often content-addressable memories (CAMs), that store incoming tokens and pair operands with matching destination and iteration tags before dispatching them to execution units.
Communication occurs via packet-switched networks that route tokens asynchronously between processing elements, eliminating the need for a global clock and allowing fine-grained parallelism without von Neumann bottlenecks. These designs, typically comprising arrays of simple processors connected through switching fabrics, prioritize data movement to enable massive concurrency in graph-based computations. In early AI applications, dataflow architectures facilitated parallel evaluation of logic and functional languages, particularly for nondeterministic search and theorem proving. The Manchester machine, for instance, supported a dataflow implementation of Prolog-like logic programming, in which resolution and unification were modeled as token flows, accelerating AI tasks like expert systems by distributing search spaces across nodes. Similarly, TTDA's support for functional languages enabled parallel reduction of lambda expressions, aiding symbolic AI computations. These systems demonstrated potential for AI workloads with irregular parallelism, though adoption remained limited to research prototypes. Despite their theoretical appeal, dataflow architectures faced scalability challenges due to overhead in token matching and network contention, which degraded performance as the number of processing elements grew beyond a few dozen. Tag resolution and storage demands also imposed memory penalties, making large-scale implementations inefficient compared with emerging alternatives. Consequently, while direct successors waned by the late 1980s, dataflow principles influenced modern designs like systolic arrays, which adopt structured data flows for matrix operations in AI accelerators, retaining operand-driven execution but with fixed topologies to mitigate overhead.
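The token-matching rule described above (an operation fires only once all its operands have arrived, regardless of arrival order) can be sketched as a tiny interpreter. The graph, node names, and port numbering are illustrative, not any historical machine's instruction set.

```python
from collections import defaultdict

# node -> (operation, arity, downstream (target, port) list)
graph = {
    "add": (lambda a, b: a + b, 2, [("mul", 0)]),
    "mul": (lambda a, b: a * b, 2, []),
}
inbox = defaultdict(dict)   # node -> {input port: waiting token}
results = {}

def send(node, port, value):
    """Deliver a token; the node fires only when all operands are present."""
    inbox[node][port] = value
    op, arity, targets = graph[node]
    if len(inbox[node]) == arity:            # token match complete: fire
        operands = inbox.pop(node)
        out = op(*(operands[p] for p in range(arity)))
        results[node] = out
        for tgt, tgt_port in targets:        # route result tokens onward
            send(tgt, tgt_port, out)

send("add", 0, 3)      # tokens may arrive in any order...
send("mul", 1, 10)
send("add", 1, 4)      # ...the last operand triggers add, which feeds mul
print(results["mul"])  # (3 + 4) * 10 = 70
```

No program counter sequences the work: execution order falls out of operand availability alone, which is exactly the property the token-matching hardware enforced.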

General-Purpose Hardware

Central Processing Units

Central processing units (CPUs) form the foundational general-purpose hardware for artificial intelligence (AI) workloads, evolving from single-core designs in the 1990s to multi-core architectures that support parallel processing through software optimizations and hardware extensions tailored for vectorized and matrix-based computations. In the 1990s, x86 processors like Intel's Pentium series operated primarily as single-core systems focused on sequential scalar instructions, which limited their ability to handle the emerging parallel demands of early AI algorithms such as neural-network training. In the 2000s, the shift to multi-core designs, exemplified by Intel's Core series introduced in 2006, enabled concurrent execution of AI tasks, allowing libraries and frameworks to distribute workloads across cores for small-scale model training and inference. ARM-based processors, which gained traction for AI in the 2010s due to their energy efficiency, further extended CPU applicability to edge devices, where power constraints are critical. Key features enhancing CPU suitability for AI include single-instruction, multiple-data (SIMD) instruction sets, which operate on multiple data elements simultaneously. Early SIMD extensions such as Streaming SIMD Extensions (SSE) debuted in Intel processors in 1999, followed by Advanced Vector Extensions (AVX) in 2008, culminating in AVX-512 in 2016, which introduced 512-bit registers capable of processing up to 64 floating-point operations per cycle per core at AI-relevant precisions like FP16. These extensions, particularly AVX-512's Vector Neural Network Instructions (VNNI), accelerate deep-learning primitives such as convolutions and matrix multiplications by reducing instruction overhead. CPU cache hierarchies have also been refined with larger L3 caches and prefetching mechanisms to minimize data-movement latency during matrix operations, a common bottleneck in AI.
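The 64-FLOPs-per-cycle figure quoted above follows directly from the register width, assuming FP16 lanes and a single fused multiply-add (FMA) unit per core (cores with two FMA units would double it):

```python
# Back-of-envelope for the per-core SIMD figure: a 512-bit register
# holds 32 FP16 lanes, and a fused multiply-add counts as 2 FLOPs.
register_bits = 512
fp16_bits = 16
lanes = register_bits // fp16_bits      # 32 elements processed in parallel
flops_per_cycle = lanes * 2             # FMA = one multiply + one add
print(flops_per_cycle)                  # 64
```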
Complementing these hardware advances, software libraries implementing the Basic Linear Algebra Subprograms (BLAS) interface exploit multi-core parallelism and SIMD to execute AI building blocks such as general matrix multiplication (GEMM) on CPUs. CPUs play a vital role in AI inference on resource-constrained edge devices, where their low-latency sequential processing suits real-time applications, and in training compact models that do not require massive parallelism; they often operate in hybrid configurations, handling data preprocessing and control flow alongside specialized accelerators such as GPUs. High-core-count CPUs with large caches, such as those featuring AMD's 3D V-Cache technology, improve data preparation and AI simulations by speeding cache-heavy operations and multitasking. In GPU-accelerated AI workloads, CPUs primarily manage data loading, preprocessing, orchestration, and system tasks; a weak CPU can create minor bottlenecks in data pipelines but typically does not drastically limit high-end GPU performance in most consumer or local AI setups. For running smaller AI models locally, a powerful CPU on a standard laptop suffices, though performance is significantly slower than on a GPU. Performance figures illustrate this niche: a single modern CPU core with matrix extensions can deliver up to 2 TFLOPS in FP16 for AI workloads, scaling to tens of TFLOPS across multi-core systems, as demonstrated in benchmarks for inference tasks like image classification. Despite these capabilities, CPUs face limits to AI scalability, offering far lower parallelism than GPUs (typically 10–100x fewer cores optimized for independent threads) and higher power consumption per operation for dense tensor computations, often exceeding 10–20 pJ per operation compared with GPU efficiencies. GPUs complement CPUs by handling high-throughput parallel tasks in training pipelines.
Contemporary examples underscore CPU adaptations for AI: AMD's EPYC processors, such as the EPYC 9005 series, target server-side inference with up to 192 cores, 12-channel DDR5 memory support, and AVX-512, delivering up to 37% higher AI throughput per generation across diverse model sizes. Similarly, Apple's M-series chips, starting with the M1 in 2020, integrate ARM-based multi-core CPUs with a dedicated Neural Engine co-processor on a unified system-on-chip, enabling efficient on-device AI inference, such as in image processing, while the CPU manages general tasks, achieving up to 11 TOPS for the system with low power draw.

Graphics Processing Units

Graphics processing units (GPUs) were originally designed for rendering complex graphics in video games and simulations, but their architecture proved highly suitable for accelerating artificial-intelligence workloads because of its emphasis on parallel processing. NVIDIA's GeForce 256, released in 1999, was the first GPU with dedicated hardware for 3D transformation and lighting, laying the groundwork for parallel computation beyond graphics. This evolved significantly with the introduction of CUDA in 2006, a parallel computing platform that enabled general-purpose computing on GPUs (GPGPU), allowing developers to apply GPU power to non-graphics tasks such as scientific simulation and, later, AI model training. At their core, modern GPUs feature thousands of smaller cores organized in single-instruction, multiple-data (SIMD) arrays, enabling massive parallelism for operations such as the matrix multiplications central to neural networks. NVIDIA's Volta architecture, launched in 2017, introduced tensor cores: specialized hardware units optimized for mixed-precision computation in deep learning, accelerating operations like matrix multiply-accumulate in formats such as FP16 and INT8. Building on this, the Ampere architecture in 2020 incorporated unified memory models, allowing seamless data sharing between CPU and GPU without explicit transfers, which reduces latency in AI pipelines. In AI applications, GPUs have become dominant for training deep-learning models, powering frameworks like TensorFlow and PyTorch that abstract GPU kernels for efficient parallel execution of backpropagation and convolutions.
For running local AI models, older GPU generations remain accessible options for inference, particularly with quantized models that run efficiently on limited hardware, while single-board computers provide platforms for edge inference. GPUs lead inference workloads due to their parallel processing capabilities: as of Q2 2025, NVIDIA held approximately 85% of the overall AI accelerator market (including GPUs and other accelerator types), about 92% of the discrete GPU market, and over 80% of the broader AI hardware/accelerator market, while competitors such as AMD (around 2% of AI accelerators) and Intel held significantly smaller shares, and custom ASICs (e.g., Broadcom at around 10%) competed in specific inference segments. For local inference, an NVIDIA GPU with at least 8–16 GB of VRAM is recommended for smaller to medium-sized models, though models with 70B+ parameters typically require 24 GB+ of VRAM or multiple GPUs at full precision. High-end GPUs deliver peak performance exceeding 300 TFLOPS in FP16 precision, enabling far faster training of large models than CPUs. This parallelism is particularly effective for the matrix-heavy computations in neural networks, with GPUs often paired with CPUs that handle sequential tasks in hybrid systems. Key advancements have further tailored GPUs for AI scalability in data centers. NVIDIA's A100 GPU, released in 2020, combines tensor cores with high-bandwidth memory (HBM2e) to support multi-terabyte-scale models, achieving up to 624 TOPS in INT8 for inference workloads (dense tensor operations). Additionally, multi-instance GPU (MIG) technology, introduced in the same architecture, partitions a single GPU into isolated instances for concurrent workloads, improving resource utilization in cloud environments.
Subsequent architectures, such as Hopper with the H100 GPU released in 2022, deliver up to 989 TFLOPS of FP16 tensor performance, while the Blackwell architecture, launched in 2024 with the B200 GPU, achieves up to 20 petaFLOPS of AI performance, improving efficiency for large-scale training and inference as of 2025. Despite these strengths, GPUs face challenges in AI hardware, including memory-bandwidth limitations that can bottleneck data movement for very large models, necessitating techniques like model parallelism. Programming complexity also persists, as developers must write custom kernels to optimize performance, which requires expertise in low-level parallel programming.
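The VRAM guidance for local models follows from simple arithmetic on parameter counts: weights alone occupy parameters times bytes per parameter, before any activation or KV-cache overhead. A sketch:

```python
# Rough VRAM footprint of model weights only (activations and KV cache
# add more on top of this).
def weights_gb(params_billion, bytes_per_param):
    return params_billion * 1e9 * bytes_per_param / 1e9

print(weights_gb(7, 2))    # 7B model in FP16 (2 bytes/param):  14.0 GB
print(weights_gb(7, 0.5))  # 7B model quantized to 4-bit:        3.5 GB
print(weights_gb(70, 2))   # 70B model in FP16:                140.0 GB
```

The 70B FP16 case exceeds any single consumer GPU's VRAM, which is why full-precision models of that size require multiple GPUs while 4-bit quantization brings mid-sized models within reach of 8–16 GB cards.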

Specialized AI Accelerators

Tensor Processing Units

Tensor processing units (TPUs) are custom application-specific integrated circuits (ASICs) developed by Google to accelerate machine-learning workloads, particularly those dominated by tensor operations such as the matrix multiplications in neural networks. Optimized for high throughput and energy efficiency, TPUs integrate with frameworks like TensorFlow via the XLA compiler, which translates high-level computations into low-level instructions for the hardware. This specialization enables low-latency, scalable inference and training, surpassing general-purpose processors on AI-specific tasks. Google began deploying TPUs internally in 2015 to handle the growing demands of its AI services, such as image recognition and machine translation. The first generation focused on inference, with public announcement in 2017 and availability through Google Cloud in early 2018. Subsequent generations added training capabilities, and TPUs numbered over 100,000 units across Google's data centers by the late 2010s. At the core of the TPU architecture is a systolic-array design, which performs dense matrix multiplications efficiently by streaming data through a grid of processing elements, minimizing memory accesses and power consumption. For instance, the TPU v1 features a 256×256 systolic array capable of 92 tera-operations per second (TOPS) in 8-bit integer precision. Later versions build on this with enhancements like liquid cooling, optical interconnects, sparsity support, 3D-stacked high-bandwidth memory (HBM3), and up to 4x higher interconnect bandwidth (9,216 Gb/s) to handle larger models and pods scaling to 8,960 chips. As of November 2025, recent generations include v5e (efficiency-focused, 197 TFLOPS BF16 per chip), v5p (performance variant, 459 TFLOPS BF16), v6e/Trillium (optimized for large language models, 926 TFLOPS BF16), and v7 (inference-optimized with FP8 support, approximately 4,614 TFLOPS FP8 per chip). Key TPU versions include:
Version | Release year | Key features
TPU v1 | 2015 (internal), 2017 (announced) | Inference-focused; 92 TOPS (INT8); systolic array for matrix ops.
TPU v2 | 2017 | Added training support; 180 teraFLOPS (BF16); first pods with 256 chips.
TPU v3 | 2018 | Pod scale with liquid cooling; 420 teraFLOPS (BF16); 8x faster than v2.
TPU v4 | 2021 | Enhanced sparsity and optical switches; 275 teraFLOPS (BF16/INT8); 32 GiB HBM2 memory.
TPU v5e | 2023 | Efficiency variant; 197 teraFLOPS (BF16), 393 TOPS (INT8); 16 GiB HBM; improved perf/watt.
TPU v5p | 2024 | Performance variant; 459 teraFLOPS (BF16); HBM3 memory; 2x v5e throughput.
TPU v6e (Trillium) | 2024 | LLM-optimized; 926 teraFLOPS (BF16); advanced sparsity; pods up to 8,960 chips.
TPU v7 (Ironwood) | 2025 (announced) | Inference-focused; ~4,614 teraFLOPS (FP8); 192 GiB HBM; real-time AI support.
Edge TPU | 2019 | Mobile/edge inference; 4 TOPS (INT8); integrated in Coral devices for IoT.
TPUs deliver high throughput in formats like BF16 and INT8, with v4 achieving 275 teraFLOPS per chip and low-latency inference through dedicated hardware for activations and vector operations. The XLA compiler optimizes code for these characteristics, enabling deterministic execution without caches or branching overhead. TPUs have significantly affected AI by accelerating Transformer-based models in Google services, such as neural machine translation and ranking in Search. For example, TPU v2 enabled training a large-scale model in an afternoon on one-eighth of a pod, compared with a full day on 32 high-end GPUs, demonstrating 2–3x better efficiency for similar training tasks in practice. Overall, early TPUs provided 15–30x higher performance and 30–80x better performance per watt than contemporary CPUs and GPUs for inference workloads. TPUs are available via Google Cloud for scalable deployments, supporting pods of up to thousands of chips for large-scale training and inference. For edge applications, the Coral platform with Edge TPUs has been offered since 2019, enabling efficient on-device AI in IoT and mobile devices like the Pixel Neural Core. As predecessors to TPUs, graphics processing units (GPUs) laid the groundwork for parallel compute in AI but lack the same degree of tensor-specific optimization.
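The BF16 format favored by TPUs keeps float32's 8 exponent bits but truncates the mantissa to 7 bits, so it can be modeled by zeroing the low 16 bits of a float32: wide dynamic range, coarse precision, which suits neural-network training. A sketch:

```python
import struct

def to_bf16(x):
    """Round a float toward zero to bfloat16 precision by truncating the
    low 16 bits of its float32 representation."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (truncated,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return truncated

print(to_bf16(3.141592653589793))   # 3.140625: only ~2-3 decimal digits kept
print(to_bf16(1e30))                # very large values remain representable
```

Unlike FP16, whose 5-bit exponent overflows near 65,504, BF16 covers the same range as float32, which is why gradients rarely need rescaling when training in BF16.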

Field-Programmable Gate Arrays

Field-programmable gate arrays (FPGAs) are integrated circuits composed of an array of programmable logic blocks, such as configurable logic blocks containing look-up tables and flip-flops, interconnected via a reconfigurable routing network that enables users to implement custom digital circuits post-manufacturing. This allows for field reconfiguration without altering the hardware, distinguishing FPGAs from fixed-function chips. The first commercial FPGA, the XC2064 from Xilinx, was introduced in 1985, marking the inception of the technology with approximately 1,000 logic gates. Modern FPGAs, such as Intel's Stratix 10 series, integrate advanced features like high-bandwidth memory interfaces and hardened DSP blocks, supporting densities exceeding millions of logic elements for complex applications. In AI contexts, FPGAs have been adapted through overlay frameworks that abstract hardware details for neural network deployment, enabling efficient acceleration of inference tasks. Xilinx's Vitis AI, released in 2019, provides tools for quantizing models to custom precisions like INT8, optimizing for low-latency inference on FPGA resources while supporting frameworks such as TensorFlow and PyTorch. These overlays map convolutional layers and other operations onto FPGA logic and DSP slices, reducing compilation times to minutes and facilitating rapid prototyping of AI pipelines. FPGAs offer key advantages in AI hardware through their reconfigurability, allowing designs to evolve with changing model architectures without new fabrication, which is ideal for research prototyping and iterative development. In edge AI scenarios, they achieve superior power efficiency compared to GPUs, as demonstrated by Xilinx's Versal adaptive compute acceleration platform (ACAP), announced in 2018 with general availability in 2019, which incorporates dedicated AI engines for scalar, vector, and tensor processing at low power.
Recent advancements as of 2025 include AMD's Versal AI Edge Series Gen 2 (2024, up to 228 TOPS INT8 for edge inference) and Intel's Agilex 9 FPGAs (2024, up to 1,400 TOPS INT8 with integrated AI tensor accelerators and 40G Ethernet). These engines deliver high throughput per watt, enabling sustained performance in thermally constrained environments. Common use cases for FPGAs in AI include real-time video analytics, where low-latency inference detects objects in streams from cameras, and 5G infrastructure, where on-device inference in base stations minimizes data transmission delays. For instance, the Alveo U280 accelerator, released in 2018, achieves approximately 21 TOPS of INT8 performance for such workloads, supporting high-bandwidth memory for efficient data handling in pipelines. More advanced devices like Intel's Stratix 10 NX FPGA reach up to 143 TOPS INT8, illustrating scalable performance for demanding edge applications. Despite these benefits, FPGAs incur higher development costs due to the expertise required for hardware description languages like Verilog or VHDL, and their peak throughput remains lower than that of application-specific integrated circuits (ASICs) for volume production, as FPGAs' general-purpose fabric introduces overhead in clock speeds and resource utilization. ASICs can be viewed as hardened implementations of optimized FPGA designs for fixed, high-volume AI deployments.
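The INT8 quantization that toolchains of this kind perform can be illustrated with a generic symmetric-quantization sketch (this is not Vitis AI's actual algorithm, and the helper names are invented for illustration):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: map floats in
    [-max|w|, +max|w|] onto integer codes in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

weights = [0.52, -1.30, 0.07, 0.91]
codes, scale = quantize_int8(weights)
approx = dequantize(codes, scale)
print(codes)   # [51, -127, 7, 89]
print(max(abs(a - b) for a, b in zip(weights, approx)) < scale)  # True
```

The integer codes are what the FPGA's DSP slices multiply; the single floating-point scale is applied once per tensor, which is why INT8 datapaths are so much cheaper than FP32 ones.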

Application-Specific Integrated Circuits

Application-specific integrated circuits (ASICs) are custom-designed integrated circuits optimized for particular applications, such as accelerating AI workloads, rather than serving general-purpose computing needs. Unlike programmable hardware, ASICs are fabricated with fixed functionality tailored to specific tasks like inference or training, enabling superior performance and energy efficiency for those operations. The development process for AI ASICs involves several stages, including architectural design, verification, synthesis, and fabrication, typically spanning 12-18 months due to the complexity of custom silicon and testing. Prominent examples of AI ASICs include Apple's Neural Engine, integrated into the A11 Bionic system-on-chip released in 2017 for the iPhone X, which features a dual-core neural processing unit capable of up to 600 billion operations per second for real-time tasks like image recognition. More recent iterations, such as the A18 in the iPhone 16 (2024), deliver 35 tera-operations per second (TOPS). Huawei's Ascend series, starting with the Ascend 310 neural processing unit announced in 2018, targets AI inference with a focus on high-throughput tensor operations suitable for edge and cloud deployments; the Ascend 910C (2024) achieves 480 teraFLOPS FP16. Graphcore's Intelligence Processing Unit (IPU), first introduced in 2016, employs a multiple-instruction, multiple-data (MIMD) architecture to handle graph-based AI models efficiently, allowing fine-grained parallelism across thousands of independent processing threads; the Bow IPU (2023) uses a 4-chiplet design with 350 TOPS (INT8). Another notable design is Cerebras Systems' Wafer-Scale Engine (WSE), unveiled in 2019, which integrates 400,000 AI-optimized cores on a single massive chip spanning 46,225 square millimeters, enabling unprecedented scale for deep learning; the WSE-3 (2024) features 900,000 cores and 125 petaFLOPS of AI performance. Major hyperscalers have developed custom AI ASICs as key competitors to NVIDIA in the AI chip market.
Google's Tensor Processing Units (TPUs) are specialized ASICs for tensor operations, with ongoing iterations enhancing performance for both training and inference. Amazon's Inferentia chips are optimized for deep learning inference, with the second-generation Inferentia2 delivering up to 190 teraFLOPS of FP16 performance and supporting data types like FP32, TF32, and configurable FP8, powering EC2 Inf2 instances for cost-efficient generative AI applications. Amazon's Trainium series focuses on training, with the Trainium3, AWS's first 3nm AI chip released in 2025, providing 2.52 petaFLOPS of FP8 compute and 144 GB of HBM3e memory for advanced workloads like large language models. Microsoft's Azure Maia 100, introduced in 2023, is a custom AI accelerator built on a 5nm process for large-scale AI training and inference in Azure, featuring high-bandwidth Ethernet networking at 4.8 terabits per accelerator; the next-generation Maia 200 is slated for mass production in 2026. Meta's Meta Training and Inference Accelerator (MTIA) targets recommendation models and generative AI, with the second-generation MTIA (2024) achieving 354 teraFLOPS of FP16/BF16 performance on a 5nm process and supporting sparsity for efficient computations, deployed at scale across Meta's data centers. Key design elements in AI ASICs emphasize domain-specific optimizations to address the demands of neural networks. These include custom instruction sets for operations like matrix multiplications and convolutions, which reduce overhead compared to general-purpose instructions. On-chip memory hierarchies, often using high-bandwidth static RAM, are prioritized to enhance data locality and minimize latency from external DRAM accesses, crucial for handling large model weights. 
Support for model sparsity (exploiting zero-valued parameters in pruned networks) is increasingly incorporated through dedicated hardware units that skip unnecessary computations, boosting throughput without proportional power increases. In terms of performance, AI ASICs achieve high efficiency metrics, such as 10-20 times better performance per watt than graphics processing units for inference tasks, making them ideal for power-constrained environments. For instance, early such chips delivered 30-80 times higher energy efficiency in tensor operations compared to contemporary GPUs like the NVIDIA K80. Such metrics enable deployment in resource-limited settings, including smartphones for on-device AI and autonomous vehicles for real-time processing. Tensor Processing Units (TPUs) represent a prominent subclass of AI ASICs focused on tensor operations, further illustrating this efficiency paradigm. Emerging trends in the 2020s involve adopting chiplet-based designs for AI ASICs to improve scalability and yield, allowing modular integration of smaller dies into larger systems while reducing manufacturing risks associated with monolithic wafers. This approach facilitates higher core counts and interconnect bandwidth for massive AI models, with market projections indicating rapid growth in chiplet adoption for AI accelerators.
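The zero-skipping idea behind hardware sparsity support can be sketched in software (a toy model, not any vendor's implementation; the multiply-accumulate counter stands in for the work a sparsity-aware datapath avoids issuing):

```python
def dense_dot(weights, activations):
    """Baseline: every multiply-accumulate (MAC) is performed."""
    return sum(w * a for w, a in zip(weights, activations)), len(weights)

def sparse_dot(weights, activations):
    """Zero-skipping: MACs are issued only for nonzero weights,
    mimicking sparsity support in AI ASICs."""
    acc, macs = 0.0, 0
    for w, a in zip(weights, activations):
        if w != 0.0:
            acc += w * a
            macs += 1
    return acc, macs

# A pruned weight vector that is 75% zeros.
weights = [0.5, 0.0, 0.0, -1.0, 0.0, 0.25, 0.0, 0.0]
acts = [1.0] * 8
dense_val, dense_macs = dense_dot(weights, acts)
sparse_val, sparse_macs = sparse_dot(weights, acts)
print(dense_val == sparse_val)   # True: same result
print(dense_macs, sparse_macs)   # 8 3: fewer MACs issued
```

In silicon, the skipped operations translate directly into saved cycles and energy, which is why throughput rises without a proportional power increase.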

Neuromorphic and Emerging Hardware

Spiking Neural Network Processors

Spiking neural network (SNN) processors represent a class of neuromorphic hardware designed to emulate the brain's event-driven computation by using discrete temporal spikes rather than continuous activation values, drawing inspiration from biological neurons. Unlike traditional artificial neural networks (ANNs), SNNs process information through asynchronous spike events that propagate only when thresholds are met, enabling sparse and temporally dynamic representations suitable for low-power, real-time processing. This paradigm often employs models like the leaky integrate-and-fire (LIF) neuron, in which the membrane potential accumulates incoming spikes and leaks over time until firing, mimicking biological neuron behavior. Key implementations of SNN processors include IBM's TrueNorth, introduced in 2014, which integrates 1 million neurons across 4096 cores in a 65 mW asynchronous chip fabricated on a 28 nm process, emphasizing scalability and defect tolerance for large-scale neuromorphic systems. Intel's Loihi, released in 2017 and detailed in a 2018 IEEE Micro article, features a 60 mm² die in 14 nm technology with on-chip learning capabilities, supporting up to 128 neuromorphic cores and enabling adaptive spike-timing-dependent plasticity for learning and inference tasks; its successor, Loihi 2, released in 2021, expands to over 1 million neurons per chip with improved performance. The 2024 Hala Point system scales Loihi 2 to 1,152 chips, achieving 1.15 billion neurons for research into sustainable AI. BrainChip's Akida, announced in 2018, employs a fully digital, event-based design with 80 neural processing units connected via an AXI mesh, optimized for temporal pattern processing in edge devices and achieving orders-of-magnitude efficiency gains through sparse spike routing. These processors typically adopt asynchronous, hybrid analog-digital circuits to handle spike-based communication, reducing power consumption to levels like TrueNorth's 65 mW during operation by avoiding constant clock-driven computations.
The LIF model is central: the membrane potential V(t) evolves as τ dV/dt = -V + I(t), where τ is the membrane time constant, and a spike fires when V exceeds a threshold, followed by a reset; this is implemented efficiently in hardware via integrate-and-fire circuits. In AI applications, SNN processors excel in edge sensing and robotics, where their event-driven nature supports sparse, real-time tasks such as visual object recognition or motor control with latencies under milliseconds and power efficiencies up to 10 times better than ANNs for similar accuracy on benchmarks like gesture recognition. For instance, Loihi has demonstrated event-driven vision and adaptive control for UAVs by processing sensory spikes in real time with minimal energy overhead. Despite these advantages, SNN processors face challenges including an immature software ecosystem, with limited frameworks for training large models compared to ANNs, and difficulty integrating billions of neurons without excessive interconnect latency or power spikes. Hardware realizations also struggle with analog variability in subthreshold circuits, hindering scalability for deep network deployments.
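The discrete-time form of the LIF dynamics above can be simulated in a few lines (an illustrative Euler-integration sketch with arbitrary parameters, not code for any particular neuromorphic chip):

```python
def simulate_lif(inputs, tau=10.0, threshold=1.0, dt=1.0):
    """Discrete-time leaky integrate-and-fire neuron:
    tau * dV/dt = -V + I(t); fire and reset when V crosses threshold."""
    v = 0.0
    spikes = []
    for t, i_t in enumerate(inputs):
        v += dt / tau * (-v + i_t)   # Euler step of the LIF equation
        if v >= threshold:
            spikes.append(t)         # emit a spike event...
            v = 0.0                  # ...and reset the membrane potential
    return spikes

# Constant input current: the neuron charges, fires, resets, repeats,
# converting an analog input level into a spike rate.
spike_times = simulate_lif([2.0] * 50)
print(spike_times)   # [6, 13, 20, 27, 34, 41, 48]
```

The regular spacing of the output spikes shows rate coding: a stronger input current would shorten the charging time and raise the firing rate.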

Optical and Photonic Processors

Optical and photonic processors represent an emerging class of hardware that leverages photons for computation, offering potential solutions to the energy and speed limitations of electronic systems in AI workloads. These processors utilize light waves to perform operations such as matrix multiplications, which are fundamental to neural networks, by encoding data into optical signals and processing them through integrated photonic circuits. Key principles include the use of Mach-Zehnder interferometers (MZIs) to manipulate light phases for linear transformations and wavelength-division multiplexing (WDM) to enable parallel processing across multiple optical channels, thereby bypassing electronic bottlenecks like resistive losses and delays in traditional interconnects. This approach exploits the inherent properties of photons, such as their ability to travel at the speed of light with minimal interference, to achieve high parallelism in computations essential for AI. Significant developments in photonic processors for AI include Lightmatter's Envise platform, introduced in 2021, which integrates photonic tensor cores capable of performing matrix-vector multiplications optically at speeds up to three times faster than comparable electronic systems while maintaining similar power efficiency. Similarly, Optalysys's FTalpha system, launched in 2020, employs optical Fourier transforms to accelerate convolutional neural networks (CNNs) by performing convolutions via fast Fourier transforms (FFTs) in the optical domain, reducing computation time for image processing tasks. More recent advances include MIT's all-optical photonic processor (2024), which performs full deep neural network computations with latencies below 0.5 nanoseconds and over 92% accuracy on classification tasks, and Lightmatter's Passage M1000 superchip (2025), providing 114 Tbps bandwidth for scalable AI interconnects.
These prototypes demonstrate the feasibility of hybrid photonic-electronic architectures, where photonic elements handle compute-intensive linear operations and electronics manage control and nonlinear functions. The primary advantages of photonic processors lie in their superior bandwidth and energy efficiency for AI-specific tasks. Optical interconnects can achieve petabit-per-second data rates, far exceeding electronic limits, enabling low-latency execution of the linear algebra operations central to deep learning models. Prototypes have shown energy savings of up to 100 times compared to electronic counterparts for inference and training tasks, primarily due to the absence of electrical conversion overheads and lower heat dissipation in optical processing. For instance, integrated photonic systems have demonstrated processing latencies below 0.5 nanoseconds for neural network inferences, with accuracies over 92% on benchmarks. In AI applications, photonic processors excel at accelerating transformer models through efficient attention mechanisms, which rely on large-scale matrix operations, and CNNs via optical convolutions for tasks like image recognition. Integration with silicon photonics platforms, such as those developed by Ayar Labs in the 2020s, further enhances AI systems by providing optical I/O chiplets that deliver 5-10 times higher bandwidth and 3-5 times better power efficiency than electrical interconnects, supporting scalable AI fabrics for trillion-parameter models. These advancements position photonic hardware as a complement to electronic accelerators, particularly for data-center-scale AI training and inference. Despite these benefits, photonic processors face notable hurdles, including high fabrication costs stemming from the need for precision processes to integrate optical components on silicon chips. Noise in analog optical systems, arising from misalignment, fabrication variations, and signal crosstalk, can degrade accuracy in multi-layer networks.
Additionally, current designs are largely limited to linear operations, as implementing nonlinear activations optically remains challenging without introducing significant power penalties or complexity. Addressing these issues will be crucial for broader adoption in AI hardware ecosystems.
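The linear transformations that MZI meshes implement can be illustrated with the real-valued 2x2 rotation at the heart of an idealized lossless interferometer (a simplified sketch that ignores complex phase and loss; function names are invented):

```python
import math

def mzi_rotation(theta, vector):
    """Idealized 2x2 rotation: the real-valued core of the unitary
    transform a lossless Mach-Zehnder interferometer applies to
    two optical modes."""
    x, y = vector
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def power(vector):
    """Total optical power: sum of squared mode amplitudes."""
    return sum(v * v for v in vector)

signal = (3.0, 4.0)
out = mzi_rotation(math.pi / 6, signal)
# A lossless interferometer redistributes but conserves total power:
print(round(power(signal), 6), round(power(out), 6))  # 25.0 25.0
```

Meshes of such 2x2 elements compose into arbitrary unitary matrices, which is how photonic tensor cores realize the linear layers of a network; the nonlinear activations between layers are what still typically require electronics.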

Key Components and Considerations

Growth in the AI infrastructure sector is driven by sustained demand for memory bandwidth and compute power, as well as software enablement, including ramping enterprise adoption, sovereign AI builds, and edge inference. Revenue growth capture for companies in this expanding market depends on several factors: baseline revenue size; evolution of market share (e.g., value capture increasing from 20-30% in traditional technology stacks to 40-50% for AI oligopolies); product diversification, with AI-focused portfolios enabling higher growth than broader semiconductor offerings; competitive positioning through technical leadership in areas such as digital signal processors (DSPs), switches, and co-packaged optics (CPO); and external constraints, including geopolitical risks such as U.S.-China export controls and tech-sovereignty initiatives. Small pure-play firms often achieve higher revenue elasticity due to their low baseline and concentrated AI exposure. Notably, global AI compute capacity has been doubling roughly every seven months, outpacing Moore's Law, with NVIDIA holding approximately 90% market share in AI accelerators despite competition from Google TPUs, Amazon Trainium, AMD, and Huawei. These factors highlight the rapid evolution of components such as memory systems and interconnects to meet the needs of scalable AI deployments.
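The comparison between the cited seven-month doubling time and Moore's Law can be made concrete with a simple doubling-time calculation (an illustrative sketch; the helper name is invented):

```python
def capacity_multiplier(months, doubling_months=7):
    """Growth factor implied by a fixed doubling time."""
    return 2 ** (months / doubling_months)

# Two years at the 7-month doubling time cited for AI compute:
print(round(capacity_multiplier(24), 1))       # 10.8
# Two years at a classic ~24-month Moore's Law doubling:
print(round(capacity_multiplier(24, 24), 1))   # 2.0
```

The roughly fivefold gap after only two years is why memory systems and interconnects, not just compute dies, are under pressure to scale.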

Memory Systems

Memory systems in AI hardware are designed to handle the massive data requirements of training and inference for large-scale models, prioritizing high bandwidth and low latency to minimize bottlenecks. Traditional architectures suffer from the von Neumann bottleneck, where frequent data shuttling between compute units and memory consumes significant energy and time; in-memory architectures address this by integrating processing directly within memory arrays, reducing data movement overhead. Key types include high-bandwidth memory (HBM), Graphics Double Data Rate synchronous DRAM (GDDR SDRAM), and on-chip static RAM (SRAM) paired with dynamic RAM (DRAM) hierarchies in accelerators. These systems enable the storage and rapid access of parameters in billion-scale models, with capacities scaling to hundreds of gigabytes and bandwidths exceeding several terabytes per second (TB/s). HBM, a 3D-stacked DRAM technology, provides exceptional bandwidth for AI workloads; for instance, the NVIDIA H100 GPU introduced HBM3 in 2022, delivering 3 TB/s of bandwidth with up to 80 GB capacity, supporting efficient training of large language models. GDDR, optimized for GPUs, offers a cost-effective alternative with high throughput; GDDR6X variants achieve bandwidths up to 1 TB/s per module, making them suitable for AI inference in consumer and mid-range servers where HBM's premium cost is prohibitive. In AI accelerators, SRAM serves as fast on-chip cache for immediate data access during computations like matrix multiplications, while DRAM provides larger off-chip storage; this hierarchy ensures low-latency access for tensor operations, with SRAM densities enabling up to several megabytes per accelerator die. AI-specific optimizations focus on mitigating data movement, which can account for up to 90% of energy consumption in training large models due to repeated parameter loading.
Processing-in-Memory (PIM) integrates compute logic into memory chips, as exemplified by Samsung's Aquabolt-XL HBM2-PIM announced in 2021, which embeds accelerators in DRAM stacks to perform operations like vector additions in situ, improving energy efficiency by 2-3x for bandwidth-bound AI tasks. In-memory computing further reduces the von Neumann bottleneck by executing multiply-accumulate operations within memory cells, potentially cutting data transfer energy by orders of magnitude. 3D-stacked memory architectures enhance these efforts by vertically integrating logic and DRAM layers, boosting density and bandwidth; recent advancements, such as Micron's HBM3E 12-high stacks, deliver over 1.2 TB/s bandwidth with 36 GB capacity, enabling seamless handling of trillion-parameter models in AI servers. Challenges persist in scaling capacity for ever-larger models; for example, a 70-billion-parameter model in FP16 precision requires around 140 GB of memory, pushing systems toward multi-TB configurations. Non-volatile options like Intel's Optane persistent memory, which offered byte-addressable storage for maintaining AI model states across power cycles, influenced designs before its discontinuation in 2022 due to market challenges, with last shipments in late 2023. Micron's HBM integrations in AI servers, such as those powering NVIDIA's accelerator platforms, demonstrate practical scaling, with HBM3E providing 1.5x higher capacity than prior generations to support inference on models exceeding 100 billion parameters. Overall, these memory advancements, often linked via high-speed interconnects for multi-chip systems, are crucial for sustaining AI hardware performance amid exponential data growth.
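The weight-memory arithmetic behind the 140 GB figure above is straightforward to reproduce (an illustrative helper that counts only the weights, ignoring activations, optimizer state, and KV caches, all of which add substantially more):

```python
def model_memory_gb(params_billion, bytes_per_param):
    """GB needed just to hold a model's weights."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# A 70-billion-parameter model in FP16 (2 bytes per parameter):
print(model_memory_gb(70, 2))   # 140.0
# The same model quantized to 1 byte per parameter (INT8 or FP8):
print(model_memory_gb(70, 1))   # 70.0
```

Since a single H100 carries 80 GB of HBM, even the quantized model barely fits on one device, which is why multi-chip memory pooling and high-speed interconnects dominate system design.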

Interconnects and Networking

Interconnects and networking form the backbone of scalable AI hardware systems, facilitating high-speed data transfer between processing units, memory, and nodes to support the massive parallelism required in AI workloads such as distributed training and inference. These technologies span from on-chip networks that enable efficient communication within multi-core AI accelerators to data-center-scale fabrics that connect thousands of GPUs or TPUs, minimizing bottlenecks in data movement that can otherwise limit overall system performance. In AI contexts, low-latency and high-bandwidth interconnects are critical for operations like collective communications, where delays in gradient exchange across nodes can significantly extend training times for large models. On-chip interconnects, such as Network-on-Chip (NoC) architectures, manage intra-chip data flows in multi-core AI processors by routing traffic between cores, caches, and accelerators via packet-switched networks, reducing contention and improving throughput compared to traditional bus-based designs. A prominent example is NVIDIA's NVLink, introduced in 2016 with the Pascal architecture, which provides up to 160 GB/s bidirectional bandwidth in multi-GPU configurations, enabling direct GPU-to-GPU communication that bypasses slower system buses for faster model parallelism. This high-speed linking supports efficient scaling within a single node, such as in DGX systems, where NVLink aggregates bandwidth across multiple GPUs to handle tensor operations with minimal overhead. At the chip-to-chip level, standards like PCIe 5.0 (released in 2021) and PCIe 6.0 (finalized in 2022) deliver aggregate bidirectional bandwidths of up to 128 GB/s and 256 GB/s for x16 configurations, respectively, using advanced signaling to connect AI accelerators to host CPUs and storage in clustered setups.
Complementing these, Compute Express Link (CXL), announced in 2019, enables cache-coherent memory pooling across devices in AI clusters, allowing dynamic allocation of memory resources to reduce duplication and support disaggregated computing for large-scale training. These interconnects integrate with memory systems to route data efficiently, ensuring accelerators access pooled resources without coherence stalls that could degrade training efficiency. Data-center-scale networking relies on fabrics like InfiniBand and Ethernet with remote direct memory access (RDMA) to interconnect nodes for distributed AI training. NVIDIA's Quantum-2 InfiniBand platform, launched in 2023, achieves 400 Gb/s per port, supporting in-network computing primitives that offload collective operations to the network, thereby accelerating multi-node workflows in hyperscale environments. Similarly, RDMA over Converged Ethernet (RoCE) enables low-overhead data transfers in Ethernet-based clusters, as deployed by Meta for scaling AI training across thousands of GPUs with reduced CPU involvement and near-linear performance gains. In AI applications, these technologies reduce latency in all-reduce operations, which are essential for gradient synchronization in distributed training, by up to 50% compared to standard Ethernet, allowing models with trillions of parameters to train in hours rather than days. They also enhance power efficiency in hyperscale setups by minimizing idle times and optimizing data paths, potentially cutting energy use by 20-30% in large clusters through reduced retransmissions and lower protocol overhead. Emerging optical interconnects, particularly silicon photonics, promise to address bandwidth and power limitations in exascale AI systems by transmitting data via light over waveguides, achieving terabit-per-second speeds with lower attenuation than electrical links. Prototypes in the 2020s, such as co-packaged optics, have demonstrated 800 Gb/s ports with up to 70% lower power consumption compared to traditional pluggable optics.
These advancements are pivotal for sustaining Moore's Law-like scaling in AI hardware, where electrical interconnects increasingly bottleneck performance at exascale.
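The communication cost of the all-reduce operations discussed above can be estimated with the standard ring all-reduce formula (an illustrative sketch; the helper name is invented and link rates are idealized):

```python
def ring_allreduce_bytes(gradient_bytes, num_nodes):
    """Per-node traffic for a bandwidth-optimal ring all-reduce:
    each node sends 2 * (N - 1) / N times the gradient buffer size
    (a reduce-scatter phase plus an all-gather phase)."""
    return 2 * (num_nodes - 1) / num_nodes * gradient_bytes

# Synchronizing 1 GiB of gradients across 8 accelerators:
per_node = ring_allreduce_bytes(2**30, 8)
print(per_node / 2**30)           # 1.75 (GiB sent per node)
# Ideal transfer time at a 400 Gb/s (= 50 GB/s) port:
print(round(per_node / 50e9, 4))  # 0.0376 (seconds)
```

Because the per-node traffic approaches twice the buffer size regardless of cluster scale, link bandwidth, not node count, sets the floor on synchronization time, which is what in-network reduction offloads aim to lower further.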

References
