Hardware for artificial intelligence
Specialized computer hardware is often used to execute artificial intelligence (AI) programs faster and with less energy; examples include Lisp machines, neuromorphic engineering, event cameras, and physical neural networks. Since 2017, several consumer-grade CPUs and SoCs have included on-die NPUs. As of 2023, the market for AI hardware is dominated by GPUs.[1]
As of the 2020s, AI computation is dominated by graphics processing units (GPUs) and newer domain-specific accelerators such as Google’s Tensor Processing Units (TPUs), AMD’s Instinct MI300 series, and various on-device neural-processing units (NPUs) found in consumer hardware.[2][3]
Scope
For the purposes of this article, AI hardware refers to computing components and systems specifically designed or optimized to accelerate artificial-intelligence workloads such as machine-learning training or inference. This includes general-purpose accelerators used for AI (for example, GPUs) and domain-specific accelerators (for example, TPUs, NPUs, and other AI ASICs).[4]
Event-based cameras are sometimes discussed in the context of neuromorphic computing, but they are input sensors rather than AI compute devices. Conversely, components such as memristors are basic circuit elements rather than specialized AI hardware when considered alone.[5][6]
Lisp machines
Lisp machines were developed in the late 1970s and early 1980s to make artificial intelligence programs written in the programming language Lisp run faster.
Dataflow architecture
Dataflow architecture processors used for AI take varied forms, such as the polymorphic dataflow[7] Convolution Engine[8] by Kinara (formerly Deep Vision), structure-driven dataflow by Hailo,[9] and dataflow scheduling by Cerebras.[10]
Component hardware
AI accelerators
Since the 2010s, advances in computer hardware have led to more efficient methods for training deep neural networks that contain many layers of non-linear hidden units and a very large output layer.[11] By 2019, graphics processing units (GPUs), often with AI-specific enhancements, had displaced central processing units (CPUs) as the dominant means to train large-scale commercial cloud AI.[12] OpenAI estimated the hardware compute used in the largest deep learning projects from AlexNet (2012) to AlphaZero (2017), and found a 300,000-fold increase in the amount of compute needed, with a doubling-time trend of 3.4 months.[13][14]
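The two figures are consistent with each other, as a quick back-of-the-envelope check shows (a sketch; the 5.2-year span between AlexNet and AlphaZero is an approximation, not a number from the cited report):

```python
import math

DOUBLING_MONTHS = 3.4  # doubling time reported in the OpenAI analysis

def compute_growth(months: float) -> float:
    """Total growth factor after `months` at a fixed doubling time."""
    return 2 ** (months / DOUBLING_MONTHS)

# AlexNet (mid-2012) to AlphaZero (late 2017) is roughly 5.2 years.
factor = compute_growth(5.2 * 12)
print(f"{factor:,.0f}x")  # a factor in the low hundreds of thousands

# Inverse check: how long does a 300,000x increase take at this rate?
months_needed = DOUBLING_MONTHS * math.log2(300_000)
print(f"{months_needed:.1f} months")  # 61.9 months, about 5.2 years
```

So a 3.4-month doubling time sustained for just over five years compounds to roughly the 300,000-fold increase the estimate reports.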
General-purpose GPUs for AI
Since the 2010s, graphics processing units (GPUs) have been widely used to train and deploy deep learning models because of their highly parallel architecture and high memory bandwidth. Modern data-center GPUs include dedicated tensor or matrix-math units that accelerate neural-network operations.
In 2022, NVIDIA introduced the Hopper-generation H100 GPU, adding FP8 precision support and faster interconnects for large-scale model training.[15] AMD and other vendors have also developed GPUs and accelerators aimed at AI and high-performance computing workloads.[16]
Domain-specific accelerators (ASICs / NPUs)
Beyond general-purpose GPUs, several companies have developed application-specific integrated circuits (ASICs) and neural processing units (NPUs) tailored for AI workloads. Google introduced the Tensor Processing Unit (TPU) in 2016 for deep-learning inference, with later generations supporting large-scale training through dense systolic-array designs and optical interconnects.[17] Other vendors have released similar devices—such as Apple’s Neural Engine and various on-device NPUs—that emphasize energy-efficient inference in mobile or edge computing environments.[18]
Memory and interconnects
AI accelerators rely on fast memory and inter-chip links to manage the large data volumes of training and inference. High-bandwidth memory (HBM) stacks, standardized as HBM3 in 2023, provide terabytes-per-second throughput on modern GPUs and ASICs.[19] These accelerators are often connected through dedicated fabrics such as NVIDIA’s NVLink and NVSwitch or optical interconnects used in TPU systems to scale performance across thousands of chips.[20]
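Why memory bandwidth matters can be shown with rough arithmetic (a sketch with illustrative numbers, not figures from the cited sources): when autoregressive inference is bandwidth-bound, every generated token must stream the model weights from memory at least once, so throughput is capped by bandwidth divided by model size.

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, params_billion: float,
                          bytes_per_param: float) -> float:
    """Rough upper bound on decode speed for a memory-bandwidth-bound
    model: each token requires streaming all weights once."""
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb

# Illustrative numbers: ~3,350 GB/s of HBM3 bandwidth, a 7B-parameter
# model at FP16 (2 bytes per parameter).
print(f"{decode_tokens_per_sec(3350, 7, 2):.0f} tokens/s")   # ~239
# 4-bit quantization (0.5 bytes/param) quarters the traffic per token:
print(f"{decode_tokens_per_sec(3350, 7, 0.5):.0f} tokens/s")  # ~957
```

This is why HBM capacity and bandwidth, not just peak FLOPS, dominate accelerator design for large-model inference.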
Sources
- ^ "Nvidia: The chip maker that became an AI superpower". BBC News. 25 May 2023. Retrieved 18 June 2023.
- ^ "NVIDIA H100 Tensor Core GPU Architecture Whitepaper". NVIDIA. 2022. Retrieved 4 November 2025.
- ^ "Google Cloud TPU v5 Announcement". Google Cloud Blog. 2023. Retrieved 4 November 2025.
- ^ Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017). "Efficient Processing of Deep Neural Networks: A Tutorial and Survey". Proceedings of the IEEE. 105 (12): 2295–2329. doi:10.1109/JPROC.2017.2761740. Retrieved 4 November 2025.
- ^ Gallego, Guillermo (2022). "Event-based Vision: A Survey" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2020.3008413. Retrieved 4 November 2025.
- ^ Strukov, D. B.; Snider, G. S.; Stewart, D. R.; Williams, R. S. (2008). "The Missing Memristor Found". Nature. 453: 80–83. doi:10.1038/nature06932. Retrieved 4 November 2025.
- ^ Maxfield, Max (24 December 2020). "Say Hello to Deep Vision's Polymorphic Dataflow Architecture". Electronic Engineering Journal. Techfocus media.
- ^ "Kinara (formerly Deep Vision)". Kinara. 2022. Retrieved 2022-12-11.
- ^ "Hailo". Hailo. Retrieved 2022-12-11.
- ^ Lie, Sean (29 August 2022). Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning. Cerebras (Report). Archived from the original on 15 March 2024. Retrieved 13 December 2022.
- ^ Research, AI (23 October 2015). "Deep Neural Networks for Acoustic Modeling in Speech Recognition". AIresearch.com. Retrieved 23 October 2015.
- ^ Kobielus, James (27 November 2019). "GPUs Continue to Dominate the AI Accelerator Market for Now". InformationWeek. Retrieved 11 June 2020.
- ^ Ray, Tiernan (2019). "AI is changing the entire nature of compute". ZDNet. Retrieved 11 June 2020.
- ^ "AI and Compute". OpenAI. 16 May 2018. Retrieved 11 June 2020.
- ^ "NVIDIA H100 Tensor Core GPU Architecture". NVIDIA. 2022. Retrieved 4 November 2025.
- ^ "AMD Instinct MI300X Accelerator". AMD. 2024. Retrieved 4 November 2025.
- ^ "Introducing Cloud TPU v5p and the AI Hypercomputer". Google Cloud Blog. 6 December 2023. Retrieved 4 November 2025.
- ^ "Apple Neural Engine". Apple Machine Learning Research. Retrieved 4 November 2025.
- ^ "JESD238A: High Bandwidth Memory (HBM3) Standard". JEDEC. January 2023. Retrieved 4 November 2025.
- ^ "NVIDIA Hopper Architecture In-Depth". NVIDIA Developer Blog. 22 March 2022. Retrieved 4 November 2025.
Historical Developments
Lisp Machines
Lisp machines were general-purpose computers specifically designed to efficiently execute Lisp programs, emerging as early specialized hardware for artificial intelligence and symbolic computation in the 1970s and 1980s. The concept originated from a 1973 proposal by Peter Deutsch at the MIT Artificial Intelligence Laboratory, where Lisp had been developed in the late 1950s and early 1960s, with the Lisp Machine Project initiated by Richard Greenblatt in 1974. This project produced prototypes like the CONS machine in 1975, followed by the influential CADR machine around 1977–1980, which offered significant performance improvements over general-purpose hardware like the PDP-10 and served as the foundation for commercial efforts. By the early 1980s, former MIT researchers founded companies such as Symbolics in 1980 and Lisp Machines Incorporated (LMI) in 1979, commercializing these designs to support AI research and development.[14][15][16]

Key architectural features of Lisp machines were tailored to Lisp's demands for dynamic memory management and list processing, including tagged memory architectures that embedded type information directly in words for rapid type checking and dispatch. Hardware implementations often incorporated microcode to accelerate Lisp primitives such as cons, car, and cdr operations, while dedicated support for garbage collection—via methods like reference counting or mark-and-sweep—minimized pauses in symbolic computation. Virtual memory systems were optimized for handling large, fragmented data structures common in AI applications, and many models featured high-resolution bitmapped displays with early graphical user interfaces to aid interactive programming.
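The tagged-word scheme can be illustrated with a minimal software sketch (illustrative only — real Lisp machines implemented the tag check in hardware, so it cost no extra instructions):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Word:
    """A memory word carrying a type tag alongside its payload, as
    Lisp machines embedded type bits in every word."""
    tag: str   # e.g. "fixnum", "cons", "nil"
    data: Any

NIL = Word("nil", None)

def cons(head: Word, tail: Word) -> Word:
    return Word("cons", (head, tail))

def car(w: Word) -> Word:
    if w.tag != "cons":  # the hardware type dispatch, done in software here
        raise TypeError(f"car of non-cons: {w.tag}")
    return w.data[0]

def cdr(w: Word) -> Word:
    if w.tag != "cons":
        raise TypeError(f"cdr of non-cons: {w.tag}")
    return w.data[1]

# The list (1 2 3) as a chain of cons cells:
lst = cons(Word("fixnum", 1),
           cons(Word("fixnum", 2),
                cons(Word("fixnum", 3), NIL)))
print(car(lst).data)       # 1
print(car(cdr(lst)).data)  # 2
```

Because every word self-describes its type, operations like car can dispatch or trap immediately, which is what made dynamically typed Lisp competitive in speed on this hardware.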
These optimizations, building on earlier influences like the SECD machine model from the 1960s, enabled Lisp execution speeds that were significantly faster than on general-purpose hardware of the era, such as the PDP-10.[15][14][17]

Notable models included the MIT CADR, which influenced subsequent commercial machines like Symbolics' LM-2 (1980) and the more advanced 3600 series introduced in 1983, capable of handling complex AI workloads with up to 4 MB of memory expandable to 32 MB. LMI's Lambda machine (1983) and Texas Instruments' Explorer series (starting 1983) also drew from CADR designs, offering similar performance for around $70,000–$125,000 per unit and finding adoption in research labs for tasks like symbolic manipulation. Symbolics machines, in particular, peaked in sales with over $100 million in revenue by 1986, underscoring their role in equipping AI researchers with powerful tools.[15][16][17]

The decline of Lisp machines began in the late 1980s due to the commoditization of high-performance workstations like Sun Microsystems' models, which offered comparable or superior capabilities for Lisp via software emulation at a fraction of the cost—around $14,000 versus $100,000 for Lisp machines. The AI winter that followed reduced funding, such as the end of DARPA's Strategic Computing Initiative, further eroding demand as expert systems and symbolic AI shifted toward more portable implementations on general-purpose hardware. By the early 1990s, companies like Symbolics faced bankruptcy, marking the end of dedicated Lisp hardware production.[16][17]

Despite their short commercial lifespan, Lisp machines profoundly impacted AI by enabling efficient development of symbolic systems during the 1970s and 1980s, including expert systems like those built with tools such as Macsyma and early frames-based reasoning. They facilitated rapid prototyping at institutions like MIT, where they supported vision, robotics, and natural language processing research, and influenced the standardization of Common Lisp in 1984. This hardware specialization accelerated advancements in declarative and functional programming paradigms central to early AI, even as later numerical approaches dominated.[14][15]

Dataflow Architectures
Dataflow architectures represent a paradigm shift from traditional von Neumann models, where execution is driven by the availability of data operands rather than a sequential control flow dictated by a program counter. In this model, computations are represented as directed graphs in which nodes denote operations and edges indicate data dependencies; an operation fires only when all required inputs are present, enabling inherent parallelism without explicit synchronization. This concept originated from the work of Jack Dennis at MIT in the early 1970s, with foundational ideas outlined in a 1975 paper proposing a basic data-flow processor that emphasized demand-driven evaluation to exploit concurrency in scientific and symbolic computations.

Key implementations in the 1980s demonstrated both static and dynamic variants of dataflow models. Static dataflow architectures, as pioneered by Dennis's group, restrict each data arc to a single token at a time, simplifying hardware but limiting concurrency for recursive or iterative tasks. In contrast, dynamic dataflow models, such as those in MIT's Tagged Token Dataflow Architecture (TTDA) and the Manchester Dataflow Machine, use unique tags on tokens to allow multiple instances per arc, supporting higher parallelism at the cost of increased overhead for tag management. The TTDA, developed by Arvind and colleagues, employed a multiprocessor design with actors executing functional code on tagged tokens, while the Manchester machine featured a prototype with 32-bit microprocessors and dynamic tagging for general-purpose parallel processing, operational since 1981.[18][19][20]

At the hardware level, dataflow machines incorporate specialized components like token matching units—often content-addressable memories (CAMs)—that store incoming tokens and pair operands with matching destination and iteration tags before dispatching them to execution units.
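The fire-when-inputs-ready rule can be sketched in a few lines of software (a toy interpreter for a static dataflow graph, not a model of any particular machine):

```python
# A dataflow graph for out = x*y + y*z. Nodes fire as soon as all of
# their input tokens have arrived -- no program counter orders them.
graph = {
    # node name: (operation, input arcs, output arc)
    "mul1": (lambda a, b: a * b, ["x", "y"], "t1"),
    "mul2": (lambda a, b: a * b, ["y", "z"], "t2"),
    "add":  (lambda a, b: a + b, ["t1", "t2"], "out"),
}

def run(inputs: dict) -> dict:
    """Repeatedly fire any node whose operands are all present."""
    tokens = dict(inputs)
    fired = set()
    progress = True
    while progress:
        progress = False
        for name, (op, ins, out) in graph.items():
            if name not in fired and all(arc in tokens for arc in ins):
                tokens[out] = op(*(tokens[arc] for arc in ins))
                fired.add(name)
                progress = True
    return tokens

result = run({"x": 2, "y": 3, "z": 4})
print(result["out"])  # 2*3 + 3*4 = 18
```

Note that mul1 and mul2 become fireable simultaneously — the parallelism is implicit in the graph, which is exactly what the token-matching hardware exploited.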
Communication occurs via packet-switched networks that route tokens asynchronously between processing elements, eliminating the need for a global clock and allowing fine-grained parallelism without von Neumann bottlenecks. These designs, typically comprising arrays of simple processors connected through switching fabrics, prioritize data movement to enable massive concurrency in graph-based computations.[21][22]

In early AI applications, dataflow architectures facilitated parallel evaluation of logic and functional languages, particularly Prolog variants for nondeterministic search and theorem proving. The Manchester machine, for instance, supported a dataflow implementation of Prolog-like logic programming, where backtracking and unification were modeled as token flows, accelerating AI tasks like expert systems and automated reasoning by distributing search spaces across nodes. Similarly, TTDA's support for functional languages enabled parallel reduction in lambda expressions, aiding symbolic AI computations such as pattern matching and planning algorithms. These systems demonstrated potential for AI workloads with irregular parallelism, though adoption remained limited to research prototypes.[23][24]

Despite their theoretical appeal, dataflow architectures faced scalability challenges due to synchronization overhead in token matching and network contention, which degraded performance as the number of processing elements increased beyond dozens. Tag resolution and storage demands also imposed memory penalties, making large-scale implementations inefficient compared to emerging alternatives. Consequently, while direct successors waned by the late 1980s, dataflow principles influenced modern designs like systolic arrays, which adopt structured data flows for matrix operations in AI accelerators, retaining the operand-driven execution but with fixed topologies to mitigate overhead.[21][25]

General-Purpose Hardware
Central Processing Units
Central Processing Units (CPUs) form the foundational general-purpose hardware for artificial intelligence (AI) workloads, evolving from single-core designs in the 1990s to multi-core architectures that support parallel processing through software optimizations and hardware extensions tailored for vectorized and matrix-based computations. In the 1990s, x86 processors like Intel's Pentium series operated primarily as single-core systems focused on sequential scalar instructions, which limited their ability to handle the emerging parallel demands of early AI algorithms such as neural network training.[26] By the 2000s, the shift to multi-core designs, exemplified by Intel's Core 2 series introduced in 2006, enabled concurrent execution of AI tasks, allowing libraries and frameworks to distribute workloads across cores for improved efficiency in small-scale model training and inference. ARM-based processors, which gained traction for AI in the 2010s due to their energy efficiency, further expanded CPU applicability to edge devices, where power constraints are critical.[27]

Key features enhancing CPU suitability for AI include Single Instruction Multiple Data (SIMD) instruction sets, which facilitate vectorized operations on multiple data elements simultaneously.
Early SIMD extensions like Streaming SIMD Extensions (SSE) debuted in Intel processors in 1999, followed by Advanced Vector Extensions (AVX) in 2011, and culminating in AVX-512 in 2016, whose 512-bit registers can, with fused multiply-add, sustain up to 64 single-precision floating-point operations per cycle per core.[28] These extensions, particularly AVX-512's Vector Neural Network Instructions (VNNI), accelerate deep learning primitives such as convolutions and matrix multiplications by reducing instruction overhead.[29] CPU cache hierarchies have also been refined with larger L3 caches and prefetching mechanisms to minimize data movement latency during matrix operations, a common bottleneck in AI. Complementing these hardware advances, software libraries like OpenBLAS provide optimized implementations of Basic Linear Algebra Subprograms (BLAS), leveraging multi-core parallelism and SIMD to execute AI building blocks such as general matrix multiplication (GEMM) on CPUs.[30]

CPUs play a vital role in AI inference on resource-constrained edge devices, where their low-latency sequential processing suits real-time applications, and in training compact models that do not require massive parallelism; they often operate in hybrid configurations, handling data preprocessing and control flow alongside specialized accelerators such as GPUs.
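Per-core SIMD throughput figures like these follow from simple peak-rate arithmetic. A sketch with illustrative core counts, clocks, and unit counts (not specifications of any particular processor):

```python
def peak_gflops(cores: int, ghz: float, simd_lanes: int,
                fma_units: int = 2, ops_per_fma: int = 2) -> float:
    """Theoretical peak throughput:
    cores x clock x SIMD lanes x FMA units x 2 ops per fused multiply-add."""
    return cores * ghz * simd_lanes * fma_units * ops_per_fma

# Illustrative: 32 cores at 2.5 GHz with AVX-512 on FP32
# (512 bits / 32 bits = 16 lanes) and two FMA units per core.
print(f"{peak_gflops(32, 2.5, 16):,.0f} GFLOPS")  # 5,120 GFLOPS FP32

# Halving the element width to FP16 doubles the lane count (32 lanes):
print(f"{peak_gflops(32, 2.5, 32):,.0f} GFLOPS")  # 10,240 GFLOPS
```

Real sustained throughput is lower — memory bandwidth, frequency throttling under heavy vector loads, and non-FMA instructions all cut into the peak — which is why the BLAS libraries mentioned above matter so much.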
High-core count CPUs with large caches, such as those featuring AMD's 3D V-Cache technology, improve data preparation and AI simulations by enhancing cache-heavy operations and multitasking efficiency.[31] In GPU-accelerated AI workloads, CPUs primarily manage data loading, preprocessing, orchestration, and system tasks; a weak CPU can cause minor bottlenecks in data pipelines but typically does not drastically limit high-end GPU performance in most consumer or local AI setups.[32] For running smaller AI models locally, a powerful CPU on a standard laptop is sufficient, though performance is significantly slower than on GPUs.[33][34]

Performance metrics highlight this niche: a single modern CPU core with AVX-512 can deliver up to 2 TFLOPS in FP16 for AI workloads, scaling to tens of TFLOPS across multi-core systems, as demonstrated in benchmarks for inference tasks like image classification.[35] Despite these capabilities, CPUs face limitations in AI scalability, offering lower parallelism than GPUs—typically 10-100x fewer cores optimized for independent threads—and higher power consumption per operation for dense tensor computations, often exceeding 10-20 pJ per operation compared to GPU efficiencies.[36] GPUs complement CPUs by handling high-throughput parallel tasks in training pipelines.[36]

Contemporary examples underscore CPU adaptations for AI: AMD's EPYC processors, such as the 9005 series, target server-side inference with up to 192 cores, 12-channel DDR5 memory support, and AVX-512, delivering up to 37% higher AI throughput per generation for diverse model sizes.[37] Similarly, Apple's M-series chips, starting with the M1 in 2020, integrate ARM-based multi-core CPUs with a dedicated Neural Engine co-processor on a unified system-on-chip, enabling efficient on-device AI inference—such as in Siri and image processing—while the CPU manages general tasks, achieving up to 11 TOPS total for the system with low power draw.[38]

Graphics Processing Units
Graphics processing units (GPUs) were originally designed for rendering complex graphics in video games and simulations, but their architecture proved highly suitable for accelerating artificial intelligence workloads due to its emphasis on parallel processing. NVIDIA's GeForce 256, released in 1999, marked the first GPU with dedicated hardware for 3D transformations and lighting, laying the groundwork for parallel computation beyond graphics. This evolved significantly with the introduction of CUDA in 2006, a parallel computing platform that enabled general-purpose computing on GPUs (GPGPU), allowing developers to leverage GPU power for non-graphics tasks like scientific simulations and, later, AI model training.

At their core, modern GPUs feature thousands of smaller cores organized in single instruction, multiple data (SIMD) arrays, enabling massive parallelism for operations such as matrix multiplications central to neural networks. NVIDIA's Volta architecture, launched in 2017, introduced tensor cores—specialized hardware units optimized for mixed-precision computations in deep learning, accelerating operations like matrix multiply-accumulate in formats such as FP16 and INT8. Building on this, the Ampere architecture in 2020 incorporated unified memory models, allowing seamless data sharing between CPU and GPU without explicit transfers, which reduces latency in AI pipelines.

In AI applications, GPUs have become dominant for training deep learning models, powering frameworks like TensorFlow and PyTorch that abstract GPU kernels for efficient parallel execution of backpropagation and convolutions.
For running local AI models, older generations of GPUs serve as accessible options for inference, particularly with quantized models that enable efficient performance on limited hardware, while single-board computers provide platforms for edge inference. GPUs lead in inference workloads due to their parallel processing capabilities. NVIDIA is the primary market leader, holding approximately 85% of the overall AI accelerator market (including GPUs and other accelerator types) in Q2 2025, about 92% of the discrete GPU market, and over 80% of the broader AI hardware market; competitors such as AMD (around 2% of AI accelerators) and Intel hold far smaller shares, while custom ASICs (e.g., Broadcom, at around 10%) compete in specific inference segments.[39][40] An NVIDIA GPU with at least 8-16 GB of VRAM is recommended for efficient performance with small to medium-sized models, though models with 70B+ parameters typically require 24 GB+ of VRAM or multiple GPUs at full precision.[41][42] For instance, high-end GPUs deliver peak performance exceeding 300 TFLOPS in FP16 precision, enabling faster training of large models compared to traditional CPUs. This parallelism is particularly effective for the matrix-heavy computations in neural networks, and GPUs are often integrated with CPUs that handle sequential tasks in hybrid systems.

Key advancements have further tailored GPUs for AI scalability in data centers. NVIDIA's A100 GPU, released in 2020, combines tensor cores with high-bandwidth memory (HBM2e) to support multi-terabyte-scale models, achieving up to 624 TOPS in INT8 for inference workloads (dense tensor operations).[43] Additionally, multi-instance GPU (MIG) technology, introduced in the same architecture, partitions a single GPU into isolated instances for concurrent workloads, improving resource utilization in cloud environments.
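The VRAM guidelines above follow directly from weight-storage arithmetic. A sketch (weights only; activations and KV cache add further overhead):

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory needed just to hold the model weights."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for params in (7, 13, 70):
    fp16 = weight_memory_gb(params, 16)
    q4 = weight_memory_gb(params, 4)
    print(f"{params}B params: {fp16:.0f} GB at FP16, {q4:.1f} GB at 4-bit")
# 7B:  14 GB at FP16,  3.5 GB at 4-bit -> fits an 8-16 GB card once quantized
# 70B: 140 GB at FP16, 35.0 GB at 4-bit -> needs 24 GB+ cards or multi-GPU
```

This is why quantization is the standard route to running larger models on consumer GPUs: cutting bits per parameter shrinks the weight footprint proportionally.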
Subsequent architectures, such as Hopper with the H100 GPU released in 2022, deliver up to 989 TFLOPS in FP16 tensor performance, while the Blackwell architecture, launched in 2024 with the B200 GPU, achieves up to 20 petaFLOPS of AI performance, enhancing efficiency for large-scale training and inference as of 2025.[44][45]

Despite these strengths, GPUs face challenges in AI hardware, including memory bandwidth limitations that can bottleneck data movement for very large models, necessitating techniques like model parallelism. Programming complexity also persists, as developers must write custom CUDA kernels to optimize performance, which requires expertise in low-level parallel programming.

Specialized AI Accelerators
Tensor Processing Units
Tensor Processing Units (TPUs) are custom-designed application-specific integrated circuits (ASICs) developed by Google to accelerate machine learning workloads, particularly those involving tensor operations such as matrix multiplications in neural networks. Optimized for high throughput and energy efficiency, TPUs integrate seamlessly with frameworks like TensorFlow via the XLA compiler, which translates high-level computations into low-level instructions for the hardware. This specialization enables low-latency inference and scalable training, surpassing general-purpose processors in AI-specific tasks.

Google began deploying TPUs internally in 2015 to handle the growing demands of its AI services, such as image recognition in Photos and neural machine translation. The first generation focused on inference and was publicly announced in May 2016, with Cloud TPU availability through Google Cloud following in early 2018. Subsequent generations expanded to training capabilities, with TPUs powering over 100,000 units across Google's data centers by the late 2010s.

At the core of TPU architecture is a systolic array design, which efficiently performs dense matrix multiplications by streaming data through a grid of processing elements, minimizing memory accesses and power consumption. For instance, the TPU v1 features a 256×256 systolic array capable of 92 tera-operations per second (TOPS) in 8-bit integer precision. Later versions build on this with enhancements like liquid cooling, optical interconnects, sparsity support, 3D-stacked high-bandwidth memory (HBM3), and up to 4x improved interconnect bandwidth (9,216 Gb/s) to handle larger models and pods scaling to 8,960 chips.
As of November 2025, recent generations include v5e (efficiency-focused, 197 TFLOPS BF16 per chip), v5p (performance variant, 459 TFLOPS BF16), v6e/Trillium (optimized for large language models, 926 TFLOPS BF16), and v7 Ironwood (inference-optimized with FP8 support, approximately 4,614 TFLOPS FP8 per chip).[46]

Key TPU versions include:

| Version | Release Year | Key Features |
|---|---|---|
| TPU v1 | 2015 (internal), 2016 (announced) | Inference-focused; 92 TOPS (INT8); systolic array for matrix ops.[47] |
| TPU v2 | 2017 | Added training support; 180 teraFLOPS (BF16); first pods with 256 chips.[48] |
| TPU v3 | 2018 | Pod-scale with liquid cooling; 420 teraFLOPS (BF16); 8x faster than v2.[49] |
| TPU v4 | 2021 | Enhanced sparsity and optical switches; 275 teraFLOPS (BF16/INT8); 32 GiB HBM2 memory.[50] |
| TPU v5e | 2023 | Efficiency variant; 197 teraFLOPS (BF16), 393 TOPS (INT8); 16 GiB HBM; improved perf/watt. |
| TPU v5p | 2024 | Performance variant; 459 teraFLOPS (BF16); HBM3 memory; 2x v5e throughput. |
| TPU v6e (Trillium) | 2024 | LLM-optimized; 926 teraFLOPS (BF16); advanced sparsity; pods up to 8,960 chips. |
| TPU v7 (Ironwood) | 2025 (announced) | Inference-focused; ~4,614 teraFLOPS (FP8); 192 GiB HBM; real-time AI support.[51] |
| Edge TPU | 2019 | Mobile/edge inference; 4 TOPS (INT8); integrated in Coral devices for IoT. |
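The systolic-array dataflow at the heart of these chips can be sketched in software (an illustrative toy model, not Google's implementation): each processing element (i, j) holds one output accumulator, and operands reach it skewed by one cycle per row and column as they stream through the grid.

```python
def systolic_matmul(A, B):
    """Output-stationary systolic array computing C = A @ B.
    PE (i, j) accumulates C[i][j] as row i of A streams in from the
    left and column j of B streams in from the top, each delayed by
    one cycle per row/column (the 'skew')."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0] * n for _ in range(m)]
    # Total cycles: the (m-1) + (n-1) skew plus k reduction steps.
    for t in range(m + n + k - 2):
        for i in range(m):
            for j in range(n):
                step = t - i - j  # which operand pair reaches PE(i,j) now
                if 0 <= step < k:
                    C[i][j] += A[i][step] * B[step][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```

Each operand is read from memory once and then handed from neighbor to neighbor, which is why a 256×256 array can sustain tens of thousands of multiply-accumulates per cycle without a matching increase in memory traffic.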
