Graphics processing unit
from Wikipedia
Components of a GPU

A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a component on a discrete graphics card or embedded on motherboards, mobile phones, personal computers, workstations, and game consoles. GPUs were later found to be useful for non-graphic calculations involving embarrassingly parallel problems due to their parallel structure. The ability of GPUs to rapidly perform vast numbers of calculations has led to their adoption in diverse fields including artificial intelligence (AI) where they excel at handling data-intensive and computationally demanding tasks. Other non-graphical uses include the training of neural networks and cryptocurrency mining.

History

1970s

Arcade system boards have used specialized graphics circuits since the 1970s. In early video game hardware, RAM for frame buffers was expensive, so video chips composited data together as the display was being scanned out on the monitor.[1]

A specialized barrel shifter circuit helped the CPU animate the framebuffer graphics for various 1970s arcade video games from Midway and Taito, such as Gun Fight (1975), Sea Wolf (1976), and Space Invaders (1978).[2] The Namco Galaxian arcade system in 1979 used specialized graphics hardware that supported RGB color, multi-colored sprites, and tilemap backgrounds.[3] The Galaxian hardware was widely used during the golden age of arcade video games, by game companies such as Namco, Centuri, Gremlin, Irem, Konami, Midway, Nichibutsu, Sega, and Taito.[4]

Atari ANTIC microprocessor on an Atari 130XE motherboard

The Atari 2600 in 1977 used a video shifter called the Television Interface Adaptor.[5] Atari 8-bit computers (1979) had ANTIC, a video processor which interpreted instructions describing a "display list"—the way the scan lines map to specific bitmapped or character modes and where the memory is stored (so there did not need to be a contiguous frame buffer).[clarification needed][6] 6502 machine code subroutines could be triggered on scan lines by setting a bit on a display list instruction.[clarification needed][7] ANTIC also supported smooth vertical and horizontal scrolling independent of the CPU.[8]

1980s

NEC μPD7220A

The NEC μPD7220 was the first implementation of a personal computer graphics display processor as a single large-scale integration (LSI) integrated circuit chip. This enabled the design of low-cost, high-performance video graphics cards such as those from Number Nine Visual Technology. It became the best-known GPU until the mid-1980s.[9] It was the first fully integrated VLSI (very large-scale integration) metal–oxide–semiconductor (NMOS) graphics display processor for PCs, supported up to 1024×1024 resolution, and laid the foundations for the PC graphics market. It was used in a number of graphics cards and was licensed for clones such as the Intel 82720, the first of Intel's graphics processing units.[10] The Williams Electronics arcade games Robotron: 2084, Joust, Sinistar, and Bubbles, all released in 1982, contain custom blitter chips for operating on 16-color bitmaps.[11][12]

In 1984, Hitachi released the ARTC HD63484, the first major CMOS graphics processor for personal computers. The ARTC could display up to 4K resolution when in monochrome mode. It was used in a number of graphics cards and terminals during the late 1980s.[13] In 1985, the Amiga was released with a custom graphics chip including a blitter for bitmap manipulation, line drawing, and area fill. It also included a coprocessor with its own simple instruction set, that was capable of manipulating graphics hardware registers in sync with the video beam (e.g. for per-scanline palette switches, sprite multiplexing, and hardware windowing), or driving the blitter. In 1986, Texas Instruments released the TMS34010, the first fully programmable graphics processor.[14] It could run general-purpose code but also had a graphics-oriented instruction set. During 1990–1992, this chip became the basis of the Texas Instruments Graphics Architecture ("TIGA") Windows accelerator cards.

The IBM 8514 Micro Channel adapter, with memory add-on

In 1987, the IBM 8514 graphics system was released. It was one of the first video cards for IBM PC compatibles that implemented fixed-function 2D primitives in electronic hardware. Sharp's X68000, released in 1987, used a custom graphics chipset[15] with a 65,536 color palette and hardware support for sprites, scrolling, and multiple playfields.[16] It served as a development machine for Capcom's CP System arcade board. Fujitsu's FM Towns computer, released in 1989, had support for a 16,777,216 color palette.[17] In 1988, the first dedicated polygonal 3D graphics boards were introduced in arcades with the Namco System 21[18] and Taito Air System.[19]

VGA section on the motherboard in IBM PS/55

IBM introduced its proprietary Video Graphics Array (VGA) display standard in 1987, with a maximum resolution of 640×480 pixels. In November 1988, NEC Home Electronics announced its creation of the Video Electronics Standards Association (VESA) to develop and promote a Super VGA (SVGA) computer display standard as a successor to VGA. Super VGA enabled graphics display resolutions up to 800×600 pixels, a 56% increase.[20]

1990s

Tseng Labs ET4000/W32p
S3 Graphics ViRGE
Voodoo3 2000 AGP card

In 1991, S3 Graphics introduced the S3 86C911, which its designers named after the Porsche 911 as an indication of the performance increase it promised.[21] The 86C911 spawned a variety of imitators: by 1995, all major PC graphics chip makers had added 2D acceleration support to their chips.[22] Fixed-function Windows accelerators surpassed expensive general-purpose graphics coprocessors in Windows performance, and such coprocessors faded from the PC market.

In the early- and mid-1990s, real-time 3D graphics became increasingly common in arcade, computer, and console games, which led to increasing public demand for hardware-accelerated 3D graphics. Early examples of mass-market 3D graphics hardware can be found in arcade system boards such as the Sega Model 1, Namco System 22, and Sega Model 2, and the fifth-generation video game consoles such as the Saturn, PlayStation, and Nintendo 64. Arcade systems such as the Sega Model 2 and SGI Onyx-based Namco Magic Edge Hornet Simulator in 1993 were capable of hardware T&L (transform, clipping, and lighting) years before appearing in consumer graphics cards.[23][24] Another early example is the Super FX chip, a RISC-based on-cartridge graphics chip used in some SNES games, notably Doom and Star Fox. Some systems used DSPs to accelerate transformations. Fujitsu, which worked on the Sega Model 2 arcade system,[25] began working on integrating T&L into a single LSI solution for use in home computers in 1995;[26] the Fujitsu Pinolite, the first 3D geometry processor for personal computers, was announced in 1997.[27] The first hardware T&L GPU on home video game consoles was the Nintendo 64's Reality Coprocessor, released in 1996.[28] In 1997, Mitsubishi released the 3Dpro/2MP, a GPU capable of transformation and lighting, for workstations and Windows NT desktops;[29] ATi used it for its FireGL 4000 graphics card, released in 1997.[30]

The term "GPU" was coined by Sony in reference to the 32-bit Sony GPU (designed by Toshiba) in the PlayStation video game console, released in 1994.[31]

2000s

In October 2002, with the introduction of the ATI Radeon 9700 (also known as R300), the world's first Direct3D 9.0 accelerator, pixel and vertex shaders could implement looping and lengthy floating point math, and were quickly becoming as flexible as CPUs, yet orders of magnitude faster for image-array operations. Pixel shading is often used for bump mapping, which adds texture to make an object look shiny, dull, rough, or even round or extruded.[32]

With the introduction of the Nvidia GeForce 8 series and new generic stream processing units, GPUs became more generalized computing devices. Parallel GPUs are making computational inroads against the CPU, and a subfield of research, dubbed GPU computing or GPGPU for general purpose computing on GPU, has found applications in fields as diverse as machine learning,[33] oil exploration, scientific image processing, linear algebra,[34] statistics,[35] 3D reconstruction, and stock options pricing. GPGPUs were the precursors to what is now called a compute shader (e.g. CUDA, OpenCL, DirectCompute) and actually abused the hardware to a degree by treating the data passed to algorithms as texture maps and executing algorithms by drawing a triangle or quad with an appropriate pixel shader.[clarification needed] This entails some overheads since units like the scan converter are involved where they are not needed (nor are triangle manipulations even a concern—except to invoke the pixel shader).[clarification needed]

Nvidia's CUDA platform, first introduced in 2007,[36] was the earliest widely adopted programming model for GPU computing. OpenCL is an open standard defined by the Khronos Group that allows for the development of code for both GPUs and CPUs with an emphasis on portability.[37] OpenCL solutions are supported by Intel, AMD, Nvidia, and ARM, and according to a report in 2011 by Evans Data, OpenCL had become the second most popular HPC tool.[38]
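
A minimal sketch of this programming model, written against the CUDA runtime API: each GPU thread adds one pair of array elements. The kernel name, array size, and launch dimensions are illustrative choices rather than anything prescribed by the platform.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements; the GPU runs thousands of such
// threads in parallel, which is the essence of GPGPU stream processing.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                 // 1M elements (arbitrary example size)
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);          // managed memory keeps the example short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);           // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The same computation can be expressed in OpenCL or DirectCompute; the common idea is that the host enqueues a data-parallel kernel and the GPU maps it across its many cores.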

2010s

In 2010, Nvidia partnered with Audi to power their cars' dashboards, using the Tegra GPU to provide increased functionality to cars' navigation and entertainment systems.[39] Advances in GPU technology in cars helped advance self-driving technology.[40] AMD's Radeon HD 6000 series cards were released in 2010, and in 2011 AMD released its 6000M Series discrete GPUs for mobile devices.[41] Nvidia's Kepler line of graphics cards was released in 2012 and was used in the Nvidia 600 and 700 series cards. A feature of this GPU microarchitecture was GPU Boost, a technology that adjusts the clock speed of a video card up or down according to its power draw.[42] The Kepler microarchitecture was manufactured on TSMC's 28 nm process.

The PS4 and Xbox One were released in 2013; they both used GPUs based on AMD's Radeon HD 7850 and 7790.[43] Nvidia's Kepler line of GPUs was followed by the Maxwell line, manufactured on the same process. Nvidia's 28 nm chips were manufactured by TSMC in Taiwan using the 28 nm process. Compared with the older 40 nm technology, this manufacturing process allowed a 20 percent boost in performance while drawing less power.[44][45] Virtual reality headsets have high system requirements; manufacturers recommended the GTX 970 and the R9 290X or better at the time of their release.[46][47] Cards based on the Pascal microarchitecture were released in 2016. The GeForce 10 series of cards belong to this generation of graphics cards. They are made using the 16 nm manufacturing process, which improves upon previous microarchitectures.[48]

In 2018, Nvidia launched the RTX 20 series GPUs, which added ray-tracing cores to GPUs, improving their performance on lighting effects.[49] Polaris 11 and Polaris 10 GPUs from AMD are fabricated on a 14 nm process. Their release resulted in a substantial increase in the performance per watt of AMD video cards.[50] AMD also released the Vega GPU series for the high-end market as a competitor to Nvidia's high-end Pascal cards, also featuring HBM2 like the Titan V.

In 2019, AMD released the successor to their Graphics Core Next (GCN) microarchitecture/instruction set. Dubbed RDNA, the first product featuring it was the Radeon RX 5000 series of video cards.[51] The company announced that the successor to the RDNA microarchitecture would be incremental (a "refresh"). AMD unveiled the Radeon RX 6000 series, its RDNA 2 graphics cards with support for hardware-accelerated ray tracing.[52] The product series, launched in late 2020, consisted of the RX 6800, RX 6800 XT, and RX 6900 XT.[53][54] The RX 6700 XT, which is based on Navi 22, was launched in early 2021.[55]

The PlayStation 5 and Xbox Series X and Series S were released in 2020; they both use GPUs based on the RDNA 2 microarchitecture with incremental improvements and different GPU configurations in each system's implementation.[56][57][58]

2020s

In the 2020s, GPUs have been increasingly used for calculations involving embarrassingly parallel problems, such as the training of neural networks on the enormous datasets needed for artificial intelligence large language models. Specialized processing cores on some modern workstation GPUs are dedicated to deep learning, since they offer significant FLOPS increases by performing 4×4 matrix multiply-accumulate operations, resulting in hardware performance of up to 128 TFLOPS in some applications.[59] These tensor cores are expected to appear in consumer cards, as well.[needs update][60]

GPU companies

Many companies have produced GPUs under a number of brand names. In 2009,[needs update] Intel, Nvidia, and AMD/ATI were the market share leaders, with 49.4%, 27.8%, and 20.6% market share respectively. In addition, Matrox[61] produces GPUs. Chinese companies such as Jingjia Micro have also produced GPUs for the domestic market although in terms of worldwide sales, they lag behind market leaders.[62]

Computational functions

Several factors of GPU construction affect the performance of the card for real-time rendering, such as the size of the connector pathways in the semiconductor device fabrication, the clock signal frequency, and the number and size of various on-chip memory caches. Performance is also affected by the number of streaming multiprocessors (SM) for Nvidia GPUs, compute units (CU) for AMD GPUs, or Xe cores for Intel discrete GPUs, which describe the number of on-silicon processor core units within the GPU chip that perform the core calculations, typically working in parallel with other SM/CUs on the GPU. GPU performance is typically measured in floating point operations per second (FLOPS); GPUs in the 2010s and 2020s typically deliver performance measured in teraflops (TFLOPS). This is an estimated performance measure, as other factors can affect the actual display rate.[63]
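
A rough illustration of how such a TFLOPS estimate is formed (peak FP32 ≈ shader cores × clock × 2 operations per fused multiply-add): the sketch below queries the SM count and clock via the CUDA runtime, but the cores-per-SM value is architecture-dependent and is assumed here, so the printed figure is only an approximation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Cores per SM differs by architecture and is not reported by the runtime;
    // 128 is assumed here as a typical value for recent consumer parts.
    const int assumedCoresPerSM = 128;
    double cores   = prop.multiProcessorCount * (double)assumedCoresPerSM;
    double clockHz = prop.clockRate * 1e3;     // clockRate is reported in kHz

    // One fused multiply-add counts as two floating-point operations.
    double peakTFLOPS = cores * clockHz * 2.0 / 1e12;
    printf("SMs: %d, estimated peak FP32: %.2f TFLOPS\n",
           prop.multiProcessorCount, peakTFLOPS);
    return 0;
}
```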

The ATI HD5470 GPU (above, with copper heatpipe attached) features UVD 2.1 which enables it to decode AVC and VC-1 video formats.

2D graphics APIs

Earlier GPUs may support one or more 2D graphics APIs for 2D acceleration, such as GDI and DirectDraw.[64]

GPU forms

Terminology

In the 1970s, the term "GPU" originally stood for graphics processor unit and described a programmable processing unit working independently from the CPU that was responsible for graphics manipulation and output.[65][66] In 1994, Sony used the term (now standing for graphics processing unit) in reference to the PlayStation console's Toshiba-designed Sony GPU.[31] The term was popularized by Nvidia in 1999, which marketed the GeForce 256 as "the world's first GPU".[67] It was presented as a "single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines".[68] Rival ATI Technologies coined the term "visual processing unit" or VPU with the release of the Radeon 9700 in 2002.[69] The AMD Alveo MA35D, released in 2023, features dual VPUs, each built on a 5 nm process.[70]

In personal computers, there are two main forms of GPUs. Each has many synonyms:[71]

Dedicated graphics processing unit

Dedicated graphics processing units use RAM that is dedicated to the GPU rather than relying on the computer’s main system memory. This RAM is usually specially selected for the expected serial workload of the graphics card (see GDDR). Sometimes systems with dedicated discrete GPUs were called "DIS" systems as opposed to "UMA" systems (see next section).[72]

Technologies such as Scan-Line Interleave by 3dfx, SLI and NVLink by Nvidia and CrossFire by AMD allow multiple GPUs to draw images simultaneously for a single screen, increasing the processing power available for graphics. These technologies, however, are increasingly uncommon; most games do not fully use multiple GPUs, as most users cannot afford them.[73][74][75] Multiple GPUs are still used on supercomputers (like in Summit), on workstations to accelerate video (processing multiple videos at once)[76][77][78] and 3D rendering,[79] for VFX,[80] GPGPU workloads and for simulations,[81] and in AI to expedite training, as is the case with Nvidia's lineup of DGX workstations and servers, Tesla GPUs, and Intel's Ponte Vecchio GPUs.

Integrated graphics processing unit

The position of an integrated GPU in a northbridge/southbridge system layout
An ASRock motherboard with integrated graphics, which has HDMI, VGA and DVI-out ports

Integrated graphics processing units (IGPU), integrated graphics, shared graphics solutions, integrated graphics processors (IGP), or unified memory architectures (UMA) use a portion of a computer's system RAM rather than dedicated graphics memory. IGPs can be integrated onto a motherboard as part of its northbridge chipset,[82] or on the same die (integrated circuit) with the CPU (like AMD APU or Intel HD Graphics). On certain motherboards,[83] AMD's IGPs can use dedicated sideport memory: a separate fixed block of high performance memory that is dedicated for use by the GPU. As of early 2007, computers with integrated graphics account for about 90% of all PC shipments.[84][needs update] They are less costly to implement than dedicated graphics processing, but tend to be less capable. Historically, integrated processing was considered unfit for 3D games or graphically intensive programs but could run less intensive programs such as Adobe Flash. Examples of such IGPs would be offerings from SiS and VIA circa 2004.[85] However, modern integrated graphics processors such as AMD Accelerated Processing Unit and Intel Graphics Technology (HD, UHD, Iris, Iris Pro, Iris Plus, and Xe-LP) can handle 2D graphics or low-stress 3D graphics.

Since GPU computations are memory-intensive, integrated processing may compete with the CPU for relatively slow system RAM, as it has minimal or no dedicated video memory. IGPs use system memory with bandwidth up to a current maximum of 128 GB/s, whereas a discrete graphics card may have a bandwidth[86] of more than 1000 GB/s between its VRAM and GPU core. This memory bus bandwidth can limit the performance of the GPU, though multi-channel memory can mitigate this deficiency.[87] Older integrated graphics chipsets lacked hardware transform and lighting, but newer ones include it.[88][89]
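
The practical effect of this bandwidth gap can be approximated by timing a large on-device copy. The following hedged CUDA sketch uses events to estimate achievable VRAM bandwidth; the buffer size is chosen arbitrarily, and the result will vary with hardware and driver behavior.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ull << 28;          // 256 MiB per buffer (example size)
    float *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // A device-to-device copy reads and writes each byte, so count bytes twice.
    double gbPerS = 2.0 * bytes / (ms / 1e3) / 1e9;
    printf("approx. VRAM bandwidth: %.1f GB/s\n", gbPerS);

    cudaFree(src); cudaFree(dst);
    return 0;
}
```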

On systems with "Unified Memory Architecture" (UMA), including modern AMD processors with integrated graphics,[90] modern Intel processors with integrated graphics,[91] Apple processors, the PS5 and Xbox Series (among others), the CPU cores and the GPU block share the same pool of RAM and memory address space.

Stream processing and general purpose GPUs (GPGPU)

It is common to use a general purpose graphics processing unit (GPGPU) as a modified form of stream processor (or a vector processor), running compute kernels. This turns the massive computational power of a modern graphics accelerator's shader pipeline into general-purpose computing power. In certain applications requiring massive vector operations, this can yield several orders of magnitude higher performance than a conventional CPU. The two largest discrete (see "Dedicated graphics processing unit" above) GPU designers, AMD and Nvidia, are pursuing this approach with an array of applications. Both Nvidia and AMD teamed with Stanford University to create a GPU-based client for the Folding@home distributed computing project for protein folding calculations. In certain circumstances, the GPU calculates forty times faster than the CPUs traditionally used by such applications.[92][93]

GPU-based high performance computers play a significant role in large-scale modelling. Three of the ten most powerful supercomputers in the world take advantage of GPU acceleration.[94]

Since 2005 there has been interest in using the performance offered by GPUs for evolutionary computation in general, and for accelerating the fitness evaluation in genetic programming in particular. Most approaches compile linear or tree programs on the host PC and transfer the executable to the GPU to be run. Typically a performance advantage is only obtained by running the single active program simultaneously on many example problems in parallel, using the GPU's SIMD architecture.[95] However, substantial acceleration can also be obtained by not compiling the programs and instead transferring them to the GPU to be interpreted there.[96]
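
A hedged sketch of that evaluation pattern in CUDA: a single candidate program (hard-coded here instead of compiled on the host) is applied to many fitness cases in parallel, one case per thread. The example program, fitness cases, and squared-error metric are all invented for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative candidate program: f(x) = x*x + x. Each thread evaluates it on
// one fitness case and records the squared error against the target output.
__global__ void evalFitness(const float *inputs, const float *targets,
                            float *errors, int nCases) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nCases) {
        float x = inputs[i];
        float y = x * x + x;                   // the candidate program
        float d = y - targets[i];
        errors[i] = d * d;
    }
}

int main() {
    const int n = 4096;                        // number of fitness cases (arbitrary)
    float *in, *tgt, *err;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&tgt, n * sizeof(float));
    cudaMallocManaged(&err, n * sizeof(float));
    for (int i = 0; i < n; ++i) {
        in[i] = (float)i / n;
        tgt[i] = in[i] * in[i] + in[i];        // perfect target, so error should be ~0
    }

    evalFitness<<<(n + 255) / 256, 256>>>(in, tgt, err, n);
    cudaDeviceSynchronize();

    double total = 0.0;
    for (int i = 0; i < n; ++i) total += err[i];   // reduce on the host for brevity
    printf("total squared error: %g\n", total);

    cudaFree(in); cudaFree(tgt); cudaFree(err);
    return 0;
}
```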

External GPU (eGPU)

An external GPU is a graphics processor located outside the housing of the computer. Because notebook computers often lack a powerful graphics processor that can be upgraded internally, it is desirable to attach a GPU to an external bus of the notebook. PCI Express is the only bus used for this purpose. The port may be, for example, an ExpressCard or mPCIe port (PCIe ×1, up to 5 or 2.5 Gbit/s respectively), a Thunderbolt 1, 2, or 3 port (PCIe ×4, up to 10, 20, or 40 Gbit/s respectively), a USB4 port with Thunderbolt compatibility, or an OCuLink port. Those ports are only available on certain notebook systems.[97] eGPU enclosures include their own power supply (PSU), because powerful GPUs can consume hundreds of watts.[98]

Energy efficiency

Graphics processing units (GPUs) have continued to increase in energy usage, while CPU designers have recently[when?] focused on improving performance per watt. High-performance GPUs may draw large amounts of power, so intelligent techniques are required to manage GPU power consumption. Measures like the 3DMark2006 score per watt can help identify more efficient GPUs.[99] However, that may not adequately capture efficiency in typical use, where much time is spent on less demanding tasks.[100]

With modern GPUs, energy usage is an important constraint on the maximum computational capabilities that can be achieved. GPU designs are usually highly scalable, allowing the manufacturer to put multiple chips on the same video card, or to use multiple video cards that work in parallel. Peak performance of any system is essentially limited by the amount of power it can draw and the amount of heat it can dissipate. Consequently, performance per watt of a GPU design translates directly into peak performance of a system that uses that design.

Since GPUs may also be used for some general purpose computation, sometimes their performance is measured in terms also applied to CPUs, such as FLOPS per watt.
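
For example, a hypothetical card sustaining 30 TFLOPS of single-precision throughput at a 300 W board power would be rated at 100 GFLOPS per watt (30 × 10^12 FLOPS / 300 W); the figures are illustrative rather than measurements of any particular product.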

Sales

In 2013, 438.3 million GPUs were shipped globally and the forecast for 2014 was 414.2 million. However, by the third quarter of 2022, shipments of PC GPUs totaled around 75.5 million units, down 19% year-over-year.[101][needs update][102]

from Grokipedia
A graphics processing unit (GPU) is a specialized electronic circuit designed to accelerate the creation of images in a frame buffer for output to a display device by rapidly manipulating and altering memory through parallel processing of graphical data. Originally developed to handle the high computational demands of real-time 3D rendering in video games and visual applications, GPUs consist of thousands of smaller, efficient cores optimized for simultaneous execution of many floating-point or integer operations, contrasting with the sequential focus of central processing units (CPUs). The first commercial GPU, NVIDIA's GeForce 256 released in 1999, integrated 3D graphics capabilities into a single chip, marking the shift from separate fixed-function hardware to programmable architectures that could handle vertex and pixel shading through shaders. Over the subsequent decades, advancements in GPU design—such as the introduction of unified shader models in NVIDIA's GeForce 8 series (2006) and AMD's Radeon HD 2000 series—enabled greater flexibility, allowing the same processing units to handle diverse workloads beyond graphics. As of 2025, GPUs deliver peak performance exceeding hundreds of teraflops in high-end models, with architectures like NVIDIA's Blackwell and Rubin series or AMD's RDNA 4 incorporating features such as ray tracing hardware and tensor cores for enhanced efficiency in both rendering and compute tasks.

Beyond traditional graphics, GPUs have become essential for general-purpose computing on graphics processing units (GPGPU), powering applications in machine learning, scientific simulations, and the computing clusters that rank among the world's fastest supercomputers. This expansion stems from their ability to process massive datasets in parallel, offloading intensive workloads from CPUs to achieve up to 100x speedups in data-parallel algorithms. Key enablers include programming models like NVIDIA's CUDA (introduced in 2006) and OpenCL (released in 2009), which allow developers to leverage GPU compute power without deep graphics expertise. In safety-critical domains such as autonomous vehicles, GPUs integrate with systems requiring high-throughput parallel execution while addressing hardware reliability challenges.

Definition and Fundamentals

Core Concept and Purpose

A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. This hardware excels in handling the intensive computational demands of visual rendering by processing vast arrays of data simultaneously. The primary purpose of a GPU is to optimize parallel processing for graphical computations, enabling real-time rendering of geometry, textures, and shaders in applications such as video games and simulations. Unlike general-purpose processors, GPUs are architected with thousands of smaller cores tailored for executing repetitive, data-intensive tasks in parallel, which dramatically improves efficiency for such workloads. This parallelization allows GPUs to handle the geometric transformations, lighting calculations, and shading required to generate complex scenes at high frame rates.

GPUs have evolved from fixed-function hardware, including early video display processors that performed dedicated tasks like scan-line rendering, to modern programmable architectures. A pivotal shift occurred in the late 1990s and early 2000s with the introduction of programmable shaders, transforming GPUs from rigid pipelines to flexible engines capable of custom algorithms. This evolution, marked by milestones such as NVIDIA's GeForce 256 in 1999, marketed as the first GPU, and subsequent unified shader models, expanded their utility beyond fixed graphics operations to support dynamic, developer-defined processing.

At its core, the GPU workflow begins with the input of vertex data, representing 3D model points, which is transformed through vertex shaders to compute screen-space positions and attributes like normals and colors. Primitives such as triangles are then assembled, clipped to the viewport, and rasterized to produce fragments—potential pixels with interpolated data. Fragment processing follows, where shaders evaluate lighting, texturing, and other effects to determine final color values, which are written to the frame buffer for display. This sequential yet highly parallel pipeline ensures efficient traversal from geometric input to rendered output.
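
A hedged CUDA sketch of the last stage of that workflow: each thread acts like a fragment shader, computing a color for one pixel and writing it into a framebuffer array. The gradient formula, image size, and kernel name are illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per pixel: compute a color (here a simple gradient) and write it
// to the framebuffer, mimicking the fragment-shading / output-merger stages.
__global__ void shadePixels(uchar4 *framebuffer, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        unsigned char r = (unsigned char)(255.0f * x / width);
        unsigned char g = (unsigned char)(255.0f * y / height);
        framebuffer[y * width + x] = make_uchar4(r, g, 128, 255);
    }
}

int main() {
    const int w = 640, h = 480;
    uchar4 *fb;
    cudaMallocManaged(&fb, w * h * sizeof(uchar4));

    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    shadePixels<<<grid, block>>>(fb, w, h);
    cudaDeviceSynchronize();

    printf("pixel(100,100) = (%d, %d, %d)\n",
           fb[100 * w + 100].x, fb[100 * w + 100].y, fb[100 * w + 100].z);
    cudaFree(fb);
    return 0;
}
```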

Distinction from CPU

Central Processing Units (CPUs) are designed for sequential processing, featuring a small number of powerful cores—typically 4 to 64 in modern consumer models—optimized for general-purpose tasks such as branching, caching, and handling complex control flows. These cores emphasize low-latency execution, enabling efficient management of operating systems, user interactions, and serial workloads where instructions vary dynamically. In contrast, Graphics Processing Units (GPUs) incorporate thousands of simpler cores, often organized into streaming multiprocessors, tailored for massive parallelism in data-intensive operations like matrix multiplications and vector computations. These cores execute hundreds or thousands of threads concurrently, prioritizing high throughput over individual task speed, which makes GPUs ideal for scenarios where many similar computations can proceed independently.

A fundamental architectural distinction lies in their execution models: CPUs primarily follow a multiple instruction, multiple data (MIMD) paradigm, allowing each core to process different instructions on varied data streams for versatile, control-heavy applications. GPUs, however, employ single instruction, multiple threads (SIMT)—a variant of single instruction, multiple data (SIMD)—where groups of threads (e.g., warps of 32) apply the same instruction to different data elements simultaneously, enhancing efficiency for uniform, data-parallel tasks. This SIMD-like approach in GPUs focuses on aggregate throughput, tolerating latency through extensive multithreading, whereas CPUs optimize for rapid serial performance via features like branch prediction and large caches. These differences result in clear trade-offs: GPUs underperform in serial, branch-intensive tasks due to their simplified cores and lack of advanced control mechanisms, but they deliver superior floating-point operations per second (FLOPS) through sheer core volume—for instance, modern GPUs may feature over 10,000 cores compared to a CPU's dozens, enabling orders-of-magnitude higher parallel compute capacity.
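
A small CUDA sketch of that SIMT grouping: threads are issued in warps of 32, and a data-dependent branch splits each warp so that the two halves execute serially (warp divergence), which is why branch-heavy serial code maps poorly onto GPUs. The arithmetic and sizes are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergentKernel(float *out, int n) {
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;       // position within the 32-thread warp
    if (i < n) {
        // Threads in the same warp take different paths, so the hardware
        // serializes the two sides of the branch (warp divergence).
        if (lane < 16)
            out[i] = sinf((float)i);   // first half of the warp
        else
            out[i] = cosf((float)i);   // second half of the warp
    }
}

int main() {
    const int n = 1024;
    float *out;
    cudaMallocManaged(&out, n * sizeof(float));
    divergentKernel<<<(n + 255) / 256, 256>>>(out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f, out[16] = %f\n", out[0], out[16]);
    cudaFree(out);
    return 0;
}
```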

Historical Development

Origins in Early Computing (1970s-1990s)

The development of graphics processing units (GPUs) traces its roots to the 1970s, when foundational hardware for raster graphics emerged alongside advancements in display technology. A pivotal invention was the frame buffer, a dedicated memory system capable of storing pixel data for an entire video frame, enabling efficient manipulation and display of images. In 1973, Richard Shoup at Xerox PARC created the SuperPaint system, featuring the first practical 8-bit frame buffer that supported real-time painting and video-compatible output, marking a shift from vector-based to raster graphics. This innovation laid the groundwork for pixel-based rendering by allowing software to directly address individual screen pixels, distinct from earlier line-drawing displays.

During the same decade, key rendering algorithms were formulated to handle the complexities of 3D graphics on these emerging systems. Scan-line rendering, which processes images line by line to efficiently compute visible surfaces, was advanced through Watkins' 1970 algorithm for hidden-surface removal, optimizing polygon traversal in image order. Texture mapping, a technique to apply 2D images onto 3D surfaces for enhanced realism, was pioneered by Edwin Catmull in his 1974 PhD thesis, where he demonstrated how to map textures onto polygons without geometric distortion. Complementing this, the Z-buffer algorithm, invented by Catmull in 1974, resolved depth occlusion by storing a depth value per pixel and comparing incoming fragments to determine visibility, enabling robust hidden-surface removal in rasterizers.

The 1980s saw the rise of fixed-function hardware accelerators for 2D graphics, transitioning from software-based systems to specialized chips that offloaded drawing tasks from general-purpose CPUs. IBM's 8514 display adapter, introduced in 1987 for the PS/2 personal computers, was a landmark fixed-function chip supporting 1024×768 resolution with hardware acceleration for lines, polygons, and bit-block transfers, significantly boosting CAD and graphics performance. Early attempts at 3D acceleration appeared in professional systems such as Evans & Sutherland's Picture System series, which evolved from the 1974 vector-based model to raster-capable versions by the late 1970s and 1980s, delivering real-time 3D transformations for flight simulators and visualization at rates up to 130,000 vectors per second in the PS 300 (1980). These systems integrated scan-line algorithms with hardware for perspective projection, prioritizing high-speed rendering over consumer accessibility.

By the 1990s, consumer-oriented GPUs emerged, focusing on 3D acceleration for gaming and multimedia. The 3dfx Voodoo Graphics, launched in November 1996, was the first widely adopted consumer 3D accelerator, a dedicated add-in card supporting texture mapping with bilinear filtering at resolutions up to 800×600, requiring a separate 2D card for full functionality. It popularized fixed-function 3D pipelines in PCs, achieving frame rates over 30 fps in early titles like Quake. NVIDIA's RIVA 128, released in 1997, advanced this by integrating 2D/3D capabilities on a single chip with hardware triangle setup, processing up to 1.5 million polygons per second and offloading geometric computations from the CPU. These innovations, building on 1970s algorithms, established GPUs as essential for interactive 3D, setting the stage for broader adoption.

Acceleration of 3D Graphics (2000s)

The early 2000s marked a pivotal shift in GPU design toward greater programmability for 3D graphics, building on fixed-function pipelines to handle increasingly complex scenes in gaming and professional applications. NVIDIA's GeForce 256, released in 1999 but influencing development through the decade, was the first GPU to integrate hardware transform and lighting (T&L) units, offloading geometric computations from the CPU and enabling developers to render more polygons with smoother frame rates. This capability proved essential for early T&L-enabled titles, which leveraged it to achieve higher detail and performance, setting a benchmark for 3D acceleration. Concurrently, ATI's Radeon series emerged as a strong competitor; the Radeon 8500 (2001) introduced enhanced multi-texturing for layered surface effects, while the Radeon 9700 Pro (2002) became the first GPU to fully support DirectX 9, delivering superior pixel fill rates and programmable shaders for realistic lighting and textures.

In the mid-2000s, the introduction of programmable shaders revolutionized rendering by allowing developers to customize vertex and pixel processing beyond fixed functions. DirectX 8 (2000) brought the first vertex shaders for deformable geometry and pixel shaders for per-pixel effects like dynamic lighting, with NVIDIA's GeForce 3 providing early hardware support. DirectX 9 (2002) expanded this with higher-precision shaders (Shader Model 2.0 and 3.0), enabling advanced techniques such as high-dynamic-range (HDR) lighting, while OpenGL 2.0 (2004) standardized similar programmability across platforms. A landmark innovation came in 2006 with NVIDIA's G80 architecture in the GeForce 8800 series, which introduced unified shaders—versatile units that could handle both vertex and pixel tasks dynamically, boosting efficiency by up to 2x in DirectX 10 workloads and supporting more complex scenes without idle hardware. These advancements facilitated innovations like multi-texturing, where multiple texture layers combined for detailed surfaces, and normal mapping, a technique using normal maps to simulate surface irregularities for realistic lighting without additional geometry; early GPU-optimized implementations handled self-shadowing effectively.

By the late 2000s, GPUs powered the rise of high-definition (HD) gaming, particularly through console integrations that influenced PC designs. The Xbox 360, launched in 2005, featured ATI's custom Xenos GPU with 48 unified shading units and 256 MB of shared GDDR3 memory, enabling high-definition rendering with advanced effects like alpha-to-coverage anti-aliasing for smoother HD visuals. Similarly, the PlayStation 3 (2006) incorporated NVIDIA's RSX "Reality Synthesizer," a variant of the GeForce 7800 GTX with 24 pixel shaders and 256 MB GDDR3, supporting DirectX 9-level features and driving demand for comparable PC performance. NVIDIA's GT200 GPU (2008), powering the GTX 280, served as a precursor to ray tracing by demonstrating real-time interactive ray-traced scenes in 2008, achieving 30 frames per second with shadows, reflections, and refractions using CUDA-accelerated software on its 1.4 billion transistors. This era also saw memory capacity scale dramatically, with cards like the ATI Radeon HD 4870 introducing 1 GB of GDDR5 VRAM in 2008 to handle larger textures and higher resolutions without bandwidth bottlenecks.

Expansion into Compute and AI (2010s-2025)

During the 2010s, graphics processing units expanded significantly into general-purpose computing (GPGPU), enabled by NVIDIA's CUDA platform, which, although introduced in 2006, saw widespread adoption for parallel computing tasks in scientific simulations and early AI applications by the mid-decade. This shift was marked by the 2010 launch of NVIDIA's Fermi architecture, the first consumer GPU to include error-correcting code (ECC) memory, enhancing reliability for compute-intensive workloads beyond graphics. In 2012, the Kepler architecture further advanced GPGPU capabilities with improved double-precision floating-point performance, up to three times that of the previous Fermi generation, making GPUs viable for high-performance scientific computing like molecular dynamics and climate modeling.

The mid-2010s witnessed a deep learning boom, propelled by GPUs' parallel processing prowess, with NVIDIA's Pascal architecture in 2016 introducing native FP16 support to accelerate training and inference. This laid groundwork for specialized AI hardware, as seen in the 2017 Volta architecture's debut of Tensor Cores, dedicated units for matrix multiply-accumulate operations central to deep learning algorithms. AMD contributed with its Vega architecture in 2017, featuring high-bandwidth cache and compute units optimized for compute workloads, supporting platforms like ROCm for open-source GPGPU programming.

Entering the 2020s, GPUs integrated ray tracing hardware starting with NVIDIA's RTX 20-series in 2018, based on the Turing architecture, which added RT Cores for real-time ray tracing in compute simulations like physics rendering and light transport, extending beyond gaming to scientific visualization. AI-specific advancements accelerated with the 2020 A100 GPU on the Ampere architecture, delivering up to 312 teraflops of FP16 performance for AI training via third-generation Tensor Cores and multi-instance GPU partitioning for efficient large-scale deployments. The 2022 H100 on the Hopper architecture pushed boundaries further, offering up to 4 petaflops of AI performance with Transformer Engine optimizations for large language models, significantly reducing training times for generative AI.

By 2025, GPUs increasingly supported quantum simulations, leveraging libraries like NVIDIA's cuQuantum for high-fidelity modeling of quantum circuits on classical hardware, enabling researchers to prototype quantum algorithms at scales unattainable on CPUs alone. Advancements in neuromorphic-inspired GPU designs emerged around 2023-2025, with hybrid architectures mimicking neural efficiency for low-power AI, as explored in scalable neuromorphic systems integrated with GPU backends for edge and data-center inference. In 2025, NVIDIA introduced the Blackwell architecture, powering GPUs like the B200 with up to 20 petaFLOPS of FP4 Tensor Core performance (sparse), further accelerating AI for large language models and enabling new scales of generative AI deployment. Concurrently, edge AI accelerators like NVIDIA's Jetson series faced disruptions from surging demand and supply shortages, delaying deployments but spurring innovations in modular, power-efficient GPU variants for IoT and autonomous systems amid global chip constraints.

Manufacturers and Market Dynamics

Key GPU Manufacturers

NVIDIA, founded on April 5, 1993, by Jensen Huang, Chris Malachowsky, and Curtis Priem, emerged as a pioneer in graphics processing with a focus on 3D acceleration for gaming and multimedia applications. The company developed the GeForce series for consumer gaming, starting with the GeForce 256 in 1999, which introduced hardware transform and lighting capabilities. For professional markets, NVIDIA offers the Quadro line (rebranded under RTX for workstations), optimized for CAD, CGI, and visualization tasks with certified drivers for stability. In compute applications, the Tesla series, introduced with the Tesla architecture in 2006, targets high-performance computing and scientific simulations, evolving into data center GPUs with features like Tensor Cores. A notable innovation is Deep Learning Super Sampling (DLSS), first released in February 2019, which uses AI to upscale images and boost performance in real-time rendering.

Advanced Micro Devices (AMD) entered the GPU market through its acquisition of ATI Technologies in July 2006, integrating ATI's graphics expertise to expand beyond CPUs. The Radeon series, originating from ATI's designs, serves consumer and professional graphics needs, emphasizing high-performance rasterization and ray tracing in modern iterations. AMD has prioritized open-source drivers since 2007, releasing documentation and driver code that enabled community-driven development through projects like AMDGPU. Additionally, AMD's Accelerated Processing Units (APUs) combine CPU and GPU on a single die, starting with the Fusion architecture in 2011, to deliver integrated solutions for laptops and desktops with shared memory access.

Intel has long incorporated integrated GPUs (iGPUs) into its processors, with the first widespread adoption in the Clarkdale architecture in January 2010, providing basic graphics acceleration without discrete cards. These iGPUs, branded as Intel HD Graphics and later Iris Xe, handle everyday computing and light gaming directly on the CPU die. In 2022, Intel launched its discrete Arc series, targeting entry-to-midrange gaming and content creation with the Alchemist architecture, marking the company's re-entry into standalone GPUs after the 1998 i740.

Other notable manufacturers include Arm, which designs the Mali series of GPUs for mobile and embedded systems, licensed to SoC makers for power-efficient rendering in smartphones and tablets, with recent models like the Immortalis-G925 incorporating ray tracing. Qualcomm integrates its Adreno GPUs into its Snapdragon processors, optimizing for mobile gaming and AR/VR with features like variable rate shading since the Adreno 660 series in 2021. Apple develops custom GPUs for its M-series chips, debuting in the M1 SoC in November 2020, featuring a unified memory architecture for seamless CPU-GPU data sharing in Macs and iPads.

GPU designers predominantly rely on Taiwan Semiconductor Manufacturing Company (TSMC) for fabrication, as NVIDIA, AMD, and others lack in-house foundries for advanced nodes. By 2025, TSMC's 3nm process (N3) supports high-volume production for mobile and upcoming AI GPUs, while advanced nodes like TSMC's 5nm and 4nm processes are used in AMD's RDNA 3 and NVIDIA's Hopper architectures, respectively, offering improved density and efficiency. The shift to 2nm (N2) processes is underway, with volume production slated for the second half of 2025, promising further scaling via gate-all-around transistors for next-generation discrete and integrated GPUs.

The GPU industry operates as an oligopoly, primarily controlled by NVIDIA, AMD, and an emerging Intel in the discrete segment.
In 2023, NVIDIA commanded approximately 88% of the discrete GPU market share, with AMD holding around 12% and Intel maintaining a minimal presence below 1%. By 2024, NVIDIA's dominance strengthened to about 84-92% across quarters, while AMD's share hovered at 8-12% and Intel remained under 1%. This trend intensified in 2025, with NVIDIA reaching 94% of the discrete market in Q2, AMD dropping to 6%, and Intel still below 1%, driven by NVIDIA's superior positioning in high-performance segments.

Global GPU market revenue experienced significant fluctuations, influenced by external factors like cryptocurrency mining and AI adoption. Valued at around $40 billion earlier in the decade, the market grew to $52.1 billion in 2023 amid recovering demand post-shortages. It peaked at approximately $63 billion in 2024, propelled by surging AI workloads that boosted GPU sales to approximately $16 billion in 2024, according to estimates. Projections for 2025 estimate further expansion to $100-150 billion overall, with data center segments alone reaching $120 billion, underscoring AI's role in sustaining growth. In 2025, NVIDIA's Blackwell GPUs continued to drive AI growth, while AMD prepared RDNA 4 for consumer markets. The cryptocurrency mining boom from 2017 to 2021 inflated GPU demand, contributing up to 25% of NVIDIA's shipments in peak quarters, but the subsequent crash led to excess inventory, a $5.5 million SEC fine for NVIDIA over undisclosed mining-related impacts, and a 50-60% price drop in consumer GPUs by mid-2023.

Competition in the GPU market is intensified by price pressures, supply dynamics, and shifting demand priorities. NVIDIA and AMD have engaged in aggressive price competition, particularly in mid-range cards like the RTX 4060/4070 series versus the RX 7600/7700, with real-world pricing falling 20-30% in 2025 to attract gamers amid stabilizing supply. Supply shortages from 2020 to 2022, exacerbated by the COVID-19 pandemic, cryptocurrency mining surges, and U.S.-China trade tensions, caused GPU prices to double or triple, delaying consumer upgrades and benefiting enterprise buyers. By 2025, the market has shifted toward AI dominance, where NVIDIA captures 93% of server GPU revenue, marginalizing consumer competition as hyperscalers prioritize high-end accelerators over mid-range gaming products.

Regionally, the GPU ecosystem features concentrated manufacturing in East Asia alongside design innovation in the U.S. and Europe. The Asia-Pacific region serves as the primary hub for fabrication, with Taiwan's TSMC producing over 90% of advanced GPUs, supporting explosive growth in the region's GPU market to $44.6 billion by 2034 at a 20.8% CAGR. In contrast, the U.S. leads in R&D and design, where firms like NVIDIA, AMD, and Intel develop architectures, while Europe contributes through specialized applications in automotive and simulation. This division enhances efficiency but exposes the industry to geopolitical risks, such as U.S. export controls on advanced chips to China in 2024-2025.

Architectural Components

Processing Cores and Pipelines

At the heart of a GPU's parallel processing capability are its processing cores, which execute computational tasks in a highly concurrent manner. In NVIDIA architectures, these are known as CUDA cores, which serve as the fundamental units for performing floating-point and integer arithmetic operations within the Streaming Multiprocessors (SMs). Each core is a pipelined execution unit capable of handling scalar operations, with modern implementations supporting single-precision (FP32) fused multiply-add (FMA) instructions at high throughput. Similarly, AMD GPUs employ stream processors as their core execution units, organized within Compute Units (CUs) to handle vectorized arithmetic and logic operations on groups of threads. These stream processors, part of the Vector ALU (VALU), execute instructions like V_ADD_F32 for 32-bit additions or V_FMA_F64 for 64-bit fused multiply-adds, enabling efficient data-parallel computation across work-items.

To accelerate matrix-heavy workloads such as deep learning, NVIDIA introduced tensor cores in 2017 with the Volta architecture: specialized hardware units that perform mixed-precision matrix multiply-accumulate (MMA) operations. Each tensor core executes a 4×4×4 MMA with FP16 inputs and FP32 accumulation per clock cycle, providing up to 64 FP16 FMA operations, which significantly boosts throughput for AI training and inference compared to standard cores. These cores integrate seamlessly into the SM structure, with later architectures enhancing them to support additional precisions like FP8 and INT8 for broader applicability.

The graphics processing pipeline in GPUs consists of sequential stages that transform 3D scene data into a 2D rendered image, leveraging the cores for programmable computations. The pipeline begins with vertex fetch, where vertex data is retrieved from memory, followed by geometry processing (including vertex shading and tessellation) to compute positions and attributes. Primitive assembly then forms triangles from vertices, leading to rasterization, which generates fragments (potential pixels) by scanning primitives against the screen. Fragment shading applies per-fragment computations for color and texture, and finally, the output merger resolves depth, blending, and writes the final pixels to the framebuffer. This fixed-function and programmable flow ensures efficient handling of rendering tasks, with programmable stages executed on the processing cores.

GPUs achieve massive parallelism through the single instruction, multiple threads (SIMT) execution model, where groups of threads execute the same instruction concurrently on multiple data elements. In NVIDIA GPUs, threads are bundled into warps of 32 threads, scheduled by warp schedulers within each SM to hide latency from long-running operations like memory accesses. AMD employs a similar SIMT approach but uses wavefronts of 32 or 64 work-items, executed in lockstep across stream processors, with the EXEC mask controlling active lanes to support divergent execution paths. This model allows thousands of threads to overlap execution, maximizing core utilization.

Scalability in GPU architectures is achieved by grouping processing cores into larger units, such as NVIDIA's Streaming Multiprocessors (SMs), which contain multiple CUDA and tensor cores along with schedulers and caches. In the datacenter-focused GA100 (A100 GPU), each SM includes 64 FP32 CUDA cores and 4 tensor cores, enabling the A100 GPU to feature 108 SMs for a total of 6912 CUDA cores. Consumer GPUs, such as the RTX 30 series (GA102/104 dies), feature 128 FP32 CUDA cores per SM.
In AMD designs, stream processors are clustered into Compute Units (CUs), with each CU handling up to 64 stream processors in RDNA architectures, allowing high-end GPUs to scale to hundreds of CUs for enhanced parallelism.
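
A hedged sketch of how such a tensor-core operation is expressed through CUDA's warp-level WMMA API, which tiles work into 16×16×16 fragments built on the hardware MMA units described above. It assumes a Volta-or-newer GPU (compiled for sm_70 or later), and the single-tile size, layouts, and all-ones test data are illustrative.

```cuda
#include <cstdio>
#include <mma.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
using namespace nvcuda;

// One warp multiplies a single 16x16 tile: D = A * B + C, with FP16 inputs
// and FP32 accumulation, executed on the SM's tensor cores.
__global__ void wmmaTile(const half *A, const half *B, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);          // C = 0
    wmma::load_matrix_sync(aFrag, A, 16);      // leading dimension 16
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    wmma::store_matrix_sync(D, cFrag, 16, wmma::mem_row_major);
}

int main() {
    half *A, *B; float *D;
    cudaMallocManaged(&A, 256 * sizeof(half));
    cudaMallocManaged(&B, 256 * sizeof(half));
    cudaMallocManaged(&D, 256 * sizeof(float));
    for (int i = 0; i < 256; ++i) { A[i] = __float2half(1.0f); B[i] = __float2half(1.0f); }

    wmmaTile<<<1, 32>>>(A, B, D);              // a single warp of 32 threads
    cudaDeviceSynchronize();
    printf("D[0] = %f (expected 16.0)\n", D[0]);

    cudaFree(A); cudaFree(B); cudaFree(D);
    return 0;
}
```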

Memory Systems and Bandwidth

Graphics processing units (GPUs) rely on a sophisticated memory hierarchy to manage the high volume of data required for parallel computations, ensuring efficient access speeds that match the demands of rendering, compute tasks, and AI workloads. At the lowest level, registers provide the fastest access, storing immediate operands for the cores with latencies of a few cycles or less. These are followed by L1 caches, which are small, on-chip stores per streaming multiprocessor (SM) or compute unit, offering low-latency access for frequently used data and often configurable as shared memory for thread cooperation. L2 caches serve as a larger, chip-wide buffer shared across all cores, aggregating data from global memory to reduce off-chip traffic. Global memory, typically implemented as video RAM (VRAM), forms the bulk storage for textures, framebuffers, and large datasets, accessed via high-speed DRAM. In integrated GPUs, unified memory architectures allow seamless sharing between CPU and GPU address spaces, minimizing data copies through virtual addressing.

Memory types in GPUs are optimized for bandwidth over capacity, with discrete variants favoring high-performance DRAM to sustain peak throughput. GDDR7, the latest double-data-rate synchronous dynamic RAM variant, delivers high bandwidth, reaching up to 1.8 TB/s in consumer cards such as the RTX 5090 as of 2025, enabling rapid data feeds for 4K and 8K rendering. For data center and professional applications, High Bandwidth Memory 3 (HBM3) and its extension HBM3e stack multiple DRAM dies vertically using through-silicon vias, achieving bandwidths up to 8 TB/s per GPU in configurations like the Blackwell B200 as of 2025, critical for large-scale AI training where memory-intensive operations dominate. These memory types interface with the GPU via wide buses; for instance, a 384-bit bus width allows parallel transfer of 384 bits per clock cycle, scaling total bandwidth proportionally to clock speed and directly impacting frame rates in bandwidth-limited scenarios.

Bandwidth limitations often manifest as bottlenecks during texture fetching, where shaders repeatedly sample large 2D/3D arrays from global memory, consuming significant VRAM throughput and stalling pipelines if cache misses occur. Texture units mitigate this through dedicated caches and filtering hardware, but in high-resolution scenarios, uncoalesced accesses or excessive mipmapping can saturate the memory bus, reducing effective utilization to below 50% of peak. Advancements address these challenges: error-correcting code (ECC) memory, standard in professional GPUs like the AMD Radeon PRO series, detects and corrects single-bit errors in VRAM, ensuring data integrity for mission-critical simulations without halting execution. By 2025, trends toward Compute Express Link (CXL) interconnects enable pooled memory across GPUs and hosts, allowing dynamic allocation of terabytes of shared DRAM over PCIe-based fabrics with latencies on the order of 100-200 ns, reducing silos and boosting efficiency in disaggregated AI clusters.
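
As a worked illustration of that bus-width relationship: a 384-bit bus paired with memory running at an assumed effective 21 Gbit/s per pin would provide roughly (384 / 8) × 21 ≈ 1,008 GB/s of theoretical peak bandwidth; the pin rate here is illustrative rather than the specification of any particular card.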

GPU Variants

Discrete and Integrated GPUs

Discrete graphics processing units (GPUs), also known as dedicated or standalone GPUs, are separate hardware components typically installed as expansion cards via interfaces like PCIe in desktop systems or soldered onto motherboards in laptops. These dGPUs are engineered for high-performance tasks such as gaming and professional workloads, including video rendering and machine learning, where they deliver superior computational throughput compared to integrated alternatives. High-end models, like those from NVIDIA's RTX series or AMD's RX lineup, often feature power draws ranging from 300W to 600W under load, necessitating robust power supplies and cooling solutions to manage thermal output.

In contrast, integrated GPUs (iGPUs) are embedded directly on the same die as the central processing unit (CPU) within a system-on-chip (SoC) design, as seen in Intel's UHD Graphics series or AMD's Radeon Vega-based integrated solutions. These iGPUs are optimized for lower-power environments, with typical thermal design power (TDP) allocations of 15W to 65W as part of the overall CPU package, making them suitable for everyday computing in laptops and office desktops, such as web browsing, video streaming, and light productivity applications. Their efficiency stems from shared access to system resources, which minimizes additional hardware overhead.

The primary trade-offs between dGPUs and iGPUs revolve around performance, power consumption, and form factor constraints. dGPUs benefit from dedicated video memory (VRAM), often of the GDDR6 or HBM type, which enables faster data access and higher bandwidth for complex rendering without competing with CPU operations; they also incorporate independent cooling systems, such as multi-fan heatsinks or liquid cooling, to sustain peak performance over extended periods. Conversely, iGPUs rely on shared system RAM for operations, which can introduce bottlenecks under heavy loads but allows for slimmer, more portable device designs by eliminating the need for separate components and reducing overall power and heat generation.

By 2025, iGPUs hold a dominant position in consumer PCs, comprising over 70% of the global GPU market and appearing in approximately 80% of entry-level and mainstream systems due to their cost-effectiveness and suitability for general use. In AI servers, however, dGPUs prevail, with NVIDIA capturing around 93% of server GPU revenue through high-performance discrete cards optimized for tasks like AI training.

Specialized Forms (Mobile, External, Hybrid)

Mobile GPUs are specialized low-power variants designed for battery-constrained devices such as smartphones and laptops, prioritizing energy efficiency over raw performance to manage thermal dissipation within tight limits. NVIDIA's Tegra series, for instance, integrates GPU cores into system-on-chip (SoC) designs for mobile platforms, with the Tegra 4 achieving up to 45% lower power consumption than its predecessor in typical use cases, enabling extended battery life in devices like tablets and portable gaming systems. Similarly, Qualcomm's Adreno GPUs, embedded in Snapdragon processors, deliver graphics acceleration for mobile gaming while adhering to low-power budgets typically under 15W for smartphone SoCs, balancing high-frame-rate rendering with heat management in compact form factors. As of 2025, the Adreno GPU in the Snapdragon 8 Elite series offers 23% improved performance and 37% faster AI processing compared to previous generations, enabling advanced on-device AI features. These adaptations often involve clock throttling and architecture optimizations to sustain performance under power budgets far below those of desktop counterparts.

External GPUs (eGPUs) extend graphics capabilities by housing desktop-class GPUs in enclosures connected via high-speed interfaces, allowing users to upgrade portable systems without internal modifications. Introduced commercially in 2017 with Thunderbolt 3 support, enclosures like the Razer Core enabled seamless integration of full-sized GPUs into laptops, mitigating the bandwidth limitations of earlier standards. Modern iterations, such as the Razer Core X V2, leverage Thunderbolt 5 for up to 120 Gbps bidirectional throughput, accommodating quad-slot GPUs and providing 140W charging to compatible devices. This setup incurs a performance overhead of 10-30% due to interface latency but unlocks desktop-level rendering and compute tasks for mobile workflows.

Hybrid GPU solutions combine integrated and discrete graphics in a single system, dynamically switching between them to optimize power and performance, often through technologies like NVIDIA Optimus. Optimus employs a software layer that renders demanding frames on the discrete GPU (dGPU) and passes them to the integrated GPU (iGPU) for display, activating the dGPU only when high performance is needed and reducing idle power draw in laptops. Advanced variants, such as NVIDIA Advanced Optimus introduced in recent years, enable direct switching of the display output between GPUs via an embedded multiplexer, minimizing latency and supporting workloads where the CPU, iGPU, and dGPU collaborate on tasks like AI inference. AMD's Accelerated Processing Units (APUs) further exemplify this by fusing CPU and GPU on a single die, facilitating unified memory access and parallel processing in power-sensitive environments.

By 2025, trends in specialized GPUs emphasize AI integration, with mobile chips like Qualcomm's Snapdragon series incorporating GPUs optimized for on-device neural processing. These advancements support efficient edge AI in smartphones, with the global AI chip market projected to reach $40.79 billion in 2025, to which mobile AI applications contribute significantly (estimated at over $20 billion). Emerging prototypes explore wireless eGPU connectivity, aiming to eliminate physical tethers through high-bandwidth standards, though commercial viability remains in early stages amid challenges in latency and power transfer.

Capabilities and Applications

Rendering and Graphics APIs

GPUs play a central role in the rendering pipeline, which transforms 3D models into 2D images displayed on screens through a series of programmable stages. This process begins with vertex processing, where 3D model coordinates are transformed and lit using vertex shaders, followed by primitive assembly, which groups vertices into primitives such as triangles. Rasterization then projects these primitives onto the screen, converting them into fragments or pixels, which are shaded by fragment shaders to determine final colors based on textures, lighting, and materials. The pipeline concludes with output merging, where fragments are blended and written to the framebuffer for display. Powerful GPUs with at least 8 GB of VRAM are generally needed for efficient 3D rendering in creative workflows, providing the parallel processing power and memory capacity to handle complex geometries, high-resolution textures, and real-time computations.

The primary rendering technique in GPUs has long been rasterization, which efficiently scans and fills polygons to generate images at high frame rates suitable for real-time applications. However, rasterization only approximates complex lighting effects like reflections and shadows. To address this, ray tracing simulates light paths by tracing rays from the camera through each pixel and intersecting them with scene geometry to compute accurate reflections, shadows, and refractions. Hardware-accelerated ray tracing became viable in consumer GPUs with NVIDIA's Turing architecture in 2018, which introduced dedicated RT cores to accelerate ray-triangle intersections and traversals. Modern GPUs often employ hybrid rendering, combining rasterization for primary visibility with ray tracing for secondary effects to balance performance and realism.

For 2D graphics, GPUs accelerate vector-based rendering to ensure crisp scaling without pixelation, supporting applications like user interfaces and diagrams. Direct2D, Microsoft's hardware-accelerated 2D graphics API introduced with Windows 7, leverages the GPU for immediate-mode 2D drawing operations, including paths, gradients, and text, optimizing drawing calls for efficient GPU submission. OpenVG, a Khronos Group standard, provides a cross-platform interface for 2D vector graphics acceleration on embedded and mobile devices, handling transformations, fills, and strokes via GPU pipelines. These APIs reduce CPU overhead by offloading anti-aliased rendering and compositing to the GPU, enabling smooth animations and high-resolution displays.

In 3D graphics, low-level APIs enable direct GPU control for complex scenes in games and simulations. Vulkan, released by the Khronos Group in 2016, offers explicit memory management and low-overhead command submission, allowing developers to minimize driver intervention and maximize parallelism across GPU cores. DirectX 12, Microsoft's counterpart, similarly exposes low-level hardware access for Windows platforms, supporting features like multi-threading and flexible resource binding to reduce latency. OpenGL remains a widely used cross-platform API for 3D graphics, though its higher-level abstractions can introduce overhead compared to Vulkan and DirectX 12. Programmable shaders are integral to these APIs; GLSL (OpenGL Shading Language) code compiles to SPIR-V for Vulkan and modern OpenGL, enabling custom vertex, geometry, and fragment processing, while HLSL (High-Level Shading Language) serves DirectX, providing similar programmability with DirectX-specific optimizations.

Recent advancements have enhanced rendering fidelity without sacrificing performance. Real-time global illumination, enabled by ray tracing hardware, simulates indirect lighting bounces for dynamic scenes, as seen in modern game engines where rays compute diffuse interreflections every frame. AI-driven upscaling techniques further address computational demands: NVIDIA's DLSS uses tensor cores and neural networks to upscale lower-resolution frames with temporal data, achieving 4K-quality output at higher frame rates, with DLSS 4 in widespread use by 2025. AMD's FSR employs spatial and temporal upsampling algorithms, is compatible across vendors, and by 2025 includes FSR 4 with AI enhancements for improved detail reconstruction. These methods allow GPUs to deliver photorealistic visuals in real time, transforming interactive graphics.
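The fixed-function and shader stages described above are implemented in dedicated hardware and exposed through APIs such as Vulkan or DirectX 12; the compute-kernel sketch below is only an illustration of the idea, assuming a CUDA-capable GPU. Each thread plays the role of one fragment: it tests coverage of its pixel against a single screen-space triangle (rasterization) and interpolates a colour from the vertices (a toy fragment shader) into a framebuffer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per pixel: coverage test against a screen-space triangle
// (rasterization) and a toy "fragment shader" that interpolates a colour.
__global__ void rasterizeTriangle(uchar3* framebuffer, int width, int height,
                                  float2 a, float2 b, float2 c) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Barycentric coordinates of the pixel centre with respect to triangle abc.
    float2 p = make_float2(x + 0.5f, y + 0.5f);
    float denom = (b.y - c.y) * (a.x - c.x) + (c.x - b.x) * (a.y - c.y);
    float w0 = ((b.y - c.y) * (p.x - c.x) + (c.x - b.x) * (p.y - c.y)) / denom;
    float w1 = ((c.y - a.y) * (p.x - c.x) + (a.x - c.x) * (p.y - c.y)) / denom;
    float w2 = 1.0f - w0 - w1;

    uchar3 colour = make_uchar3(0, 0, 0);                      // background
    if (w0 >= 0.f && w1 >= 0.f && w2 >= 0.f)                   // inside test
        colour = make_uchar3((unsigned char)(255.f * w0),      // "shading":
                             (unsigned char)(255.f * w1),      // interpolate
                             (unsigned char)(255.f * w2));     // vertex colours
    framebuffer[y * width + x] = colour;                       // output merge
}

int main() {
    const int W = 256, H = 256;
    uchar3* fb;
    cudaMallocManaged(&fb, W * H * sizeof(uchar3));

    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    rasterizeTriangle<<<grid, block>>>(fb, W, H,
                                       make_float2(128, 20),
                                       make_float2(20, 230),
                                       make_float2(236, 230));
    cudaDeviceSynchronize();
    std::printf("centre pixel RGB = %d %d %d\n",
                fb[(H / 2) * W + W / 2].x, fb[(H / 2) * W + W / 2].y,
                fb[(H / 2) * W + W / 2].z);
    cudaFree(fb);
    return 0;
}
```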

General-Purpose Computing (GPGPU)

General-purpose computing on graphics processing units (GPGPU) refers to the use of GPUs as versatile co-processors for data-parallel workloads beyond traditional graphics rendering, such as scientific simulations and data processing tasks. This approach leverages the GPU's architecture of thousands of simple cores optimized for parallel execution, enabling significant speedups over CPU-only approaches for suitable algorithms. The concept gained prominence with NVIDIA's introduction of CUDA in 2006, which provided a C/C++-like programming model to map general-purpose kernels onto GPU thread blocks and grids, treating the GPU as an extension of the CPU for compute-intensive operations.

Key frameworks have facilitated GPGPU adoption across vendors. CUDA remains NVIDIA-specific but dominant, supporting a mature toolchain and optimized libraries for parallel primitives. OpenCL, released by the Khronos Group in 2009, offers a vendor-agnostic alternative with a C99-based kernel language for heterogeneous platforms including CPUs, GPUs, and accelerators, promoting portability through its platform and execution models. AMD's ROCm platform, launched in 2016, provides an open-source ecosystem for its GPUs, while HIP, a C++ runtime API, enables source-to-source translation of CUDA code into portable code and back, enhancing portability without full rewrites. These tools abstract hardware details, allowing developers to express parallelism via kernels executed on SIMD-like warps or wavefronts.

GPGPU finds applications in domains requiring high-throughput floating-point operations. In scientific computing, GPUs accelerate molecular dynamics simulations by parallelizing force calculations across atom interactions; early implementations achieved up to 20-fold speedups on all-atom models. In media processing, GPUs execute video encoding stages in parallel, reducing transcoding times for formats such as H.264 through compute shaders. Basic cryptocurrency mining algorithms, such as the SHA-256 hashing used in early Bitcoin mining, also exploit GPU parallelism to evaluate nonce values across threads, yielding orders-of-magnitude efficiency gains over CPUs before ASIC dominance. These uses highlight GPGPU's strength in problems with regular data access patterns.

Despite these advantages, GPGPU faces limitations inherent to GPU architecture. Branch divergence occurs when threads in a warp (typically 32 threads on NVIDIA hardware, or 64 in an AMD wavefront) take different conditional paths, serializing execution as the hardware executes one branch at a time while masking inactive threads; fully divergent code can run up to 32 times slower than uniform execution. Additionally, data transfer overhead via PCIe interconnects, which offer roughly 16-32 GB/s of bidirectional bandwidth on modern versions, bottlenecks workloads with frequent host-device copies, often accounting for 20-50% of total latency in non-unified memory setups; techniques such as pinned memory and asynchronous transfers mitigate but do not eliminate this.
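As a concrete illustration of the kernel/grid model and of the host-device transfer overhead discussed above, the sketch below (assuming the CUDA toolkit) runs a SAXPY kernel in which each thread updates one array element; it is a minimal example rather than a tuned implementation.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// SAXPY (y = a*x + y): each thread handles one element, mapped onto the
// grid/block hierarchy that CUDA exposes for data-parallel kernels.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));

    // Host-to-device copies cross the PCIe link; for transfer-heavy workloads
    // this is the overhead discussed above.
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (n + block - 1) / block;   // enough blocks to cover all elements
    saxpy<<<grid, block>>>(n, 3.0f, dx, dy);

    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("y[0] = %.1f (expected 5.0)\n", hy[0]);

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```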

Emerging Roles in AI and Simulation

Graphics processing units (GPUs) have become indispensable in artificial intelligence (AI) and machine learning (ML) workflows, particularly for training neural networks through backpropagation, a process that involves intensive parallel computation of gradients across vast datasets. This parallelism enables GPUs to handle the matrix multiplications and tensor operations essential for deep learning models, outperforming traditional CPUs by orders of magnitude in training times for large-scale neural architectures. Powerful GPUs with at least 8 GB of VRAM are generally required for AI processing in creative applications, such as generative models for image and video synthesis, as they provide the memory to store model parameters, activations, and batches during inference and fine-tuning. For instance, NVIDIA's Transformer Engine optimizes tensor operations in transformer-based models by leveraging 8-bit floating-point (FP8) precision on compatible GPUs, reducing memory usage and accelerating training while maintaining model accuracy.

In simulation domains, GPUs facilitate high-fidelity modeling of complex physical phenomena, such as fluid dynamics and climate systems, by parallelizing iterative solvers in physics engines. Tools like Ansys Fluent, when GPU-accelerated, can perform fluid simulations up to 10 times faster than CPU-based methods, with speedups varying by simulation type and hardware, enabling engineers to iterate designs more rapidly in aerospace and automotive applications. Similarly, in climate modeling, GPU-based ocean dynamical cores, such as those implemented in Oceananigans.jl, support mesoscale eddy-resolving simulations with enhanced resolution and speed, aiding predictions of ocean-atmosphere interactions critical for forecasting environmental changes. These capabilities extend to real-time simulations in virtual reality (VR) environments, where GPUs enable interactive ray tracing for immersive physics-based experiences, though this remains computationally demanding.

As of 2025, GPUs play a pivotal role in accelerating generative AI tasks, exemplified by models like Stable Diffusion, which rely on GPU tensor cores for efficient diffusion processes when synthesizing images from textual prompts. NVIDIA's RTX series GPUs, with their high VRAM and tensor-core optimization, allow for local inference and fine-tuning of such models, though the maximum number of parameters feasible for inference is limited by GPU memory constraints, including the precision format (e.g., FP16/BF16 or quantized INT8/INT4), framework overhead (typically 10-20%), and the KV cache, which grows with context length and batch size (a rough sizing sketch appears below). In edge AI for autonomous vehicles, embedded GPUs process sensor data in real time for perception and path planning, mitigating latency issues associated with cloud dependency and enhancing safety through on-device inference.

Despite these advances, challenges persist in scaling AI applications across multi-GPU clusters, including interconnect bottlenecks and synchronization overheads that limit efficient distributed training of massive models. Ethical concerns also arise in AI development, particularly regarding biases in the datasets used for training and optimization, which can perpetuate societal inequities if not addressed through diverse data curation and auditing practices.
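The memory budget mentioned above can be estimated before attempting local inference. The sketch below is a back-of-the-envelope calculator under purely illustrative assumptions (a hypothetical 8-billion-parameter model, 32 layers with 8 KV heads of dimension 128, an 8k context, and 15% framework overhead); none of the numbers are vendor-published figures.

```cuda
#include <cstdio>

// Rough VRAM estimate for running a large language model locally:
// weights + KV cache, inflated by an assumed framework overhead.
double estimateVramGiB(double paramsBillions, double bytesPerParam,
                       double overheadFraction,
                       int layers, int kvHeads, int headDim,
                       int contextLen, int batch, double kvBytesPerValue) {
    double weights = paramsBillions * 1e9 * bytesPerParam;
    // KV cache: two tensors (K and V) per layer, per token, per head.
    double kvCache = 2.0 * layers * kvHeads * headDim *
                     (double)contextLen * batch * kvBytesPerValue;
    double total = (weights + kvCache) * (1.0 + overheadFraction);
    return total / (1024.0 * 1024.0 * 1024.0);
}

int main() {
    // Hypothetical 8B-parameter model, FP16 KV cache, batch size 1.
    std::printf("FP16 weights: ~%.1f GiB\n",
                estimateVramGiB(8, 2.0, 0.15, 32, 8, 128, 8192, 1, 2.0));
    std::printf("INT4 weights: ~%.1f GiB\n",
                estimateVramGiB(8, 0.5, 0.15, 32, 8, 128, 8192, 1, 2.0));
    return 0;
}
```

Under these assumptions the FP16 configuration needs roughly 18 GiB while the 4-bit quantized one fits in about 5-6 GiB, which is why quantization is the usual route to running such models on consumer cards.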

Performance and Efficiency

Evaluation Metrics and Benchmarks

Graphics processing units (GPUs) are evaluated using several standardized metrics that quantify their computational capabilities and throughput. Teraflops (TFLOPS) measure peak theoretical floating-point operations per second, serving as a primary indicator of compute performance across precisions such as FP32 or FP16, with higher values denoting greater potential for parallel processing. Frames per second (FPS) assesses rendering speed in gaming and real-time graphics, directly correlating with user-perceived smoothness at a given resolution. Memory bandwidth, expressed in gigabytes per second (GB/s), quantifies data transfer rates between the GPU's memory and its processing cores, and is critical for bandwidth-intensive workloads where low values can bottleneck performance.

Standardized benchmarks provide reproducible ways to compare GPU performance across domains. For consumer graphics and gaming, 3DMark evaluates DirectX 12-based rendering and ray tracing capabilities through tests like Time Spy for general graphics and Port Royal for real-time ray tracing effects. In professional applications such as CAD and visualization, SPECviewperf 15 (released May 2025) serves as the industry standard, simulating workloads from software such as 3ds Max, CATIA, Creo, Maya, and SolidWorks using OpenGL, DirectX 12, and Vulkan APIs to measure 3D graphics throughput in shaded, wireframe, and transparency modes. For AI and machine learning, MLPerf Inference benchmarks, initiated in 2018 through an industry-academic collaboration and now governed by MLCommons, assess model execution speed and latency on GPUs, including metrics like tokens per second for language models and 90th- or 99th-percentile latency in single- and multi-stream scenarios.

Benchmarks distinguish between synthetic tests, which isolate specific features such as ray tracing in Port Royal to evaluate hardware limits under controlled conditions, and real-world scenarios that better reflect application performance but vary with software optimizations. Synthetic tests are useful for highlighting capabilities such as real-time ray tracing, where scores reveal how GPUs handle complex light simulations without game-specific variables. By 2025, standards like MLPerf Inference v5.1 incorporate AI-specific metrics, emphasizing inference latency for tasks like Llama 3.1 processing, with offline throughput exceeding thousands of queries per second on high-end GPUs, establishing reference points for edge and datacenter deployment.

Performance evaluation must also account for influencing factors such as resolution scaling and driver optimizations. Higher resolutions, such as 4K versus 1080p, increase GPU load and reduce FPS due to the greater pixel count, so benchmarks often combine results geometrically across titles to normalize comparisons. Driver updates from manufacturers like NVIDIA and AMD can improve performance by 10-20% in targeted workloads through better game-specific optimizations and API support, necessitating periodic retesting to capture these improvements accurately.
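Peak specification figures and measured throughput differ, which is why micro-benchmarks are used alongside application tests. The sketch below, assuming the CUDA toolkit, times repeated device-to-device copies with CUDA events and reports sustained memory bandwidth; real benchmark suites use far more elaborate methodology.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Measure achieved device-memory bandwidth with large device-to-device copies,
// timed with CUDA events. Sustained GB/s is typically below the peak figure
// quoted in specifications.
int main() {
    const size_t bytes = 256ull * 1024 * 1024;   // 256 MiB per buffer
    float *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);
    cudaMemset(src, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 20;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each copy reads and writes the buffer, so count twice the bytes moved.
    double gbps = (2.0 * bytes * iters) / (ms / 1000.0) / 1e9;
    std::printf("Sustained bandwidth: %.1f GB/s\n", gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```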

Power Consumption and Optimization

Graphics processing units exhibit significantly higher power consumption than central processing units because their architecture is optimized for massive parallelism, with thousands of cores operating simultaneously. This leads to thermal design power (TDP) ratings that can reach substantial levels; for instance, NVIDIA's H100 PCIe GPU has a TDP of 350 W, while AMD's MI300A accelerator ranges from 550 W to 760 W depending on configuration. Such power demands are particularly pronounced in data center environments, where GPU clusters for AI training can consume kilowatts per node, necessitating advanced cooling and power delivery systems.

Power usage in GPUs has both dynamic and static components. Dynamic power, which dominates during active computation, scales with the square of the supply voltage and linearly with clock frequency and switching activity across cores and memory hierarchies. Static power, arising from leakage currents, becomes more significant at smaller process nodes and under low-utilization scenarios. Workload characteristics play a key role: compute-bound tasks like matrix multiplication in general-purpose GPU (GPGPU) applications draw more power than memory-bound graphics rendering, with variations of up to 71 W observed across identical NVIDIA P100 GPUs running the same kernels. Additionally, GPU utilization, often below 50% in high-performance computing workloads, exacerbates inefficiency, as idle cores still contribute to baseline power draw.

Hardware-level optimizations are essential for mitigating these issues. Dynamic voltage and frequency scaling (DVFS) adjusts voltage and clock speed to match workload intensity, enabling energy savings of 20-50% with performance penalties under 10% in many cases, as implemented in modern NVIDIA, AMD, and Intel GPUs. Clock gating, which halts clock signals to inactive circuit blocks, reduces dynamic power by eliminating unnecessary toggling and is particularly effective in shader cores and memory controllers. Power gating complements this by cutting the power supply to dormant units, such as unused streaming multiprocessors, targeting static leakage and achieving up to 90% power reduction in idle states without performance impact. These methods are integrated into GPU architectures via hardware counters and driver-exposed interfaces, allowing real-time profiling for power modeling.

Architectural and software innovations drive further efficiency gains. Advances in fabrication processes, from 12 nm to 4 nm nodes, have roughly halved power per transistor while increasing density, improving overall energy efficiency. Specialized hardware such as the tensor cores in NVIDIA GPUs and the matrix cores in AMD accelerators optimizes AI workloads, delivering up to 4x higher throughput at similar power levels through reduced-precision computation. On the software side, techniques such as data quantization, which reduces bit precision from 32 to 8 bits, and kernel fusion, which combines operations to minimize memory accesses, can improve energy efficiency by 2-5x for inference. In data centers, capping GPU power at 50-70% of TDP sustains about 85% of performance for certain HPC benchmarks while cutting energy use by up to 50%. Emerging methods, including machine learning-based DVFS tuning, promise an additional 10-20% improvement by predicting workload patterns offline.
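Power draw and enforced limits can be read at runtime through vendor telemetry libraries. The sketch below uses NVIDIA's NVML (assuming an NVIDIA GPU with the NVML headers installed and linking against `-lnvidia-ml`) to report the current draw of the first device against its enforced power limit; equivalent facilities exist for other vendors.

```cuda
#include <cstdio>
#include <nvml.h>   // NVIDIA Management Library; link with -lnvidia-ml

// Read the current power draw and the enforced power limit of GPU 0.
// Power capping, as discussed above, lowers the limit reported here.
int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        std::printf("NVML initialisation failed.\n");
        return 1;
    }
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        unsigned int usage_mw = 0, limit_mw = 0;
        nvmlDeviceGetPowerUsage(dev, &usage_mw);          // milliwatts
        nvmlDeviceGetEnforcedPowerLimit(dev, &limit_mw);  // milliwatts
        std::printf("GPU 0: drawing %.1f W of a %.1f W enforced limit\n",
                    usage_mw / 1000.0, limit_mw / 1000.0);
    }
    nvmlShutdown();
    return 0;
}
```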
