Piledriver (microarchitecture)
| General information | |
|---|---|
| Launched | May 15, 2012 |
| Common manufacturer | |
| Physical specifications | |
| Sockets | |
| Architecture and classification | |
| Technology node | 32 nm SOI GF |
| Instruction set | AMD64 (x86-64) |
| Products, models, variants | |
| Core names | |
| History | |
| Predecessor | Bulldozer - Family 15h |
| Successor | Steamroller - Family 15h (3rd-gen) |
| Support status | |
| iGPU unsupported | |
AMD Piledriver Family 15h is a microarchitecture developed by AMD as the second-generation successor to Bulldozer. It targets desktop, mobile and server markets. It is used for the AMD Accelerated Processing Unit (formerly Fusion), AMD FX, and the Opteron line of processors.
The changes over Bulldozer are incremental. Piledriver uses the same "module" design. Its main improvements are to branch prediction and FPU/integer scheduling, along with a switch to hard-edge flip-flops to reduce power consumption. This resulted in clock speed gains of 8–10% and a performance increase of around 15% with similar power characteristics.[1] The FX-9590 is around 40% faster than the Bulldozer-based FX-8150, mostly because of its higher clock speed.[citation needed]
Products based on Piledriver were first released on 15 May 2012 with the Trinity series of mobile AMD Accelerated Processing Units (APUs).[2] APUs aimed at desktops followed in early October 2012, with Piledriver-based FX-series CPUs released later in the month.[3][4] Opteron server processors based on Piledriver were announced in early December 2012.[5]
Design
Piledriver includes improvements over the original Bulldozer microarchitecture:[6][7]
- Clustered Multi-Thread
- Higher clock rates
- Instructions per clock (IPC) improvements
- Lower power consumption and temperatures
- Turbo Core 3.0
- Faster integrated memory controller (IMC)
- Fixed hardware divider
- Improved branch prediction and prefetching
- Perceptron branch predictor[8]
- Improved floating-point and integer scheduling
- Support for Advanced Vector Extensions (AVX) 1.1,[9][10] FMA3, BMI1 and TBM
- Larger L1 translation lookaside buffers (TLB) and L2 efficiency improvements
- Switch to hard-edge flip-flops, allowing a decrease in power consumption
- Cyclos resonant clock mesh (RCM) technology[11]
- 17–220 W thermal design power (TDP)
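The list above names a perceptron branch predictor without describing it. As a rough illustration of the general technique only (a generic textbook-style perceptron predictor, not AMD's actual design, which is not documented here), the sketch below keeps one small weight vector per hashed branch address and predicts "taken" when the weighted sum over recent branch outcomes is non-negative. The table size, history length, threshold heuristic, and all names are illustrative assumptions.

```python
class PerceptronPredictor:
    """Generic perceptron branch predictor sketch (not AMD's implementation)."""

    def __init__(self, num_entries=64, history_len=8):
        self.num_entries = num_entries
        # Assumed training threshold; a heuristic commonly quoted for this technique.
        self.threshold = int(1.93 * history_len + 14)
        # One weight vector per table entry: bias weight plus one weight per history bit.
        self.weights = [[0] * (history_len + 1) for _ in range(num_entries)]
        # Global history of recent outcomes: +1 for taken, -1 for not taken.
        self.history = [-1] * history_len

    def _output(self, pc):
        w = self.weights[pc % self.num_entries]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        """Predict taken when the weighted sum is non-negative."""
        return self._output(pc) >= 0

    def update(self, pc, taken):
        """Train on the real outcome, then shift it into the global history."""
        y = self._output(pc)
        t = 1 if taken else -1
        # Adjust weights on a misprediction or when confidence is still low.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w = self.weights[pc % self.num_entries]
            w[0] += t
            for i, hi in enumerate(self.history):
                w[i + 1] += t * hi
        self.history = self.history[1:] + [t]


if __name__ == "__main__":
    predictor = PerceptronPredictor()
    for _ in range(200):
        predictor.update(0x40, True)  # a branch that is always taken
    print(predictor.predict(0x40))
```

A hardware predictor would use small saturating weights and a fixed-size table indexed by address bits; the unbounded Python integers here just keep the sketch short.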
Features
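Processors
Desktop
| Model | [Modules/FPUs] Cores/threads | Base freq. (GHz) | Turbo freq. (GHz) | L2 cache | L3 cache | GPU model | GPU config | GPU freq. (MHz) | TDP (W) | DDR3 memory | Turbo Core 3.0 | Socket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FX-9590 | [4]8 | 4.7 | 5.0 | 4× 2MB | 8MB | N/a | N/a | N/a | 220 | 1866 | Yes | AM3+ |
| FX-9370 | [4]8 | 4.4 | 4.7 | 4× 2MB | 8MB | N/a | N/a | N/a | 220 | 1866 | Yes | AM3+ |
| FX-8370 | [4]8 | 4.0 | 4.3 | 4× 2MB | 8MB | N/a | N/a | N/a | 125 | 1866 | Yes | AM3+ |
| FX-8370E | [4]8 | 3.3 | 4.3 | 4× 2MB | 8MB | N/a | N/a | N/a | 95 | 1866 | Yes | AM3+ |
| FX-8350 | [4]8 | 4.0 | 4.2 | 4× 2MB | 8MB | N/a | N/a | N/a | 125 | 1866 | Yes | AM3+ |
| FX-8320 | [4]8 | 3.5 | 4.0 | 4× 2MB | 8MB | N/a | N/a | N/a | 125 | 1866 | Yes | AM3+ |
| FX-8320E | [4]8 | 3.2 | 4.0 | 4× 2MB | 8MB | N/a | N/a | N/a | 95 | 1866 | Yes | AM3+ |
| FX-8310 | [4]8 | 3.4 | 4.3 | 4× 2MB | 8MB | N/a | N/a | N/a | 95 | 1866 | Yes | AM3+ |
| FX-8300 | [4]8 | 3.3 | 4.2 | 4× 2MB | 8MB | N/a | N/a | N/a | 95 | 1866 | Yes | AM3+ |
| FX-6350 | [3]6 | 3.9 | 4.2 | 3× 2MB | 8MB | N/a | N/a | N/a | 125 | 1866 | Yes | AM3+ |
| FX-6300 | [3]6 | 3.5 | 4.1 | 3× 2MB | 8MB | N/a | N/a | N/a | 95 | 1866 | Yes | AM3+ |
| FX-4350 | [2]4 | 4.2 | 4.3 | 2× 2MB | 8MB | N/a | N/a | N/a | 125 | 1866 | Yes | AM3+ |
| FX-4320 | [2]4 | 4.0 | 4.2 | 2× 2MB | 4MB | N/a | N/a | N/a | 95 | 1866 | Yes | AM3+ |
| FX-4300 | [2]4 | 3.8 | 4.0 | 2× 2MB | 4MB | N/a | N/a | N/a | 95 | 1866 | Yes | AM3+ |
| A10-6800K | [2]4 | 4.1 | 4.4 | 2× 2MB | N/a | HD 8670D | 384:24:8 | 844 | 100 | 2133 | Yes | FM2 |
| A10-6700 | [2]4 | 3.7 | 4.3 | 2× 2MB | N/a | HD 8670D | 384:24:8 | 844 | 65 | 1866 | Yes | FM2 |
| A10-5800K | [2]4 | 3.8 | 4.2 | 2× 2MB | N/a | HD 7660D | 384:24:8 | 800 | 100 | 1866 | Yes | FM2 |
| A10-5700 | [2]4 | 3.4 | 4.0 | 2× 2MB | N/a | HD 7660D | 384:24:8 | 760 | 65 | 1866 | Yes | FM2 |
| A8-6600K | [2]4 | 3.9 | 4.2 | 2× 2MB | N/a | HD 8570D | 256:16:8 | 844 | 100 | 1866 | Yes | FM2 |
| A8-6500 | [2]4 | 3.5 | 4.1 | 2× 2MB | N/a | HD 8570D | 256:16:8 | 800 | 65 | 1866 | Yes | FM2 |
| A8-5600K | [2]4 | 3.6 | 3.9 | 2× 2MB | N/a | HD 7560D | 256:16:8 | 760 | 100 | 1866 | Yes | FM2 |
| A8-5500 | [2]4 | 3.2 | 3.7 | 2× 2MB | N/a | HD 7560D | 256:16:8 | 760 | 65 | 1866 | Yes | FM2 |
| A6-6400K | [1]2 | 3.9 | 4.1 | 1MB | N/a | HD 8470D | 192:12:4 | 800 | 65 | 1866 | Yes | FM2 |
| A6-5400K | [1]2 | 3.6 | 3.8 | 1MB | N/a | HD 7540D | 192:12:4 | 760 | 65 | 1866 | Yes | FM2 |
| A4-5300 | [1]2 | 3.4 | 3.6 | 1MB | N/a | HD 7480D | 128:8:4 | 723 | 65 | 1600 | Yes | FM2 |
| A4-4000 | [1]2 | 3.0 | 3.2 | 1MB | N/a | HD 7480D | 128:8:4 | 723 | 65 | 1333 | Yes | FM2 |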
The K suffix denotes an unlocked A-series processor. All FX-series processors are unlocked unless otherwise specified.
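The DDR3 Memory column in the table gives transfer rates in MT/s. As a rough guide to what those figures mean, the sketch below converts them to theoretical peak bandwidth, assuming a standard 64-bit (8-byte) DDR3 channel and a dual-channel controller as used on AM3+ and FM2 platforms. The function name is illustrative, and sustained real-world bandwidth is lower than these peaks.

```python
BYTES_PER_TRANSFER = 8  # assumption: one standard 64-bit DDR3 channel


def peak_bandwidth_gbs(mega_transfers_per_s, channels=2):
    """Theoretical peak bandwidth in GB/s (10**9 bytes per second)."""
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * channels / 1e9


if __name__ == "__main__":
    for rate in (1333, 1600, 1866, 2133):
        single = peak_bandwidth_gbs(rate, channels=1)
        dual = peak_bandwidth_gbs(rate, channels=2)
        print(f"DDR3-{rate}: {single:.1f} GB/s per channel, {dual:.1f} GB/s dual-channel")
```

For example, DDR3-1866 works out to about 14.9 GB/s per channel, or about 29.9 GB/s across two channels.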
Mobile
| Model | [Modules/FPUs] Cores | CPU base freq. (GHz) | CPU turbo freq. (GHz) | L2 cache (MB) | GPU model | GPU config | GPU base freq. (MHz) | GPU turbo freq. (MHz) | TDP (W) | DDR3 memory | Socket |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A10-5750M | [2]4 | 2.5 | 3.5 | 2× 2MB | HD 8650G | 384:24:8 | 533 | 720 | 35 | 1866 | FS1r2 |
| A10-4600M | 2.3 | 3.2 | HD 7660G | 496 | 685 | ||||||
| A8-5550M | 2.1 | 3.1 | HD 8550G | 256:16:8 | 554 | 720 | |||||
| A8-4500M | 1.9 | 2.8 | HD 7640G | 496 | 685 | ||||||
| A6-5350M | [1]2 | 2.9 | 3.5 | 1 | HD 8450G | 192:12:4 | 533 | 720 | 1600 | ||
| A6-4400M | 2.7 | 3.2 | HD 7520G | 496 | 685 | ||||||
| A4-5150M | 3.3 | HD 8350G | 128:8:4 | 514 | 720 | ||||||
| A4-4300M | 2.5 | 3.0 | HD 7420G | 480 | 655 | ||||||
| A10-5757M | [2]4 | 2.5 | 3.5 | 2× 2MB | HD 8650G | 384:24:8 | 533 | 720 | 35 | 1600 | FP2(BGA) |
| A10-5745M | 2.1 | 2.9 | HD 8610G | 626 | 25 | 1333 | |||||
| A10-4655M | 2.0 | 2.8 | HD 7620G | 360 | 496 | ||||||
| A8-5557M | 2.1 | 3.1 | HD 8550G | 515 | 720 | 35 | 1600 | ||||
| A8-5545M | 1.7 | 2.7 | HD 8510G | 450 | 554 | 19 | 1333 | ||||
| A8-4555M | 1.6 | 2.4 | HD 7600G | 320 | 424 | ||||||
| A6-5357M | [1]2 | 2.9 | 3.5 | 1 | HD 8450G | 192:12:4 | 533 | 720 | 35 | 1600 | |
| A6-5345M | 2.2 | 2.8 | HD 8410G | 450 | 600 | 17 | 1333 | ||||
| A6-4455M | 2.1 | 2.6 | HD 7500G | 256:16:8 | 327 | 424 | |||||
| A4-5145M | 2.0 | HD 8310G | 192:12:4 | 424 | 544 | ||||||
| A4-4355M | 1.9 | 2.4 | HD 7400G | 327 | 424 | ||||||
Server
Some Opteron 32 nm processors.
History
Komodo platform
Leaked roadmaps showed Piledriver CPUs featuring up to ten cores as part of the Komodo platform. Komodo was to launch in 2012 on the FM2 socket, but this never happened. AMD kept the AM3+ socket for the FX series and put the Piledriver-based APUs on FM2.[12]
FX-series, Athlon and Opteron
In 2010,[13] AMD revealed that the second generation of the Bulldozer core was scheduled for 2012; AMD referred to this generation as Enhanced Bulldozer. It was later codenamed Piledriver.
- Vishera FX-series CPU – Desktop Performance market (Volan platform):[14] Aimed at 95–220 W TDP, this FX series features 4-, 6- and 8-core Piledriver CPU models with Turbo Core 3.0, and reuses the Socket AM3+ format and 900-series motherboard chipsets of the first-generation FX-series Zambezi processors. The second-generation FX series was released on 23 October 2012 with the FX-8350, FX-8320, FX-6300 and FX-4300 CPU models. The FX-8350 featured slightly improved power consumption and was found to be approximately 15% faster than the fastest Bulldozer CPU. The second-generation FX series was praised for its affordability. The FX-8320 was recognized as a price/performance winner, often matching Intel's i7 2600 at half the cost.[15][16] The Vishera CPUs competed well against similarly priced Intel Ivy Bridge CPUs in multi-core-aware applications, but underperformed somewhat in overall efficiency and in tasks where most CPU cores were not fully utilized, such as single-threaded applications and a number of games.[17][18]
On June 11, 2013, AMD announced two additional eight-core Piledriver FX-series CPUs, the FX-9590 and FX-9370, with maximum turbo speeds of 5.0 GHz and 4.7 GHz respectively, making AMD the first company to release a 5 GHz CPU commercially.[19] AMD specifies that the 9xxx-series processors require "robust liquid cooling" due to their high thermal design power (TDP).[20]
- Trinity & Richland Athlon series CPU – Desktop Budget market: The Socket FM2 Athlon X4 730, 740, 750K and 760K CPU models feature four Piledriver cores based on the Trinity design but lack on-chip integrated graphics. The Athlon X2 340 is a dual-core model.[21][22][23] The Socket FM2 Richland-based Athlon X4 760K and Athlon X2 370K CPUs, with four and two cores respectively and no GPU, were expected.[24][25]
For the server market, three versions were stated to be under development:[26]
- Web serving, Web hosting, and Microserver platform (1 CPU) market: Opteron 3200-series (Zurich; 4 or 8 cores) was to be replaced by Delhi (4 or 8 cores) using the Socket AM3+ format from the Desktop FX-series line. The memory controller was to support dual-channel DDR3 memory configuration.
- Cost/energy efficient server (1 to 2 CPUs) market: Opteron 4200-series (Valencia; 6 or 8 cores) was to be replaced by Seoul (6 or 8 cores). Seoul would continue to use the Socket C32 format. The memory controller would support dual-channel DDR3 memory configuration.
- Enterprise/mainstream server (2 to 4 CPUs) market: Opteron 6200-series (Interlagos; 4, 8, 12, and 16 cores) was to be replaced by Abu Dhabi (4, 8, 12, and 16 cores). Abu Dhabi would continue to use the Socket G34. The memory controller would support quad-channel DDR3 memory configuration.
APU lines
- Trinity A-series APU – Desktop Budget and Mainstream market (Virgo platform):[27] The Stars-based Llano Socket FM1 APU line was replaced by the Socket FM2 Trinity Fusion APUs with 2 or 4 Piledriver cores. The A10-5800K, A10-5700, A8-5600K, A8-5500, A6-5400K and A4-5300 APU models were released on 2 October 2012.[28] Trinity processor model numbers ending with the letter "K" denote processors with an unlocked CPU multiplier. The Trinity APU line was praised for its superior integrated graphics performance but underperformed comparable Intel CPU models in most computationally intensive tasks.[29]
- Trinity A-series APU – Notebook Mainstream and Performance market (Comal platform):[27] Notebook computers featuring Trinity APUs shipped as early as June 2012.[30] The mobile Trinity series features four APUs: A10-4600M, A8-4500M, A6-4400M and A4-4300M. In March 2013, AMD announced two more mobile models: A8-4557M and A10-4657M.[31]
In January 2013, AMD officially introduced a new series of APUs codenamed Richland.[32] The series features six new APUs in total. The fastest model, the A10-6800K, featured two Piledriver modules operating at 4.1 GHz (4.4 GHz in turbo mode) and an integrated HD 8670D GPU with 384 stream processors operating at 844 MHz.[33][34] Only the A10-6800K has official DDR3-2133 memory support.[35] The A10-6800K offered an improvement of approximately 5% in performance applications and 3D games over its Trinity-based predecessor, the A10-5800K, largely due to Richland's higher clock speeds and higher overclocking potential. On March 12, 2013, AMD officially introduced four Richland mobile APUs.[36] On June 4, 2013, AMD officially announced six Richland desktop APUs.[37][38]
Performance
In January 2012, Microsoft released two hotfixes (2646060 and 2645594) for Windows 7 and Server 2008 R2 that significantly improved the performance of Clustered Multi-Thread (CMT) based AMD CPUs by improving thread scheduling.[39][40]
Windows 8 supports CMT-based CPUs out of the box by treating each core as a logical core and each module as a physical core.
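To illustrate why the scheduler's view of modules matters, here is a minimal sketch of module-aware thread placement. It assumes, purely for illustration, that logical cores 2m and 2m+1 belong to module m; the function names are hypothetical, and this is not the actual Windows scheduling algorithm. Threads are first spread one per module, so that they do not contend for a module's shared resources, and second cores are used only once every module is busy.

```python
def module_of(core_id, cores_per_module=2):
    """Module index for a logical core, under the assumed numbering scheme."""
    return core_id // cores_per_module


def placement_order(num_modules, cores_per_module=2):
    """Logical core IDs in the order a module-aware scheduler might fill them."""
    order = []
    for slot in range(cores_per_module):      # first core of every module, then second
        for module in range(num_modules):
            order.append(module * cores_per_module + slot)
    return order


if __name__ == "__main__":
    # Four modules, as in an eight-core FX processor.
    print(placement_order(4))  # [0, 2, 4, 6, 1, 3, 5, 7]
```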
References
[edit]- ^ Hruska, Joel. "AMD's FX-8350 analyzed: Does Piledriver deliver where Bulldozer fell short?". ExtremeTech. Retrieved 23 March 2013.
- ^ "AMD launches widely anticipated "Trinity" APU". Press release. AMD. 15 May 2012. Retrieved 16 January 2014.
- ^ "New AMD A-Series Processors Bring Faster Speeds, High Core Count and AMD Radeon HD 7000 Series Graphics to Do-It-Yourself PC Enthusiasts and Gamers". AMD. Retrieved 22 March 2013.
- ^ "New AMD FX Line-Up Brings Faster Speeds and Higher Performance Core over Previous Generation to PC Enthusiasts and Gamers". AMD. Retrieved 22 March 2013.
- ^ "New AMD Opteron 4300 and 3300 Series Processors Deliver Ideal Performance, Power and Price for Cloud Applications". Press release. AMD. 4 December 2012. Retrieved 16 January 2014.
- ^ Hruska, Joel. "AMD detonates Trinity: Behold Bulldozer's second coming". ExtremeTech. Retrieved 22 March 2013.
- ^ Walton, Jarred. "The AMD Trinity Review (A10-4600M): A New Hope". AnandTech. Archived from the original on May 17, 2012. Retrieved 22 March 2013.
- ^ "The AMD Trinity Review (A10-4600M): A New Hope". Archived from the original on May 17, 2012.
- ^ Constantin Murariu (30 April 2012). "AMD Trinity Architectural Preview - Part II". Softpedia. Retrieved 6 January 2014.
- ^ Charlie Demerjian (2012-05-25). "Trinity is more than the sum of its parts". SemiAccurate. Retrieved 2013-10-23.
- ^ Gareth Halfacree (21 February 2012). "AMD packs Cyclos clock tech into Piledriver". Retrieved 21 September 2013.
- ^ "AMD's 10 core Piledriver chips revealed | Tech, Tech News". PC Gamer. 2011-07-26. Retrieved 2013-10-23.
- ^ AMD financial analyst day 2010 press kit, Blogs.amd.com, retrieved 2012-01-23
- ^ AMD Cancels Next-Gen Komodo Processor, Corona Platform in Favour of New Chips, X-bit labs, 2012-01-19, archived from the original on 2012-01-12, retrieved 2012-01-23
- ^ "PassMark – Intel Core i7-2600 @ 3.40GHz – Price performance comparison". cpubenchmark.net. Retrieved 11 March 2015.
- ^ "PassMark - AMD FX-8320 Eight-Core - Price performance comparison". cpubenchmark.net. Retrieved 11 March 2015.
- ^ Ilya Gavrichenkov (22 October 2012). "AMD FX-8350 Processor Review: Tuned-Up Bulldozer". Archived from the original on 7 January 2014. Retrieved 7 January 2014.
- ^ "AMD FX-8350 – "Piledriver" for AMD Socket AM3+". 23 October 2012. Retrieved 21 September 2013.
- ^ Sharif Sakr (11 June 2013). "AMD wins race to 5GHz CPU clock speed, in which it was the sole participant". Retrieved 21 September 2013.
- ^ "AMD FX Processors". Retrieved 10 February 2018.
- ^ "AMD Athlon X4 750K - AD750KWOA44HJ / AD750KWOHJBOX". cpu-world.com. Retrieved 11 March 2015.
- ^ "AMD Athlon X2 340 - AD340XOKA23HJ / AD340XOKHJBOX". Cpu-world.com. Retrieved 2013-10-23.
- ^ Sebastian Pop (4 October 2012). "AMD Lists Athlon II X4 700-Series Trinity Processors". softpedia. Retrieved 11 March 2015.
- ^ "AMD Athlon X2 370K - AD370KOKA23HL / AD370KOKHLBOX". Cpu-world.com. Retrieved 2013-10-23.
- ^ "AMD Athlon X4 760K - AD760KWOA44HL / AD760KWOHLBOX". Cpu-world.com. Retrieved 2013-10-23.
- ^ Su, Lisa (2012-02-02). "Consumerization, Cloud, Convergence" (PDF). AMD 2012 Financial Analyst Day. Sunnyvale, California: Advanced Micro Devices. p. 25. Retrieved 2012-02-04.
- ^ a b "AMD Trinity core". cpu-world.com. Retrieved 11 March 2015.
- ^ "AMD Newsroom". amd.com. Retrieved 11 March 2015.
- ^ "AMD Trinity for Desktops. Part 2: Socket FM2 Platform and AMD A10-5800K Processor Review - X-bit labs". Archived from the original on 2013-04-02. Retrieved 2013-03-20.
- ^ Chris Angelini. "AMD Trinity On The Desktop: A10, A8, And A6 Get Benchmarked!". Tom's Hardware. Retrieved 11 March 2015.
- ^ Fudzilla staff. "AMD lists two new Trinity based mobile APUs". fudzilla.com. Retrieved 11 March 2015.
- ^ "AMD Newsroom". amd.com. Retrieved 11 March 2015.
- ^ "AMD A10-Series A10-6800K - AD680KWOA44HL / AD680KWOHLBOX". cpu-world.com. Retrieved 11 March 2015.
- ^ Peter Scott. "More Richland details leak, six parts confirmed". fudzilla.com. Retrieved 11 March 2015.
- ^ Fuad Abazovic. "Top Richland 28nm APU is A10 6800K". fudzilla.com. Retrieved 11 March 2015.
- ^ "AMD intros 35W Richland mobile APUs". techreport.com. Retrieved 11 March 2015.
- ^ "AMD A-SERIES APUs: Made for Combat. Ready for War". Sites.amd.com. 2013-10-06. Retrieved 2013-10-23.
- ^ "Specifications of upcoming AMD Richland APUs". Cpu-world.com. Retrieved 2013-10-23.
- ^ An update is available for computers that have an AMD FX, AMD Opteron 4200, AMD Opteron 6200, or AMD Bulldozer series processor installed and that are running Windows 7 or Windows Server 2008 R2, support.microsoft.com, January 2012, retrieved 2014-02-11
- ^ An update that selectively disables the Core Parking feature in Windows 7 or in Windows Server 2008 R2 is available, support.microsoft.com, January 2012, retrieved 2014-02-11
Architecture
Core Design
The Piledriver microarchitecture employs a module-based design inherited from its Bulldozer predecessor, where each module consists of two integer cores that share a single floating-point unit (FPU), a 64 KB L1 instruction cache, and a 2 MB L2 cache. This shared structure aims to optimize die area and power efficiency by centralizing certain resources, while each integer core keeps its own 16 KB L1 data cache and its own execution units for scalar integer operations. The shared FPU handles both cores' floating-point and vector workloads, enabling simultaneous processing but potentially introducing contention in multi-threaded scenarios.[1]
The integer pipeline in Piledriver is 4-wide (two ALUs and two AGUs per core), as in Bulldozer, but with refined out-of-order scheduling: a 40-entry integer scheduler and 96 physical registers per core support better instruction throughput and fewer stalls than the prior architecture. The four execution pipelines in each core are two arithmetic logic units (ALUs) for general operations, one of which also handles multiplication and jumps, and two address generation units (AGUs) for memory accesses.[1][4]
Floating-point capabilities are centralized in the shared FPU, which supports 128-bit vector operations via SSE and AVX instructions and can process two 128-bit operations or one 256-bit operation per clock cycle. Key enhancements include support for fused multiply-add (FMA3 and FMA4) instructions, with a full throughput of two 128-bit FMAs per cycle and a latency of 5-6 cycles for add and multiply operations.
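As a back-of-the-envelope check on the FMA throughput described above, the following sketch counts peak double-precision FLOPs per cycle for one module. It assumes two 128-bit FMA operations per cycle, 64-bit doubles, and the usual convention that one FMA counts as two floating-point operations. The four-module, 4.0 GHz configuration in the usage lines is an example (it matches the FX-8350 base clock quoted in this article), and the result is a theoretical peak, not a measured figure.

```python
def peak_flops_per_cycle(fma_per_cycle=2, vector_bits=128, element_bits=64):
    """Peak FLOPs per cycle for one module's shared FPU under the stated assumptions."""
    lanes = vector_bits // element_bits   # elements processed per vector operation
    return fma_per_cycle * lanes * 2      # each FMA is one multiply plus one add


if __name__ == "__main__":
    per_module = peak_flops_per_cycle()
    print(per_module)                     # 8 double-precision FLOPs per cycle
    modules, clock_ghz = 4, 4.0           # example configuration
    print(per_module * modules * clock_ghz)  # 128.0 GFLOPS theoretical peak
```

With 32-bit elements the same formula gives 16 single-precision FLOPs per cycle per module.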
The FPU comprises four pipelines dedicated to floating-point arithmetic, division, and vector integer tasks, balancing scalar and vector workloads.[1][5]
Branch prediction in Piledriver uses a Level-1 branch target buffer (BTB) with 512 entries (128 sets, 4-way associative), and refinements to the hybrid predictor give it slightly better accuracy than Bulldozer. A perceptron-based predictor improves accuracy for complex conditional patterns, while a Level-2 BTB of 5120 entries (1024 sets, 5-way) serves as a larger backup. The front end supports 4-way decoding in a shared stage that alternates between the two threads in a module, with register renaming tied to the out-of-order schedulers; the floating-point scheduler holds 60 entries with 160 registers. This setup allows for 1-2 taken branches per cycle but incurs a high misprediction penalty due to the 19-stage pipeline depth.[1]
Cache and Memory Subsystem
The Piledriver microarchitecture, part of AMD's Family 15h processors, features a multi-level cache hierarchy designed to balance latency and bandwidth in multi-core configurations, with optimizations inherited from its Bulldozer predecessor and refined for the shared module design. Each compute module consists of two integer cores that share certain cache resources to reduce die area while keeping data caches private to each core. The primary caches are split into instruction and data components, with the instruction cache shared across the module.[6][1]
The level-1 instruction cache (L1I) measures 64 KB per module and is 2-way set-associative with a 64-byte line size, enabling low-latency fetches of up to 32 bytes per cycle for the shared front end. In contrast, each core has its own dedicated 16 KB L1 data cache (L1D), which is 4-way set-associative and also uses 64-byte lines, providing two 128-bit read/write ports for load/store operations with a load-to-use latency of 3-4 clock cycles. This asymmetric design keeps the data caches close to the execution units while sharing instruction fetch, a carryover from Bulldozer that minimizes redundancy in the module but requires careful coherence management.[6][1]
The level-2 (L2) cache totals 2 MB per module, shared between the two cores, and employs 16-way set associativity with 64-byte lines, supporting larger working sets than L1. It operates as a mostly exclusive victim cache relative to the L1D, meaning evicted L1 data lines populate the L2 without full inclusion of L1 contents, which helps reduce latency for repeated accesses but demands snoop traffic for coherence.
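The number of sets implied by these cache parameters follows from sets = capacity / (ways × line size). The sketch below applies that formula to the L1 and L2 figures quoted above; it is plain arithmetic on the stated sizes, and the function and dictionary names are illustrative.

```python
KB = 1024
MB = 1024 * KB


def num_sets(capacity_bytes, ways, line_bytes=64):
    """Number of sets in a set-associative cache of the given geometry."""
    sets, remainder = divmod(capacity_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


# (capacity, associativity) as quoted in the text; all use 64-byte lines.
CACHES = {
    "L1I": (64 * KB, 2),
    "L1D": (16 * KB, 4),
    "L2": (2 * MB, 16),
}

if __name__ == "__main__":
    for name, (capacity, ways) in CACHES.items():
        print(f"{name}: {num_sets(capacity, ways)} sets")
```

Under these figures the L1I has 512 sets, each L1D has 64 sets, and the L2 has 2048 sets.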
L2 load-to-use latency stands at approximately 20-21 clock cycles, with read throughput of one access every 4 cycles and write throughput limited to one every 12 cycles. Enhancements over Bulldozer include better prefetching to boost hit rates in bandwidth-sensitive scenarios.[6][1]
At the last level, the L3 cache provides up to 8 MB shared across all modules on the die (varying by model, with some implementations at 2-6 MB), implemented as a non-inclusive victim cache with 64-way set associativity and 64-byte lines to capture inter-module data sharing. This design installs lines evicted from any L2 cache, improving overall hit rates for multi-threaded applications by acting as a global filter, though it has a higher access latency of about 87 clock cycles, with read throughput of one access every 15 cycles and write throughput of one every 21 cycles. The L3 serves as the coherency point for on-die communication, incorporating a snoop filter to minimize unnecessary probes between modules, a foundational approach that prefigures the scalable Infinity Fabric interconnect in later AMD architectures.[6][1]
Piledriver integrates a dual-channel memory controller supporting DDR3 up to 1866 MT/s, for a theoretical peak of about 14.9 GB/s per channel (about 29.9 GB/s across both channels), with optimizations like on-die termination to reduce signal reflections. For external connectivity, it employs HyperTransport 3.0 links at up to 6.4 GT/s; server variants such as Opteron provide up to four links, serving as non-coherent links for I/O and coherent links for multi-socket configurations.[6][1]
Features
Instruction Set Extensions
The Piledriver microarchitecture provides full support for the x86-64 instruction set architecture, including baseline compatibility with the SSE, SSE2, SSE3, SSSE3, and SSE4.2 extensions for enhanced multimedia and string processing operations.[1] It also incorporates AES-NI (Advanced Encryption Standard New Instructions) for accelerating cryptographic workloads through dedicated hardware instructions like AESENC and AESDEC.[1]
Piledriver includes an enhanced implementation of AVX (Advanced Vector Extensions), utilizing 256-bit vector registers (YMM0-YMM15) to enable wider SIMD processing for floating-point operations. The floating-point execution unit supports one 256-bit AVX instruction per clock cycle, equivalent to two 128-bit operations, with no frequency penalty when mixing AVX and legacy SSE instructions. Additionally, Piledriver supports FMA3 and FMA4 fused multiply-add instructions, allowing the execution unit to perform two 128-bit FMAs per cycle and deliver up to 8 double-precision FLOPs per cycle per module.[1] This enhancement targets high-performance computing tasks requiring intensive vector arithmetic.[7]
The architecture includes BMI1 (Bit Manipulation Instruction Set 1) extensions, providing instructions such as ANDN, BEXTR, BLSI, and TZCNT for efficient bit-level operations in algorithms like hashing and compression. Piledriver also supports F16C (Half-Precision Floating-Point Conversion) instructions, including VCVTPS2PH and VCVTPH2PS, to accelerate conversions between 16-bit and 32-bit floating-point formats, beneficial for graphics and machine learning applications.[8][7]
For virtualization, Piledriver incorporates AMD-V with Nested Paging (also known as Rapid Virtualization Indexing or RVI), which optimizes virtual-to-physical address translations to reduce overhead in hypervisor environments.
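The BMI1 instructions named above each replace a short sequence of ordinary bit operations. The sketch below models their 64-bit results in software to show what each one computes. It is an illustration of the operations' semantics only: the Python function names and argument order are assumptions for readability (the real BEXTR takes start and length packed into a single control operand), and flag updates are not modeled.

```python
MASK64 = (1 << 64) - 1  # model 64-bit registers


def andn(a, b):
    """ANDN: bitwise AND of b with the complement of a."""
    return (~a & b) & MASK64


def bextr(src, start, length):
    """BEXTR: extract `length` bits of src beginning at bit `start`."""
    start &= 0xFF
    length &= 0xFF
    if start >= 64:
        return 0
    field = (src & MASK64) >> start
    if length >= 64:
        return field
    return field & ((1 << length) - 1)


def blsi(x):
    """BLSI: isolate the lowest set bit (0 if x is 0)."""
    x &= MASK64
    return x & (-x & MASK64)


def tzcnt(x):
    """TZCNT: count trailing zero bits; returns the operand width (64) for x == 0."""
    x &= MASK64
    if x == 0:
        return 64
    return (x & -x).bit_length() - 1


if __name__ == "__main__":
    print(bin(andn(0b1100, 0b1010)))   # 0b10
    print(hex(bextr(0xABCD, 4, 8)))    # 0xbc
    print(blsi(0b101000))              # 8
    print(tzcnt(0b101000))             # 3
```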
Specific integer instructions like POPCNT (population count) and LZCNT (leading zero count) are also supported, improving performance in workloads involving bit counting and normalization, such as data compression and scientific simulations.[1] These extensions collectively ensure broad compatibility while boosting efficiency in parallel and vectorized code.[9]
Power and Thermal Management
Piledriver incorporates dynamic clocking mechanisms to optimize performance while managing power draw, primarily through AMD's Core Performance Boost (CPB), formerly known as Turbo CORE technology. This feature lets the processor exceed its base frequency under light thread loads when thermal and power headroom permits (the FX-8350, for example, boosts from 4.0 GHz to 4.2 GHz), improving single-threaded performance without exceeding predefined limits.[10]
To minimize idle power consumption, Piledriver employs granular power gating and clock gating techniques applied per module. Power gating dynamically isolates inactive circuit blocks by cutting off their supply voltage, while extensive clock gating at the flip-flop level prevents unnecessary clock toggling in idle states, significantly reducing both dynamic and leakage power in multi-module configurations.[11]
Thermal management in Piledriver relies on integrated on-die sensors that monitor junction temperatures in real time and trigger throttling when temperatures approach Tjmax thresholds, typically 90°C to 105°C, to prevent damage and maintain reliability. This system integrates with the evolved AMD Cool'n'Quiet technology, which extends dynamic frequency and voltage scaling from prior architectures and adds selective core parking in multi-module chips to idle underutilized cores during low-demand scenarios.[12][13]
Thermal design power (TDP) ratings span 65 W to 125 W for most desktop variants, suiting a range of cooling solutions and workloads, while the FX-9590 and FX-9370 are rated at 220 W and server implementations extend up to 140 W to support higher core counts and sustained operations.[14]
Implementations
Desktop Processors
The desktop implementations of the Piledriver microarchitecture were embodied in the AMD FX series under the Vishera codename, targeting high-performance consumer computing without integrated graphics. These processors used a 32 nm silicon-on-insulator (SOI) process node, with the 8-core variants featuring a die size of 315 mm² and approximately 1.2 billion transistors.[15] They maintained compatibility with the Socket AM3+ platform, supporting dual-channel DDR3 memory up to 1866 MT/s. A key feature of the FX lineup was the unlocked CPU multiplier, enabling straightforward overclocking for enthusiasts via BIOS adjustments without specialized hardware.[16]
Representative models included the 8-core FX-8350, the flagship at launch, which operated at a 4.0 GHz base clock and up to 4.2 GHz turbo boost under lighter loads, with a thermal design power (TDP) of 125 W. Among the 4-core options, the FX-4350 provided a higher base clock of 4.2 GHz at a 125 W TDP.
Production of Vishera-based FX processors concluded around 2015, as AMD shifted focus to newer architectures and platforms without pursuing a process shrink to 22 nm. This marked the end of discrete high-end desktop CPU development on the AM3+ socket, with subsequent efforts emphasizing accelerated processing units (APUs).
Mobile and APU Processors
The Piledriver microarchitecture found its initial mobile implementation in the Trinity platform, launched in 2012, which integrated up to four Piledriver cores into APUs targeted at mainstream laptops. A representative model, the A10-4600M, featured two Piledriver modules for a total of four integer cores clocked at a 2.3 GHz base frequency with a 3.2 GHz turbo boost, while maintaining a 35 W TDP suitable for mobile thermal constraints.[17] These APUs emphasized balanced CPU and graphics performance for everyday computing and light gaming in portable devices.[18]
In 2013, AMD refreshed the mobile lineup with the Richland platform, retaining the Piledriver cores but incorporating optimizations for improved efficiency and graphics capabilities.[19] The flagship A10-5750M, for instance, delivered four Piledriver cores at 2.5 GHz base and up to 3.5 GHz turbo, paired with a 35 W TDP, enabling better multitasking and media playback in slim laptops compared to Trinity.[20] Other models like the A8-5550M (four cores at 2.1 GHz) and dual-core variants such as the A6-5350M targeted entry-level mobile segments, all built on the same 32 nm process.[21]
Piledriver-based mobile APUs used dedicated sockets for laptop integration, with both Trinity and Richland employing the FS1r2 package.[22] These sockets supported dual-channel DDR3L-1600 memory, prioritizing low-voltage operation for extended battery life in mobile environments without compromising bandwidth for integrated graphics.[23]
Central to these mobile APUs was the integrated Radeon graphics, leveraging a VLIW4 architecture with up to 384 shaders to deliver DirectX 11-compatible rendering for 1080p video and casual gaming. In Trinity, the Radeon HD 7660G in models like the A10-4600M operated at up to 686 MHz, providing higher pixel and texture throughput than prior generations.[24] Richland's Radeon HD 8650G refined this with clocks of up to 720 MHz and improved power gating, boosting frame rates in DirectX 11 titles by approximately 20% over Trinity equivalents.[19]
Server Processors
The AMD Opteron 4300 and 6300 series processors, released in 2012, represented the primary server implementations of the Piledriver microarchitecture, emphasizing scalability for multi-socket configurations in enterprise environments. The Opteron 4300 series, codenamed Seoul and fabricated on a 32 nm process, supported up to 8 cores (organized as 4 modules) with base clock speeds reaching 3.1 GHz and thermal design power (TDP) ratings up to 95 W, using the C32 socket for single- and dual-socket systems.[25] These processors featured up to two x16 HyperTransport 3.0 links operating at up to 6.4 GT/s, enabling efficient interconnectivity in power-constrained deployments such as cloud and web serving.[25]
In contrast, the Opteron 6300 series, codenamed Abu Dhabi and also built on 32 nm, scaled to up to 16 cores (8 modules) with base frequencies up to 2.8 GHz and TDP up to 140 W, designed for the G34 socket in multi-socket servers.[26] This series supported up to four sockets in a non-uniform memory access (NUMA) configuration, interconnected via multiple HyperTransport 3.0 links at speeds up to 6.4 GT/s per link, facilitating high-bandwidth data sharing across nodes.
The 4300 series integrated a dual-channel DDR3 memory controller and the 6300 series a quad-channel one, both with error-correcting code (ECC) support and reliability, availability, and serviceability (RAS) features such as single- and multi-bit error correction to protect data integrity in mission-critical applications.[27]
Piledriver-based Opteron processors were phased out by 2016, supplanted by the Steamroller microarchitecture in subsequent server products to address evolving demands for higher efficiency and integration.[28]
Development and Releases
Origins from Bulldozer
The Piledriver microarchitecture emerged as AMD's direct successor to the Bulldozer architecture, with early development focused on refining the modular core design to overcome Bulldozer's limited instructions per clock (IPC) gains, which had only achieved modest improvements over prior generations. Announced in late 2010 at AMD's Financial Analyst Day, Piledriver was positioned as an enhanced iteration, often referred to informally as "Bulldozer 2.0", aiming for an IPC uplift of approximately 10-15% through targeted optimizations while preserving the scalable module structure that allowed for efficient multi-core configurations across desktop, mobile, and server segments.[29][30] This approach sought to balance single-threaded performance enhancements with the multi-threaded parallelism that defined Bulldozer's philosophy, enabling better resource sharing without sacrificing overall system scalability.[31]
A core evolutionary decision in Piledriver's design was to retain the shared floating-point unit (FPU) per module, which continued to support wide 256-bit vector operations for workloads like multimedia and scientific computing, while emphasizing greater independence for the integer execution units to improve per-thread efficiency. Each module still housed two integer cores with dedicated schedulers and load/store units, allowing them to operate more autonomously from the shared front end and FPU than under Bulldozer's tighter coupling, thereby addressing bottlenecks in integer-heavy tasks without overhauling the overall module footprint.[5] This incremental refinement maintained compatibility with existing manufacturing processes and software ecosystems, prioritizing evolutionary stability over radical reconfiguration.
Development of Piledriver began with prototyping efforts in 2010, aligning with AMD's post-Bulldozer planning phase, and culminated in tape-out during late 2011 at GlobalFoundries' 32 nm silicon-on-insulator (SOI) facility. However, the project faced delays stemming from persistent yield challenges on the 32 nm SOI process, which GlobalFoundries struggled to ramp up efficiently, impacting production timelines and contributing to staggered releases in 2012.[32][33] These issues echoed broader manufacturing hurdles encountered with Bulldozer but were mitigated through process tweaks, ensuring Piledriver's viability on the same node.[34]
Major Product Launches
The first Piledriver-based products were the Trinity APUs, launched in May 2012 for mobile platforms as part of AMD's second-generation A-Series and marking the debut of Piledriver silicon.[35] Desktop variants of the Trinity APUs followed in October 2012, introducing the FM2 socket.[36] Also in October 2012, AMD released the Vishera FX-series desktop CPUs, refreshing the AM3+ platform with Piledriver cores for performance enthusiasts.[37]
The server segment received Piledriver in Q4 2012 with the Opteron 3300 (Delhi), 4300 (Seoul), and 6300 (Abu Dhabi) series processors, announced in late 2012 with availability starting in December and covering single- through quad-socket configurations for data centers and high-performance computing.[25][38]
AMD updated its APU lineup in June 2013 with the Richland series, incorporating minor optimizations to the Piledriver cores for improved efficiency in desktop and mobile A-series products.[39] The low-power Kabini and Temash APUs, launched on May 23, 2013 and fabricated on a 28 nm process for tablets and thin-and-light notebooks with integrated graphics, used the separate Jaguar core rather than Piledriver.[40] The FM2 socket, introduced alongside desktop Trinity APUs, let users upgrade to subsequent Richland processors without replacing the rest of the system.[41] The Godavari APUs that AMD released in 2015 for the FM2+ socket were a refresh of the Steamroller-based Kaveri line rather than of Piledriver. As of 2025, open-source firmware updates continue to support Piledriver platforms.[42]
Performance
Architectural Improvements
Piledriver introduced several architectural enhancements over the Bulldozer microarchitecture, primarily aimed at increasing instructions per cycle (IPC) through refined scheduling and fewer pipeline stalls. The design achieved an average IPC uplift of 10-15% in desktop workloads, attributed to optimizations in the integer pipeline, including a unified scheduler per core and physical register renaming, which allowed dispatch of up to four integer instructions per cycle compared to Bulldozer's more constrained approach.[43] These changes reduced resource conflicts within the shared module design, enabling more efficient execution of integer-heavy tasks and yielding approximately 20% better performance in SPECint benchmarks for representative integer workloads.
Branch prediction was refined with an augmented hybrid predictor featuring a second-level predictor, loop predictor, and way predictor, which together reduced mispredictions and pipeline flushes. The lower branch misprediction rate contributed to the IPC gains by letting the frontend sustain higher fetch and decode rates with fewer disruptions.[43]
In the floating-point domain, Piledriver improved AVX performance through dual 128-bit pipes supporting up to one 256-bit vector operation per cycle (by splitting it into two 128-bit micro-ops), new ISA extensions such as FMA3 for fused multiply-add, and scheduling enhancements over Bulldozer.[1][44]
Fabricated on the same 32 nm SOI process as Bulldozer, Piledriver benefited from manufacturing yield improvements that enabled higher clock speeds, with desktop variants reaching up to 4.2 GHz base and 4.7 GHz turbo, so that the architectural IPC gains were amplified in overall performance.[43]
Benchmark Comparisons
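As context for the floating-point results in this section, the peak throughput implied by the FPU arrangement described under Architectural Improvements (two 128-bit pipes per module with fused multiply-add) can be estimated with simple arithmetic. The sketch below is illustrative only: the function name `peak_gflops`, the four-module count, and the 4.0 GHz clock are assumptions chosen for the example, and it assumes both pipes can issue an FMA every cycle, which real workloads rarely sustain.

```python
def peak_gflops(modules, clock_ghz, pipes=2, pipe_bits=128,
                elem_bits=64, fma=True):
    """Theoretical peak GFLOPS for a module-based CPU (hypothetical helper).

    Assumes every pipe in every module retires one full-width vector
    operation per cycle; an FMA counts as two floating-point operations.
    """
    lanes = pipe_bits // elem_bits          # elements per 128-bit pipe
    flops_per_lane = 2 if fma else 1        # multiply + add when fused
    per_module_per_cycle = pipes * lanes * flops_per_lane
    return modules * per_module_per_cycle * clock_ghz


# Illustrative inputs (assumed): 4 modules at 4.0 GHz.
double_precision = peak_gflops(modules=4, clock_ghz=4.0)                # 128.0
single_precision = peak_gflops(modules=4, clock_ghz=4.0, elem_bits=32)  # 256.0
print(double_precision, single_precision)
```

Because the FPU is shared by the two integer cores in a module, this ceiling scales with the module count rather than the advertised core count, which is one reason floating-point-heavy results should be read per module.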
In benchmark evaluations, the Piledriver microarchitecture, exemplified by the AMD FX-8350, trailed Intel's Ivy Bridge Core i7-3770K by approximately 25% in single-threaded SPEC CPU2006 integer performance, reflecting ongoing challenges in instructions per clock (IPC) efficiency, though multi-threaded rates were more competitive owing to the eight-core configuration. The Cinebench R11.5 multi-core test demonstrated Piledriver's strengths in parallel workloads: an eight-core FX-8350 achieved scores matching a four-core Ivy Bridge i7-3770K, as the shared floating-point units in Piledriver modules scaled effectively in rendering despite lower per-core throughput.[10]
Gaming performance highlighted Piledriver's limitations in CPU-intensive scenarios. At equivalent clock speeds, the FX-8350 delivered 10-20% lower frame rates than the i7-3770K in titles such as Battlefield 3, where single-threaded work such as AI and physics calculations dominated.[45]
Power efficiency comparisons revealed the disadvantage of Piledriver's 32 nm process node against Intel's 22 nm Ivy Bridge, with the FX-8350 consuming roughly 1.5 times more energy per unit of performance in mixed workloads and running hotter under load.[46]
For server applications, AMD's Piledriver-based Opteron 6300 series processors underperformed Intel's Ivy Bridge-EP Xeon E5-2600 v2 by around 20-40% in high-performance computing benchmarks such as ANSYS Fluent simulations, but offered advantages in virtualized environments due to higher core counts.[47] As of 2025, open-source firmware updates have improved boot times to 15 seconds even with 256 GB RAM configurations, enhancing legacy server usability.[42]
References
- https://en.wikichip.org/wiki/amd/List_of_AMD_CPU_sockets
- https://en.wikichip.org/wiki/amd/packages/socket_fm2
