Project Denver
from Wikipedia
Nvidia Denver 1/2
General information
Launched: 2014 (Denver), 2016 (Denver 2)
Designed by: Nvidia
Cache
L1 cache: 192 KiB per core (128 KiB I-cache with parity, 64 KiB D-cache with ECC)
L2 cache: 2 MiB @ 2 cores
Architecture and classification
Technology node: 28 nm (Denver 1) to 16 nm (Denver 2)
Instruction set: ARMv8-A
Physical specifications
Cores: 2
Nvidia Carmel
General information
Launched: 2018
Designed by: Nvidia
Max. CPU clock rate: up to 2.3 GHz
Cache
L1 cache: 192 KiB per core (128 KiB I-cache with parity, 64 KiB D-cache with ECC)
L2 cache: 2 MiB @ 2 cores
L3 cache: 4 MiB @ 8 cores (T194)[1]
Architecture and classification
Technology node: 12 nm
Instruction set: ARMv8.2-A
Physical specifications
Cores: 2
For the Soviet HIV disinformation campaign, see Operation Denver.

Project Denver is the codename of a central processing unit designed by Nvidia that implements the ARMv8-A 64/32-bit instruction sets using a combination of a simple hardware decoder and software-based binary translation (dynamic recompilation), where "Denver's binary translation layer runs in software, at a lower level than the operating system, and stores commonly accessed, already optimized code sequences in a 128 MB cache stored in main memory".[2] Denver has a very wide in-order superscalar pipeline. Its design makes it suitable for integration with other SIP cores (e.g. GPU, display controller, DSP, image processor, etc.) on one die, constituting a system on a chip (SoC).
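The translate-once, reuse-from-cache scheme described above can be illustrated with a small sketch. Everything here — the mini guest "ISA", the dict-based cache, and all names — is an illustrative assumption, not Denver's actual translation format:

```python
# Toy sketch of a software binary-translation layer with an in-memory
# translation cache. The mini guest "ISA" and all names are assumptions.

class Translator:
    def __init__(self):
        self.cache = {}        # guest block address -> translated sequence
        self.translations = 0  # how many blocks were actually translated

    def translate_block(self, guest_ops):
        """Pretend to optimize guest ops into 'native' ops (here: strings)."""
        self.translations += 1
        return ["native_" + op for op in guest_ops]

    def run_block(self, addr, guest_ops):
        # Reuse the stored translation when this block was seen before,
        # mirroring the cache of already-optimized code sequences.
        if addr not in self.cache:
            self.cache[addr] = self.translate_block(guest_ops)
        return self.cache[addr]

t = Translator()
hot_loop = ["add", "cmp", "bne"]
for _ in range(1000):              # a hot loop body executes many times...
    t.run_block(0x4000, hot_loop)
print(t.translations)              # ...but is translated only once: 1
```

The point of the cache is exactly what the quoted description says: the cost of translation is paid once, and subsequent executions of the same code run the stored, optimized sequence.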

Project Denver is targeted at mobile computers, personal computers, servers, as well as supercomputers.[3] Denver cores have been integrated into Nvidia's Tegra SoC series. The original Denver core was designed for the 28 nm process node (Tegra model T132, aka "Tegra K1"); Denver 2 was an improved design built for the smaller, more efficient 16 nm node (Tegra model T186, aka "Tegra X2").

In 2018, Nvidia released an improved design codenamed "Carmel". Based on ARMv8.2-A (64-bit)[1], with 10-way superscalar execution, functional safety features, dual execution, and parity & ECC protection, Carmel was integrated into the Tegra Xavier SoC, which offers a total of 8 cores (4 dual-core pairs).[4][failed verification] The Carmel CPU core supports full Advanced SIMD (ARM NEON), VFP (Vector Floating Point), and ARMv8.2-FP16.[1] The first published third-party tests of Carmel cores, integrated in the Jetson AGX development kit, took place in September 2018 and indicated noticeably higher performance than predecessor systems, with the usual caveats that apply to quick tests of early hardware.[5] The Carmel design is found in the Tegra model T194 ("Tegra Xavier"), manufactured with a 12 nm structure size.

Overview

  • Pipelined in-order superscalar processor
  • 2-way decoder for ARM instructions
  • On-the-fly binary translation of ARM code into internal VLIW instructions by hardware translator, uses software emulation as fallback
  • Translation can reorder ARM instructions, and remove ones that do not contribute to the result[2]
  • Up to 7 micro-ops per clock cycle with translated VLIW instructions; cannot run simultaneously with ARM decoder
  • L1 cache: 128 KiB instruction + 64 KiB data per core (4-way set associative)
  • 2 MiB shared L2 cache between the two Denver cores (16-way set-associative)[6]
  • Denver also sets aside 128 MiB of main memory to store translated VLIW code; this part of memory is inaccessible to the main operating system.
  • Up to 2.5 GHz clockspeeds on TSMC 28 nm process[7]
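The bullet about removing instructions "that do not contribute to the result" describes classic dead-code elimination. A minimal sketch over a hypothetical three-address IR — the IR format and the simple backward-liveness rule are assumptions for illustration, not Denver's optimizer:

```python
# Backward liveness pass: keep an instruction only if its destination
# register is still needed later; otherwise drop it as dead code.

def eliminate_dead(code, live_out):
    """code: list of (dest, op, srcs); live_out: registers needed after."""
    live = set(live_out)
    kept = []
    for dest, op, srcs in reversed(code):
        if dest in live:             # result is used later: keep it
            kept.append((dest, op, srcs))
            live.discard(dest)
            live.update(srcs)        # its inputs become live in turn
        # otherwise the instruction is dead and is silently dropped
    return list(reversed(kept))

block = [
    ("r1", "mov", []),
    ("r2", "add", ["r1"]),
    ("r3", "mul", ["r1"]),   # r3 is never consumed: dead
    ("r0", "add", ["r2"]),
]
print(eliminate_dead(block, {"r0"}))   # the r3 instruction disappears
```

A real translator must also respect side effects (loads/stores, flags), which this toy pass ignores.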

Chips


A dual-core Denver CPU was paired with a Kepler-based GPU solution to form the Tegra K1; the dual-core 2.3 GHz Denver-based K1 was first used in the HTC Nexus 9 tablet, released November 3, 2014.[8][9] Note, however, that the quad-core Tegra K1, while using the same name, isn't based on Denver.

The Nvidia Tegra X2 has two Denver2 cores paired with four Cortex-A57 cores using a coherent HMP (Heterogeneous Multi-Processor Architecture) approach.[10] They are paired with a Pascal GPU.

The Tegra Xavier has a Volta GPU and several special-purpose accelerators. Its 8 Carmel CPU cores are divided into 4 ASIC macro blocks (each having 2 cores), connected to each other by a crossbar and sharing 4 MiB of L3 cache.

History


The existence of Project Denver was revealed at the 2011 Consumer Electronics Show.[11] In a March 4, 2011 Q&A article, CEO Jen-Hsun Huang revealed that Project Denver was a five-year, 64-bit ARMv8-A architecture CPU development on which hundreds of engineers had already worked for three and a half years, and which also offers 32-bit ARM instruction set (ARMv7) backward compatibility.[12] Project Denver was started at Stexar, a Colorado-based company, as an x86-compatible processor using binary translation, similar to projects by Transmeta. Stexar was acquired by Nvidia in 2006.[13][14][15]

According to Tom's Hardware, there are engineers from Intel, AMD, HP, Sun and Transmeta on the Denver team, and they have extensive experience designing superscalar CPUs with out-of-order execution, very long instruction words (VLIW) and simultaneous multithreading (SMT).[16]

According to Charlie Demerjian, the Project Denver CPU may internally translate the ARM instructions to an internal instruction set, using firmware in the CPU.[17] Also according to Demerjian, Project Denver was originally intended to support both ARM and x86 code using code morphing technology from Transmeta, but was changed to the ARMv8-A 64-bit instruction set because Nvidia could not obtain a license to Intel's patents.[17]

The first consumer device shipping with Denver CPU cores, Google's Nexus 9, was announced on October 15, 2014. The tablet was manufactured by HTC and features the dual-core Tegra K1 SoC. The Nexus 9 was the first 64-bit Android device available to consumers.[18]

from Grokipedia
Project Denver is the codename for a custom central processing unit (CPU) developed by Nvidia, implementing the ARMv8-A 64-bit and 32-bit instruction sets, and first realized in the Tegra K1-64 system-on-chip (SoC) in 2014. Announced on January 5, 2011, at the Consumer Electronics Show (CES) in Las Vegas, Project Denver represented NVIDIA's strategic initiative to design high-performance ARM-based CPU cores integrated with its graphics processing unit (GPU) technology on a single chip, targeting applications from personal computers and mobile devices to servers, workstations, and supercomputers. The project stemmed from a partnership with ARM, licensing the ARM Cortex-A15 core for initial Tegra mobile processors while developing proprietary cores to leverage ARM's energy-efficient architecture for the emerging "Internet Everywhere" era of computing. The microarchitecture features a dual-core, 7-wide superscalar, in-order design (with out-of-order-like performance achieved in software) fabricated on a 28-nm high-performance mobile (HPM) process by TSMC, with clock speeds reaching up to 2.5 GHz. Key innovations include Dynamic Code Optimization (DCO), which profiles and recompiles frequently executed ("hot") code regions at runtime to double performance by reducing instruction latency and enabling advanced optimizations like run-ahead execution and prefetching, mitigating cache-miss penalties by up to 60% in floating-point workloads. It supports both AArch64 (64-bit) and AArch32 (32-bit) modes, with a cache hierarchy comprising a 64 KB L1 data cache and a 128 KB L1 instruction cache per core, plus a shared 2 MB L2 cache, alongside seven execution units (two integer, two floating-point/NEON, two load/store, and one branch). In the Tegra K1-64, Denver delivered peak throughput exceeding seven instructions per cycle with DCO enabled, achieving 3x the double-precision floating-point performance of the Cortex-A15 and 87% higher MIPS per watt compared to competitors like the Qualcomm APQ8084 at similar power levels (around 4 W).
Power efficiency was further enhanced by features like the CC4 low-voltage retention state, allowing cores to maintain state at reduced voltage for quick resumption. Although Project Denver marked NVIDIA's entry into custom CPU design—initially conceptualized with x86 elements before pivoting to ARM due to licensing constraints—the architecture influenced subsequent NVIDIA efforts in mobile and server computing, underscoring the shift toward heterogeneous CPU-GPU integration.

Introduction

Overview

Project Denver is the codename for NVIDIA's custom central processing unit (CPU) core that implements the ARMv8-A instruction set architecture, supporting both 64-bit (AArch64) and 32-bit (AArch32) modes for full compatibility. The core purpose of Project Denver is to combine the energy efficiency characteristic of ARM processors—traditionally dominant in mobile devices—with the computational demands of personal computers and servers, achieved through tightly integrated CPU and GPU designs that leverage NVIDIA's expertise in parallel processing. This initiative targets a broad spectrum of applications, from tablets and personal computers to servers and supercomputers, enabling scalable performance across diverse computing environments. By developing its own ARM-compatible CPU, NVIDIA extends the ARM ecosystem beyond low-power mobile applications into high-performance computing, fostering innovation in heterogeneous CPU-GPU architectures.

Objectives and Scope

Project Denver was initiated by NVIDIA with the strategic objective of developing high-performance, energy-efficient central processing units (CPUs) based on the ARM architecture to challenge the dominance of x86 processors in personal computers, servers, and supercomputing environments. This effort aimed to leverage ARM's reduced instruction set computing (RISC) design principles to deliver superior power efficiency while maintaining competitive performance levels across diverse computing platforms. The scope of Project Denver extended beyond initial mobile system-on-chips (SoCs) in the Tegra series, evolving toward integrated hybrid CPU-GPU architectures intended for widespread adoption in both consumer and enterprise applications, including tablets, workstations, and cloud infrastructure. Through a strategic partnership with ARM, NVIDIA secured an architectural license to create fully custom CPU cores based on the ARMv8 architecture, enabling tailored optimizations for advanced computing needs. Anticipated benefits encompassed enhanced power efficiency to address the inefficiencies of traditional x86 systems, scalability for emerging workloads such as graphics processing and data analytics, and deep ecosystem integration with NVIDIA's parallel GPU technologies for accelerated computing. These features positioned Project Denver as a foundational step toward heterogeneous computing paradigms that combine general-purpose processing with specialized acceleration.

History

Origins and Announcement

Prior to the official launch of Project Denver, NVIDIA explored developing an x86-compatible CPU in the late 2000s, licensing Transmeta's technology—a RISC-based design intended for low-power translation of x86 instructions—to target server and PC markets. Rumors of this x86 development using Transmeta technology emerged around that time. This effort, which began quietly around 2007, marked NVIDIA's initial foray into general-purpose CPU design, aiming to leverage Transmeta's expertise in efficient x86 emulation for a competitive entry into the processor market. Due to legal challenges associated with x86 intellectual property, the project pivoted to the ARM architecture. On January 5, 2011, at the Consumer Electronics Show (CES) in Las Vegas, NVIDIA publicly announced Project Denver as an initiative to design custom high-performance ARM-based CPU cores, integrated with its GPUs on a single chip. The announcement highlighted NVIDIA's ambition to challenge x86 dominance in computing by harnessing ARM's low-power efficiency and open ecosystem for applications spanning personal computers, servers, workstations, and supercomputers. CEO Jen-Hsun Huang emphasized the project's role in enabling "Internet Everywhere" devices with advanced operating systems and capabilities. To support this endeavor, NVIDIA formed a dedicated CPU design group, building on its internal efforts, and secured an architecture license from ARM to develop proprietary cores based on future ARM instruction sets. This investment extended to the broader ARM ecosystem, including licensing the Cortex-A15 processor for initial integrations, positioning NVIDIA to innovate within ARM's growing influence in high-end computing.

Development Challenges and Architectural Shift

Following the 2011 announcement of Project Denver, NVIDIA encountered significant legal constraints stemming from its earlier licensing of Transmeta's x86 intellectual property, particularly the code-morphing technology designed for translating x86 code into a RISC instruction set. These issues, which arose amid broader x86 patent litigation in the industry, ultimately forced NVIDIA to abandon its original x86-based plans for the processor. As former Transmeta executive Dave Ditzel noted, "It originally started as an x86 but through certain legal issues, had to turn itself into an ARM CPU." Following the pivot from x86, Project Denver was publicly announced as ARM-based in 2011, with a commitment to the ARMv8 instruction set by 2012 to enable 64-bit compatibility while leveraging NVIDIA's expertise in GPU integration for heterogeneous computing. This redesign transformed Project Denver into a custom ARMv8-A CPU core, emphasizing dynamic code optimization (DCO) to bridge ARM's mobile heritage with server-grade performance needs. The transition presented notable technical challenges, as ARM was primarily optimized for low-power mobile applications, requiring adaptations for high-performance workloads. Key hurdles included managing power efficiency in a wide superscalar execution model, where traditional out-of-order designs incurred high energy costs and complexity; NVIDIA addressed this through DCO, which optimized hot code paths to deliver over seven micro-operations per cycle while reducing branch misprediction penalties by up to 37% compared to contemporary cores like the Cortex-A15. Design issues arose in balancing core performance with area and power budgets, particularly for integration with NVIDIA's GPU architectures, necessitating innovations like the CC4 retention state to lower voltage during short idle periods under 100 ms. Validation of the tightly coupled hardware-software system also proved complex, relying on extensive cosimulation to ensure reliability across AArch32 and AArch64 modes.

Design and Architecture

Microarchitecture Details

The Denver microarchitecture employs a dual-issue in-order pipeline as its core execution model, capable of natively dispatching up to two ARM instructions per cycle, while achieving out-of-order-like performance through dynamic code optimization (DCO) that translates and optimizes guest ARM code into native micro-operations for superscalar execution. This DCO mechanism simulates out-of-order execution by enabling register renaming, loop unrolling, load hoisting, and redundancy elimination in translated code blocks, stored in a dedicated optimization cache to boost throughput beyond the hardware's in-order limitations. The design supports the full ARMv8-A instruction set architecture, including AArch64 for 64-bit addressing, AArch32 compatibility mode, and extensions for virtualization, cryptography, and advanced SIMD (NEON). The integer pipeline comprises 15 stages, structured to minimize load-use dependencies through a skewed design that delays reads by three cycles after L1 data cache access, facilitating efficient load-ALU-store bundling and intrabundle forwarding. Branch misprediction incurs a 13-cycle penalty, addressed by an advanced predictor incorporating a global history buffer, target buffer, return address stack, and indirect target predictor, which achieves up to 37% lower mispredict rates compared to contemporary cores like the Cortex-A15. The execution backend features seven wide superscalar units, including two integer ALUs (one with multiplier support), two 128-bit FP/NEON units, two load/store units, and a dedicated branch unit, enabling peak dispatch of seven micro-operations per cycle under DCO. Cache hierarchies are configured for balanced latency and capacity in power-constrained environments, with a 128 KB four-way set-associative L1 instruction cache, a 64 KB four-way L1 data cache (three-cycle load-to-use latency), and a shared 2 MB 16-way L2 cache per dual-core cluster (18-cycle latency).
Translation lookaside buffers include a 128-entry four-way I-TLB, a 256-entry eight-way D-TLB supporting multiple page sizes, and a 2048-entry L2 TLB, complemented by a hardware prefetcher tracking up to 32 streams to mitigate misses in irregular access patterns. The initial implementation targeted the 28 nm HPM process node, with clock speeds ranging from 1 GHz in low-power modes up to 2.5 GHz for peak performance.
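Given the L1 data cache geometry quoted above (64 KB, four-way set-associative), the set/tag breakdown of an address follows directly from arithmetic. The 64-byte line size below is an assumption for illustration — the section does not state Denver's actual line size:

```python
# Decompose an address into tag / set index / line offset for a
# 64 KiB, 4-way set-associative cache. Line size of 64 B is assumed.

LINE = 64                      # assumed bytes per cache line
WAYS = 4
SIZE = 64 * 1024               # 64 KiB L1 D-cache, as stated above
SETS = SIZE // (WAYS * LINE)   # 64 KiB / (4 ways * 64 B) = 256 sets

def decompose(addr):
    offset = addr % LINE               # byte within the line
    index = (addr // LINE) % SETS      # which set the line maps to
    tag = addr // (LINE * SETS)        # remaining high bits
    return tag, index, offset

print(SETS)                    # 256 sets under these assumptions
print(decompose(0x12345678))
```

Two addresses with the same index compete for the same four ways, which is why associativity (four here, sixteen for the L2) matters for conflict misses.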

Key Innovations and Features

Project Denver introduced several innovative features that extended beyond the standard ARMv8 architecture, focusing on performance optimization, power efficiency, and security tailored for NVIDIA's SoCs. A key innovation was its dynamic code optimization (DCO) mechanism, which employed a just-in-time (JIT) compiler to translate and optimize frequently executed ARM code regions on-the-fly. This approach identified "hot" code paths during runtime, recompiling them into more efficient micro-operations that reduced branch mispredictions and instruction redundancies, achieving up to 7 micro-operations per cycle in optimized workloads. The CPU-GPU synergy in Project Denver represented a significant advancement in heterogeneous computing, with the Denver cores tightly integrated alongside the Kepler GPU within the Tegra K1-64 SoC. This on-chip architecture facilitated low-latency data sharing and unified memory access, enabling seamless task offloading between the CPU and GPU for compute-intensive applications like graphics rendering and parallel processing. By leveraging a unified memory architecture, the design supported direct CPU-to-GPU communication without external interfaces, enhancing overall system throughput in mobile and embedded scenarios. Power efficiency was another key focus, incorporating adaptive voltage scaling and fine-grained clock gating optimized for battery-powered devices. The adaptive voltage scaling dynamically adjusted supply voltages based on workload demands, entering low-power states like CC4 during idle periods to minimize leakage while maintaining quick resumption. Complementing this, fine-grained clock gating disabled clocks to inactive pipeline stages and peripherals, achieving linear power scaling and 87% higher MIPS per watt compared to the Qualcomm APQ8084 at similar power levels. Security extensions in Project Denver built upon ARM TrustZone by integrating NVIDIA-specific hardware root of trust mechanisms. This included secure boot processes rooted in immutable code and fused keys, ensuring authenticated code execution within isolated TrustZone environments to protect sensitive operations from software attacks.
The hardware root of trust also protected optimized code regions against changes due to coherent I/O or CPU traffic, providing a robust foundation for secure computing in Tegra-based systems.
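The hot-path detection at the heart of DCO-style JIT recompilation can be sketched with simple execution counters. The threshold value, region granularity, and all names below are illustrative assumptions, not NVIDIA's actual heuristics:

```python
# Minimal sketch of hot-region profiling: count executions per code
# region and hand the region to an "optimizer" once a threshold is hit.

HOT_THRESHOLD = 10             # assumed; real thresholds are tuned

class Profiler:
    def __init__(self):
        self.counts = {}       # region -> execution count
        self.optimized = set() # regions already recompiled

    def execute(self, region):
        self.counts[region] = self.counts.get(region, 0) + 1
        if (region not in self.optimized
                and self.counts[region] >= HOT_THRESHOLD):
            self.optimized.add(region)   # hand off to the optimizer
        return "optimized" if region in self.optimized else "interpreted"

p = Profiler()
modes = [p.execute("loop_a") for _ in range(12)]
print(modes[0], modes[-1])     # starts interpreted, ends optimized
```

Cold code never pays the recompilation cost, while hot loops quickly migrate to the optimized path — the trade-off the DCO description above relies on.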

Implementations

Tegra K1 Integration

The Tegra K1-64 represented the inaugural commercial integration of Project Denver cores into NVIDIA's mobile system-on-chip lineup, announced in January 2014 alongside the broader Tegra K1 family at CES. This 64-bit variant featured NVIDIA's custom-designed CPU architecture, marking a shift from off-the-shelf cores to in-house development for enhanced performance in mobile devices. Architectural details of the Denver integration were further elaborated in August 2014, highlighting its dynamic code optimization and superscalar design for superior single-threaded efficiency. The chip began shipping in consumer devices later that year, with the Nexus 9 tablet serving as the flagship example, announced in October 2014. At its core, the Tegra K1-64 employed a dual-core configuration clocked up to 2.5 GHz, paired with a 192-core Kepler GPU derived from NVIDIA's desktop graphics architecture to deliver PC-level rendering capabilities in a compact form. This setup supported advanced features like DirectX 11 and OpenGL 4.4, enabling high-fidelity gaming and rendering on mobile platforms. Manufactured on TSMC's 28 nm HPM process, the SoC maintained a low-power envelope of approximately 5–10 W, optimized for battery-constrained environments while balancing compute demands. The cores, building on the microarchitecture detailed above, provided a 64-bit ARMv8 execution model with 7-way superscalar pipelines for improved instruction throughput. Beyond tablets, the Tegra K1-64 found applications in gaming handhelds and early Android ecosystems, powering immersive experiences in Android gaming devices. In automotive applications, it enabled advanced visual computing modules for in-vehicle systems, supporting Android-based interfaces, navigation, and multimedia rendering. These deployments underscored the chip's versatility in delivering high-performance graphics and processing within power-sensitive, embedded scenarios.

Project Denver 2 and Later Iterations

Following the initial implementation in the Tegra K1, NVIDIA developed Project Denver 2 as an enhanced iteration of its custom ARMv8-compatible CPU core, aimed at delivering superior single-threaded performance through advanced dynamic code optimization techniques. This second-generation design incorporated improvements to the original Denver's in-order pipeline, enabling higher instructions-per-cycle (IPC) rates—up to 7 micro-operations per cycle in optimized scenarios—while maintaining compatibility with ARMv8-A instruction sets. The core featured a wider execution engine and refined branch prediction mechanisms, including a global history buffer and return stack buffer, to reduce misprediction penalties and boost overall efficiency. Announced as part of NVIDIA's 2015 roadmap during the Tegra X1 unveiling at CES, Denver 2 was initially planned for integration into the X1 SoC to provide out-of-order-like performance via binary translation and dynamic code optimization, targeting mobile and embedded applications with enhanced power efficiency on the 20 nm process. However, due to development timelines and a strategic "tick-tock" approach prioritizing rapid market entry with proven IP, NVIDIA opted to replace Denver 2 with off-the-shelf ARM cores (four high-performance Cortex-A57 and four efficiency Cortex-A53 cores) in the final X1 design released later that year. This shift allowed the X1 to achieve broad adoption in devices like the Shield TV and Google Pixel C, while deferring custom core deployment. Denver 2 ultimately debuted in 2016 within the Tegra X2 (codenamed Parker) SoC, fabricated on TSMC's 16 nm process, where it paired two Denver 2 cores with four Cortex-A57 cores in a heterogeneous big.LITTLE configuration alongside a 256-core Pascal GPU. This integration powered automotive and AI platforms such as the Drive PX 2 and Jetson TX2, delivering up to 1.5 times the CPU performance of the Tegra X1 while emphasizing perf/watt gains for embedded workloads.
Beyond mobile SoCs, NVIDIA explored Project Denver variants for server and data center use cases around 2014–2015, envisioning high-performance ARM-based processors to compete in data center and HPC environments with superior energy efficiency over x86 alternatives. These efforts, building on the original Denver's microarchitecture, were ultimately shelved amid shifting priorities toward GPU-accelerated computing and partnerships with ARM licensees. The experiences from Project Denver iterations informed NVIDIA's later custom CPU developments, notably the Grace CPU Superchip announced in 2021, which employs Arm-based cores optimized for data center workloads, achieving up to 10 times the performance of contemporary server CPUs in AI and HPC scenarios through high-bandwidth interconnects and scalable coherency. This marked a revival of NVIDIA's in-house CPU ambitions, leveraging lessons in dynamic optimization and ARM ecosystem integration from the Denver lineage.

Impact and Legacy

Performance Evaluations

The Tegra K1 implementation of Project Denver, featuring dual 64-bit cores clocked up to 2.5 GHz, delivered competitive CPU performance in synthetic benchmarks suitable for mobile devices. In Geekbench 3 tests on devices like the Nexus 9, it recorded single-core scores of approximately 1,900 points, placing it on par with low-end Intel Core i3 processors such as the 4th-generation mobile variants in single-threaded workloads. Multi-core scores reached around 3,000 points, benefiting from the cores' high clock speeds despite the dual-core configuration. These results highlighted Denver's focus on single-thread efficiency over multi-thread parallelism compared to quad-core contemporaries. Efficiency evaluations underscored Project Denver's advantages in power-constrained mobile scenarios, particularly when integrated with the K1's Kepler-based GPU. NVIDIA reported that the GPU provided 1.5 times the performance of competing mobile graphics solutions, enabling up to twice the efficiency in graphics-intensive tasks like rendering relative to x86 equivalents in similar power envelopes. CPU power consumption under load typically ranged from 4–6 W, supporting extended battery life in tablets while outperforming rivals like the Cortex-A15 in floating-point operations by up to 3x per core. In real-world applications, the Nexus 9 with Denver cores excelled in Android gaming and early 64-bit software. Demanding titles achieved frame rates exceeding 50 fps at high resolutions, while 64-bit apps ran smoothly with reduced latency compared to 32-bit counterparts. This performance extended to multimedia tasks, including 4K video decoding at 30 fps, demonstrating practical viability for gaming handhelds and tablets. Limitations emerged in sustained workloads, where thermal throttling could occur to maintain temperatures below 90°C, potentially reducing clock speeds after prolonged use in compact form factors.
Additionally, Denver's in-order execution pipeline resulted in lower instructions per cycle (IPC) than the out-of-order Cortex-A15 in select integer-heavy tasks, such as certain database operations, despite overall higher clock-for-clock gains in other areas.
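The efficiency comparisons above reduce to simple performance-per-watt arithmetic. The sketch below only re-derives the relationship between the quoted figures; the concrete numbers are placeholders, not independent measurements:

```python
# Performance-per-watt comparison: at a similar power level, an 87%
# perf/W advantage is the same as an 87% raw-performance advantage.

def perf_per_watt(perf, watts):
    return perf / watts

# Placeholder scores at the ~4 W power point cited in the text.
baseline = perf_per_watt(100.0, 4.0)   # reference chip
denver = perf_per_watt(187.0, 4.0)     # 87% higher score, same power

print(round(denver / baseline - 1, 2))  # relative perf/W advantage: 0.87
```

The same arithmetic explains why perf/W claims only translate to battery-life claims when the power envelopes being compared are actually similar.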

Discontinuation and Industry Influence

In the mid-2010s, NVIDIA discontinued further development of custom Denver cores, primarily due to the high complexity and extended timelines associated with in-house CPU design, opting instead for off-the-shelf ARM Cortex cores to expedite product releases. This shift became evident with the Tegra X1 SoC in 2015, which employed Cortex-A57 and Cortex-A53 cores rather than Denver derivatives, allowing faster integration into mobile and embedded devices. Intense market competition exacerbated this decision, as Qualcomm's Snapdragon series dominated Android devices with optimized, volume-produced SoCs, while Apple's custom A-series chips set performance benchmarks in their ecosystems, marginalizing NVIDIA's Tegra lineup. The Tegra X2 remained the final major implementation featuring Denver cores. Despite its discontinuation, Project Denver exerted significant influence on the broader ARM ecosystem by pioneering high-performance, custom CPU designs targeted at servers and supercomputers, which helped catalyze industry-wide interest in ARM-based solutions. This early demonstration of ARM's viability for demanding workloads contributed to the momentum behind server-grade ARM adoption, exemplified by later server processors that leverage custom ARM cores for efficiency. Within NVIDIA, the project laid foundational expertise that paved the way for subsequent Arm-based innovations, including the Grace CPU superchip and its integration with Hopper GPUs for AI and HPC. On the mobile front, Project Denver accelerated the transition to 64-bit architectures, with the K1's Denver CPU enabling the first 64-bit ARM processor in Android devices by late 2014, prompting Google to prioritize 64-bit support in Android 5.0 Lollipop and influencing ecosystem-wide upgrades. As of 2025, Project Denver's legacy endures in NVIDIA's AI server CPUs, such as the Vera CPU, which reintroduces custom ARM cores for enhanced performance in data centers, though without reviving the Denver architecture directly.

References

  1. https://en.wikichip.org/wiki/nvidia/microarchitectures/denver