ARM Cortex-A9
from Wikipedia

ARM Cortex-A9
MediaTek M6575
General information
Launched: 2007
Designed by: ARM Holdings
Performance
Max. CPU clock rate: 0.8 GHz to 2 GHz
Physical specifications
Cores: 1–4
Cache
L1 cache: 32 KB I, 32 KB D
L2 cache: 128 KB–8 MB (configurable with L2sr1 cache controller)
Architecture and classification
Instruction set: ARMv7-A
History
Predecessor: ARM Cortex-A8
Successor: ARM Cortex-A12

The ARM Cortex-A9 MPCore is a 32-bit multi-core processor that provides up to four cache-coherent cores, each implementing the ARMv7 architecture instruction set.[1] It was introduced in 2007.[2]

Features

Key features of the Cortex-A9 core are:[3]

  • Out-of-order speculative issue superscalar execution 8-stage[4] pipeline giving 2.50 DMIPS/MHz/core.
  • NEON SIMD instruction set extension performing up to 16 operations per instruction (optional).
  • High performance VFPv3 floating point unit doubling the performance of previous ARM FPUs (optional).
  • Thumb-2 instruction set encoding reduces the size of programs with little impact on performance.
  • TrustZone security extensions.
  • Jazelle DBX support for Java execution.
  • Jazelle RCT for JIT compilation.
  • Program Trace Macrocell and CoreSight Design Kit for non-intrusive tracing of instruction execution.
  • L2 cache controller (0–4 MB).
  • Multi-core processing.

ARM states that the TSMC 40G hard macro implementation typically operates at 2 GHz; a single core (excluding caches) occupies less than 1.5 mm² when designed in a TSMC 65 nanometer (nm) generic process[5] and can be clocked at speeds over 1 GHz, consuming less than 250 mW per core.[2]

Chips

Several system on a chip (SoC) devices implement the Cortex-A9 core, including:

Systems on a chip

Developer | Name | Cores | Process | NEON SIMD | Vector floating point unit | GPU
Altera | SoC FPGA | 1–2 | 28 nm | Yes | VFPv3 | optionally implemented in FPGA; TES Electronic Solutions D/AVE HD
Ambarella Inc. | S3L | 1 | 28 nm | Yes | VFPv3 |
AMLogic | AML8726-M | 1 | 65 nm | Yes | VFPv3 | ARM Mali-400
AMLogic | AML8726-MX | 2 | 40 nm | Yes | VFPv3 | ARM Mali-400 MP2
AMLogic | AML8726-M8 | 4 | 28 nm | Yes | VFPv3 | ARM Mali-450 MP6
Apple Inc. | A5 | 2 | 32 nm, 45 nm | Yes | VFPv3 | PowerVR SGX543MP2
Apple Inc. | A5X | 2 | 45 nm | Yes | VFPv3 | PowerVR SGX543MP4
Broadcom | BCM11311 (Persona ICE) | 2 | 40 nm | ? | ? | Broadcom Videocore IV
Broadcom | BCM21654 | 1 | 40 nm | Yes | VFPv3 | Broadcom Videocore IV
Broadcom | BCM21664T | 2 | 40 nm | Yes | VFPv3 | Broadcom Videocore IV
Calxeda | EnergyCore ECX-1000[9] | 4 | 40 nm | Yes | VFPv3 |
ELVEES | Multicore 1892VM14Ya | 2 | 40 nm | Yes | VFPv3 | ARM Mali-300
Freescale Semiconductor | i.MX6[31] | 1–4 | 40 nm | Yes | VFPv3-D32 | Vivante Corporation GPU IP cores[32]
HiSilicon | K3V2 (Hi3620) | 4 | 40 nm | Yes | VFPv3 | Vivante GC4000
Intel | Cyclone V | 1–2 | 28 nm | Yes | VFPv3 |
LG Corp | LG L9 | 2 | ? | ? | ? | ARM Mali-400 MP4
Marvell | ARMADA 38x | 1–2 | 28 nm | Yes | VFPv3 |
Marvell | PXA986 | 2 | 45 nm | Yes | VFPv3 | PowerVR SGX540 / Vivante GC1000 (Galaxy Tab 3 7-inch)
Marvell | PXA988 | 2 | 45 nm | Yes | VFPv3 | Vivante GC1000
MediaTek | MT6575 | 1 | 40 nm | Yes | VFPv3 | PowerVR SGX531[14]
MediaTek | MT6577 | 2 | 40 nm | Yes | VFPv3 | PowerVR SGX531[15]
Mindspeed Technologies | Comcerto 2000 | 2 | ? | Yes | ? |
Nufront | NuSmart™ 2816 (NS2816) | 2 | ? | Yes | VFPv3 | ARM Mali-400[33]
Nufront | NuSmart™ 2816M (NS2816M) | 2 | ? | Yes | VFPv3 | ARM Mali-400
Nufront | NuSmart™ 115 (NS115) | 2 | ? | Yes | VFPv3 | ARM Mali-400
Nvidia | Tegra 2 series | 2 | 40 nm | No | VFPv3-D16 | GeForce ULP
Nvidia | Tegra 3 (Kal-El) series | 4 | 40 nm | Yes | VFPv3 | GeForce ULP
Renesas Electronics | [1] | ? | ? | ? | ? |
Renesas Electronics | RZ/A1H[34] | 1 | various | Yes | VFPv3 | WXGA 2D graphics 10MByte RAM SoC
Renesas Electronics | RZ/A1M[34] | 1 | various | Yes | VFPv3 | WXGA 2D graphics 5MByte RAM SoC
Renesas Electronics | RZ/A1L[34] | 1 | various | Yes | VFPv3 | WXGA 2D graphics 3MByte RAM SoC
Renesas Electronics | RZ/A1LU[34] | 1 | various | Yes | VFPv3 | RZ/A1L plus Ethernet AVB support and a JPEG codec unit, 3MByte RAM SoC
Rockchip | RK2928 | 1 | 40 nm | ? | ? | ARM Mali-400
Rockchip | RK3066[22] | 2 | 40 nm | Yes | VFPv3 | ARM Mali-400 MP4
Rockchip | RK3128 | 2 | ? | Yes | VFPv3 | ARM Mali-400 MP4
Rockchip | RK3188[35] | 4 | 28 nm | Yes | VFPv3 | ARM Mali-400 MP4
Samsung | Exynos 4 Dual (4210) | 2 | 45 nm | Yes | VFPv3 | ARM Mali-400 MP4
Samsung | Exynos 4 Dual (4212) | 2 | 32 nm | Yes | VFPv3 | ARM Mali-400 MP4
Samsung | Exynos 4 Quad (4412) | 4 | 32 nm | Yes | VFPv3 | ARM Mali-400 MP4
Samsung | Exynos 4 Quad (4415) | 4 | 28 nm | Yes | VFPv3 | ARM Mali-400 MP4
STMicroelectronics | SPEAr1310 | ? | ? | No | VFPv3 |
STMicroelectronics | SPEAr1340 | 2 | ? | No | VFPv3-D16 | ARM Mali-200[36]
ST-Ericsson | Nova A9500 | 2 | 45 nm | Yes | VFPv3 | ARM Mali-400
ST-Ericsson | NovaThor U8500 | 2 | 45 nm | Yes | VFPv3 | ARM Mali-400
ST-Ericsson | NovaThor U9500 | 2 | 45 nm | Yes | VFPv3 | ARM Mali-400
Sony | PlayStation Vita | 4 | 40 nm | Yes | VFPv3 | PowerVR SGX543MP4+
Texas Instruments | Sitara AM437x | 1 | 45 nm | Yes | VFPv3 | SGX530 Graphics Engine
Texas Instruments | OMAP4430, OMAP4460 | 2 | 45 nm | Yes | VFPv3 | PowerVR SGX540
Texas Instruments | OMAP4470 | 2 | 45 nm | Yes | VFPv3 | PowerVR SGX544
Trident Microsystems | PNX8473[37] | 1 | ? | ? | ? | PowerVR SGX531
Trident Microsystems | PNX8483[38] | 1 | ? | ? | ? | PowerVR SGX531
Trident Microsystems | PNX8491[39] | 1 | ? | ? | ? | PowerVR SGX531
WonderMedia | WM8850 | 1 | 40 nm | Yes | VFPv3 | ARM Mali-400
WonderMedia | WM8880 | 2 | 40 nm | ? | ? | ARM Mali-400 MP2
WonderMedia | WM8950 | 1 | 40 nm | ? | ? | ARM Mali-400[28]
WonderMedia | WM8980 | 2 | 40 nm | ? | ? | ARM Mali-400 MP2
Xilinx | Zynq-7000[40] | 2 | 28 nm | Yes | VFPv3 |
ZiiLABS | ZMS-20 | ? | ? | Yes | VFPv3 | ZiiLABS flexible Stemcell media processing

from Grokipedia
The ARM Cortex-A9 is a high-performance, power-efficient 32-bit processor core developed by ARM, implementing the ARMv7-A architecture and designed for embedded applications in low-power, thermally constrained, and cost-sensitive devices. Introduced on March 31, 2008, with its initial revision (r0p0), it supports the ARM, Thumb, and Thumb-2 instruction sets and can be deployed in single-core or multi-core configurations.

Key features of the Cortex-A9 include a dual-issue, partially out-of-order 8-stage superscalar pipeline for enhanced instruction throughput, dynamic branch prediction, and configurable L1 caches of 16 KB, 32 KB, or 64 KB per core, with support for an optional unified L2 cache of up to 8 MB. It incorporates the ARMv7 memory management unit (MMU) for virtual memory handling, TrustZone security extensions for protected execution environments, and optional NEON Advanced SIMD and Vector Floating-Point (VFPv3) units for SIMD and floating-point acceleration. The multiprocessor variant, known as Cortex-A9 MPCore, scales to four cores with cache coherency via the Accelerator Coherency Port (ACP) and a Snoop Control Unit (SCU), supporting symmetric multiprocessing (SMP) in systems that require parallel performance.

In terms of performance, the Cortex-A9 delivers over 50% improvement in single-core efficiency compared to its predecessor, the Cortex-A8, while maintaining low power consumption suitable for battery-operated devices; it also integrates CoreSight components for debug and trace. Widely deployed since its launch, the core powers applications in smartphones, digital TVs, and enterprise systems, with notable implementations in SoCs such as the Tegra 2, SPEAr1300, and OMAP4. Its maturity and its availability as either speed-optimized or power-optimized IP made it a foundational choice for ARM-based systems-on-chip (SoCs) in the late 2000s and early 2010s.

Introduction and History

Development Timeline

The Cortex-A9 was developed by ARM as part of the ARMv7-A architecture family, succeeding the single-core Cortex-A8 and emphasizing multi-core scalability to address increasing performance needs in mobile devices. ARM officially announced the Cortex-A9 single-core and MPCore multi-core processors on October 8, 2007, at the ARM Developers' Conference, highlighting their support for up to four cache-coherent cores based on the ARMv7 instruction set. The initial processor release occurred in 2008, with first silicon samples becoming available in late 2009; an early demonstration of a multiprocessing implementation took place at a private event in February 2009. Commercial availability began in 2010, as volume shipments of Cortex-A9-based silicon entered multiple market segments, including smartphones and embedded systems, with partners such as ST-Ericsson enabling rapid adoption through early implementations like the U8500 platform.

Position in ARM Portfolio

The ARM Cortex-A9 serves as a high-performance, out-of-order processor core within the ARMv7-A architecture profile, designed for applications processors in devices that need strong computational capability at low power. It introduced partial out-of-order execution to the ARM portfolio, a significant advance over its predecessor, the Cortex-A8, which relied on an in-order pipeline and was limited to single-core implementations. The Cortex-A9 added multi-core configurations, paving the way for its successor, the Cortex-A15, which further refined out-of-order processing with wider superscalar capabilities for higher performance demands.

Targeted at smartphones, tablets, and embedded systems, the Cortex-A9 balanced high performance with low power consumption, making it suitable for thermally constrained and cost-sensitive environments where multimedia and general-purpose computing were key. Within the broader ARMv7-A family, it was positioned above lower-power options such as the Cortex-A5, optimized for minimal area and energy use in basic embedded tasks, and the Cortex-A7, which targeted entry-level devices with performance comparable to the A9 in a smaller footprint.

ARM offered the Cortex-A9 under a flexible licensing model: as synthesizable IP (a soft core) in RTL form for custom integration across various process nodes, or as pre-optimized hard macros tailored to specific processes to shorten time-to-market. This approach enabled scalability, including dual-core configurations, to meet diverse system requirements without overhauling the core design.

Core Architecture

Processor Microarchitecture

The ARM Cortex-A9 processor employs an out-of-order superscalar microarchitecture to deliver high performance in embedded and mobile applications, implementing the ARMv7-A architecture with support for the Thumb-2 instruction set for efficient code density. The design incorporates dynamic scheduling, allowing instructions to execute out of program order when dependencies permit, which improves resource utilization and reduces stalls in the execution pipeline. The integer pipeline consists of up to 8 stages, balancing performance against the power and area constraints typical of ARM's application processors.

A key aspect of the design is dual issue: up to two instructions can be dispatched per cycle from a variable-length decoder that processes the mixed 16- and 32-bit Thumb-2 encodings. This partially out-of-order model applies primarily to execution, with load/store operations also benefiting from dynamic reordering to overlap memory accesses. Branch prediction uses a two-level dynamic predictor built around a configurable Global History Buffer (GHB) with 1024, 2048, 4096, 8192, or 16384 entries, together with a Branch Target Address Cache (BTAC) and a return stack, to reduce misprediction penalties.

The core can be configured as a single processor or in multi-core setups, such as the dual-core variant of the Cortex-A9 MPCore, where coherence between cores is maintained over the AMBA AXI interconnect. This flexibility lets designers tailor the processor to varying performance needs while integrating with AMBA-based system buses for instruction, data, and peripheral access.
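To make the idea of a two-level, global-history predictor more concrete, the following is a minimal Python sketch. It is a textbook-style toy, not the Cortex-A9's actual predictor: the class name, the 2-bit counter initial value, and the way history indexes the table are all assumptions chosen for illustration. Only the table sizes (powers of two such as 1024) echo the GHB sizes quoted above.

```python
class GlobalHistoryPredictor:
    """Toy two-level predictor: a global history register indexes 2-bit counters.

    Hypothetical illustration only; the real GHB indexing scheme is not
    described in the text above.
    """

    def __init__(self, entries=1024):
        # entries must be a power of two so the history can be masked into an index
        assert entries > 0 and entries & (entries - 1) == 0
        self.mask = entries - 1
        self.history = 0               # recent outcomes, newest in bit 0
        self.counters = [1] * entries  # 2-bit counters, start "weakly not taken"

    def predict(self):
        # Counter values 2 and 3 mean "predict taken"
        return self.counters[self.history & self.mask] >= 2

    def update(self, taken):
        idx = self.history & self.mask
        if taken:
            self.counters[idx] = min(3, self.counters[idx] + 1)
        else:
            self.counters[idx] = max(0, self.counters[idx] - 1)
        # Shift the actual outcome into the global history
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor, outcomes):
    """Fraction of branch outcomes predicted correctly, updating as we go."""
    hits = 0
    for taken in outcomes:
        hits += predictor.predict() == taken
        predictor.update(taken)
    return hits / len(outcomes)
```

Because the table is indexed by history rather than by a single counter, a strictly alternating taken/not-taken branch, which defeats a lone 2-bit counter, becomes predictable after a short warm-up.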

Pipeline and Execution Units

The ARM Cortex-A9 features an 8-stage integer pipeline designed for out-of-order execution, enabling superscalar processing with up to two instructions issued per cycle in optimal conditions. The stages consist of fetch, where instructions are retrieved from the instruction cache; decode, which can process up to two instructions simultaneously; rename, which applies register renaming to handle dependencies; dispatch, allocating instructions to appropriate queues; issue, scheduling ready instructions to execution units; execute, performing the computations; writeback, returning results to the register file; and retire, committing instructions in program order while handling exceptions. This structure supports speculative execution to minimize stalls from branches and dependencies.

The execution units include two integer arithmetic logic units (ALUs) for address calculations and general-purpose operations, a dedicated multiply-accumulate (MAC) unit, and a load/store unit capable of one load and one store operation per cycle. Together these units allow up to four instructions to be in execution in a cycle (two ALU operations, one memory access, and one branch), which raises throughput in integer workloads.

Floating-point operations are supported through an integrated VFPv3 unit, which has a separate pipeline for scalar floating-point instructions compliant with IEEE 754. The VFPv3 unit achieves one double-precision fused multiply-accumulate (FMA) operation every two cycles and supports both single- and double-precision arithmetic.

In multi-core configurations, the Snoop Control Unit (SCU) manages cache coherency by implementing a snooping protocol that keeps data consistent across up to four cores through directed snoop requests and responses. Power efficiency is improved through clock gating, which disables clocks to inactive pipeline stages and units, and power gating, which allows individual cores to enter low-power states, alongside support for dynamic voltage and frequency scaling.

Memory Hierarchy

The ARM Cortex-A9 processor features a multi-level memory hierarchy optimized for high-performance embedded applications, comprising Level 1 (L1) caches tightly integrated with the core, an optional external Level 2 (L2) unified cache, a two-level translation lookaside buffer (TLB) for address translation, and a memory management unit (MMU) for virtual memory support. This design balances low-latency access with scalability in single- and multi-core configurations under the ARMv7-A architecture.

The L1 caches are Harvard-style, with separate instruction and data caches that are each configurable to 16 KB, 32 KB, or 64 KB. Both are 4-way set-associative with 32-byte cache lines, enabling efficient prefetching and branch target buffering integration. The data cache operates in write-back mode to minimize bus traffic and supports write-allocate policies for cacheable regions.

The L2 cache is a unified, external structure implemented via the ARM PrimeCell PL310 controller, configurable from 128 KB to 8 MB in 128 KB increments and typically organized as 16-way set-associative. It connects to the core through dedicated AXI master interfaces, providing shared access in multi-core setups and supporting exclusive caching modes to avoid duplication between the L1 and L2 levels.

The TLB uses a two-level structure to reduce MMU lookup overhead. The first level consists of separate micro-TLBs: a 32-entry fully associative data micro-TLB and a configurable 32- or 64-entry instruction micro-TLB. The second-level main TLB is unified for instruction and data accesses, implemented as a configurable 2-way set-associative structure of 64 to 512 entries plus four fully associative lockable entries, which allow critical translations to be retained selectively. The MMU provides virtual-to-physical address translation and protection, supporting 4 KB small pages as the base granule, along with larger section (1 MB) and supersection (16 MB) mappings in the standard ARMv7 configuration.

In multi-core variants, the Cortex-A9 uses AMBA AXI interfaces (typically two 64-bit AXI masters per core) for all external memory accesses, with the Snoop Control Unit (SCU) ensuring coherency by snooping AXI transactions and broadcasting invalidations across cores. This AXI4-compatible setup supports system-level interconnects while maintaining low-latency coherence for up to four cores.
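The L1 parameters quoted above (4-way set-associative, 32-byte lines, 16 KB to 64 KB) determine how an address splits into tag, set index, and line offset. The Python sketch below works through that arithmetic; the function names are hypothetical, and it assumes a conventional physically indexed layout purely for illustration.

```python
def cache_geometry(size_bytes, ways=4, line_bytes=32):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the number of sets
    return sets, offset_bits, index_bits


def split_address(addr, size_bytes, ways=4, line_bytes=32):
    """Split a 32-bit address into (tag, set index, byte offset)."""
    sets, offset_bits, index_bits = cache_geometry(size_bytes, ways, line_bytes)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# A 32 KB, 4-way cache with 32-byte lines has 256 sets:
# 5 offset bits, 8 index bits, and the remaining 19 bits as tag.
sets, offset_bits, index_bits = cache_geometry(32 * 1024)
```

Doubling the cache to 64 KB at the same associativity adds one index bit and removes one tag bit, which is why the configurable sizes do not change the line size or way count.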

Key Features

SIMD and Vector Processing

The ARM Cortex-A9 incorporates the NEON Advanced SIMD extension as part of its ARMv7-A architecture, providing a dedicated media processing engine for vector operations. The NEON unit is 128 bits wide and has a register file of 32 64-bit registers (equivalent to 16 full 128-bit vectors) that supports both integer and floating-point operations. These registers are shared with the VFPv3 unit, so scalar and vector floating-point code can work on the same data. Integer operations handle unsigned and signed data types from 8-bit to 64-bit, including polynomial arithmetic over GF(2), while floating-point support focuses on single-precision (32-bit) formats, with limited double-precision scalar capabilities.

NEON instructions provide vector arithmetic such as VADD for element-wise addition and VMUL for multiplication, operating on vectors with up to 16 elements (for example, sixteen 8-bit integers or four 32-bit floats per 128-bit vector). These instructions offer saturation modes, which prevent overflow by clamping results to the representable range, and rounding modes for precise shifts and conversions. Integration with VFPv3 extends this to vectorized floating-point operations, including fused multiply-add (VFMA) instructions that compute a*b + c in a single operation without intermediate rounding, reducing error accumulation in chained computations. This fusion applies to both scalar and vector forms, supporting up to four single-precision elements per instruction.

In terms of performance, the NEON unit can achieve up to 8 single-precision floating-point operations per cycle when it uses the Cortex-A9's dual-issue capability, where two NEON instructions (for example, a multiply followed by an add) are dispatched simultaneously to the execution pipelines. This throughput is realized in multimedia acceleration scenarios, such as H.264 video decoding, where NEON handles inverse transforms on multiple pixel blocks in parallel, and 3D graphics processing, including vertex shading. These capabilities make NEON well suited to embedded applications that process audio, video, and image data streams.
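The lane counts and saturation behaviour described above can be illustrated with a short Python sketch. It models the arithmetic only, on plain lists; it is not NEON code, and the function names are hypothetical.

```python
def lanes(vector_bits, element_bits):
    """Number of elements of a given width that fit in one vector register."""
    return vector_bits // element_bits


def saturating_add_u8(a, b):
    """Element-wise unsigned 8-bit add that clamps at 255 instead of wrapping."""
    return [min(x + y, 255) for x, y in zip(a, b)]


def wrapping_add_u8(a, b):
    """Element-wise unsigned 8-bit add with modulo-256 wraparound, for contrast."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]
```

For pixel data the difference matters: adding 10 to a brightness of 250 saturates to 255 (still white), whereas the wrapping form yields 4 (nearly black).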

Integer and Floating-Point Operations

The ARM Cortex-A9 processor implements scalar integer operations as part of the ARMv7-A architecture, supporting both the traditional 32-bit ARM instruction set and the Thumb-2 instruction set, which combines 16-bit and 32-bit instructions to achieve better code density while maintaining performance comparable to ARM instructions. Scalar integer operations support conditional execution, so an instruction executes only if a specified condition (such as equality or greater-than) is met, which reduces branching. The architecture also includes media-oriented instructions for digital signal processing (DSP) tasks, such as SMLAD, which performs two 16-bit signed multiplies followed by a 32-bit addition and is useful in audio and image processing.

Cycle timings for integer operations vary by instruction type but emphasize low latency for common arithmetic. Basic data-processing instructions like ADD and SUB complete in a single cycle. Multiply operations, such as MUL for 32-bit results, typically require 3-5 cycles depending on operand size and whether accumulation is involved. Division instructions, including signed (SDIV) and unsigned (UDIV), take longer at 10-14 cycles, reflecting the iterative algorithm used. These timings assume in-order execution without interlocks; out-of-order execution in the Cortex-A9 can further improve overall throughput by scheduling around dependent operations.

For floating-point operations, the Cortex-A9 integrates an optional Vector Floating-Point (VFPv3) unit that handles single-precision (32-bit) and double-precision (64-bit) computations in compliance with the IEEE 754 standard, supporting scientific and graphics workloads. The VFPv3 unit includes fused multiply-accumulate (FMA) operations, which combine multiplication and addition into a single rounded result to reduce error accumulation in iterative calculations. Floating-point addition and subtraction require 3 cycles, while division ranges from 14 cycles for single precision to 28 cycles for double precision, due to the reciprocal approximation method employed. The VFPv3 unit can be disabled for power savings in integer-only applications.

The Cortex-A9 also supports the optional Jazelle extension, which accelerates Java execution by allowing direct hardware interpretation of most bytecodes as a third execution state alongside the ARM and Thumb modes, though it is rarely used in modern implementations due to advancements in just-in-time (JIT) compilation.
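As an illustration of the dual 16-bit multiply-accumulate that SMLAD performs, here is a small Python model of its arithmetic. It is a sketch of the instruction's result only (low halves multiplied, high halves multiplied, both added to an accumulator); it ignores the Q flag that the real instruction sets on overflow, and the helper names are hypothetical.

```python
def _s16(value):
    """Interpret the low 16 bits of value as a signed 16-bit integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def smlad(rn, rm, ra):
    """Model of SMLAD: ra + rn.lo * rm.lo + rn.hi * rm.hi, wrapped to 32 bits.

    rn and rm each pack two signed 16-bit values; ra is the 32-bit accumulator.
    """
    lo = _s16(rn) * _s16(rm)
    hi = _s16(rn >> 16) * _s16(rm >> 16)
    return (lo + hi + ra) & 0xFFFFFFFF


# Packed operands (2, 3) and (4, 5) with accumulator 10: 3*5 + 2*4 + 10
result = smlad(0x00020003, 0x00040005, 10)
```

One such instruction per pair of samples is what makes dot products over 16-bit audio data, such as FIR filter taps, cheap on this class of core.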

Security and Virtualization Support

The Cortex-A9 processor incorporates ARM TrustZone technology, which provides hardware-enforced isolation between a secure world for sensitive operations, such as cryptographic processing, and a normal world for general-purpose software. This separation is achieved through a dedicated secure state in the processor, where the secure world maintains exclusive access to protected resources while the normal world operates under restricted privileges. All bus transactions originating from the processor carry a Non-Secure (NS) bit, which tags accesses as secure or non-secure and lets peripherals and memory systems enforce isolation at the hardware level.

Virtualization support in the Cortex-A9 is provided via optional extensions to the ARMv7-A architecture, allowing efficient hypervisor operation through two-stage memory address translation. In this setup, stage-1 translation maps virtual addresses to intermediate physical addresses (IPAs) within a guest operating system, while stage-2 translation, managed by the hypervisor, maps IPAs to physical addresses, supporting up to 40-bit IPAs when the extensions are enabled. These features enable secure partitioning of resources among multiple guests, with the hypervisor running in a non-secure mode to oversee guest isolation without compromising performance.

World switching between secure and normal states is performed by Secure Monitor Calls (SMC), which trigger an exception to enter monitor mode, a privileged state dedicated to handling transitions and maintaining isolation. The processor's interrupt controller integrates with TrustZone by routing interrupts to either secure or non-secure handlers based on configuration bits, such as the FIQ enable bit, so that secure interrupts remain protected from normal-world software.

The Cortex-A9 supports a physical address extension (PAE) of up to 40 bits when configured, expanding the addressable memory space beyond the standard 32 bits to accommodate large systems, such as those with up to 1 TB of RAM. This extension is optional and implementation-defined, allowing integrators to select it for applications requiring extensive physical memory mapping. The memory management unit (MMU) extends these capabilities by supporting separate page tables for the secure and non-secure worlds, where the NS bit determines which translation table is active during address resolution. In virtualization scenarios, the MMU applies both stages of translation, with secure page tables isolated to prevent tampering, reinforcing TrustZone's security model across virtualized environments.
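The role of the NS bit in bus-level isolation can be sketched as a simple access check. The Python below is a deliberately simplified model, assuming a slave that marks each region as secure or non-secure and rejects non-secure transactions to secure regions; the function and region names are hypothetical and do not correspond to any real TrustZone controller interface.

```python
def access_allowed(region_is_secure, ns_bit):
    """Return True if a bus transaction may reach the region.

    ns_bit is 1 for a non-secure (normal world) transaction and 0 for a
    secure one. In this simplified model the secure world may access
    everything, while the normal world is limited to non-secure regions.
    """
    if region_is_secure and ns_bit == 1:
        return False
    return True


# Hypothetical memory map used only for this example
REGIONS = {"key_store": True, "framebuffer": False}


def check(region_name, ns_bit):
    """Look up a named region and apply the NS-bit rule."""
    return access_allowed(REGIONS[region_name], ns_bit)
```

The point of tagging every transaction is that the check happens at the slave, in hardware, so normal-world software cannot bypass it by reconfiguring the processor side.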

Implementations

Single-Core Configurations

The ARM Cortex-A9 single-core processor, also known as the uniprocessor variant, is implemented as a standalone high-performance core without multi-core clustering, targeting embedded and mobile applications. ARM offers this configuration in both synthesizable RTL and hard macro forms to ease integration into systems-on-chip (SoCs) on advanced process nodes. Hard macros are available on 40 nm and 28 nm processes, enabling optimized area and power for production designs.

The single-core Cortex-A9 achieves up to 2.5 GHz in speed-optimized hard macro implementations on 28 nm while maintaining compatibility with the ARMv7-A architecture. Typical clock speeds in mobile deployments range from 1 to 2 GHz, balancing performance and thermal constraints in battery-powered devices. Power consumption for a single core is approximately 500 mW at 1 GHz in power-optimized variants.

Configuration flexibility is a key aspect of single-core setups. L1 caches can be configured as 16 KB, 32 KB, or 64 KB for both instruction and data sides, with four-way set associativity. An optional unified L2 cache, managed via the L2C-310 controller, supports sizes up to 8 MB. Additional options include Jazelle hardware acceleration for direct bytecode execution and ThumbEE extensions for use in dynamic environments. ARM delivers the single-core Cortex-A9 as intellectual property (IP) suitable for standalone use, often integrated via the uniprocessor package that excludes multi-core interconnects.

Multi-Core Variants

The ARM Cortex-A9 MPCore implements multi-core configurations to enable symmetric multiprocessing (SMP), with support for up to four cores in a single cluster while maintaining cache coherency. The dual-core variant is the most prevalent implementation, favored in many designs for its balance of performance gains and power efficiency, as quad-core setups can increase thermal and energy demands without proportional benefits in typical embedded workloads.

In dual-core MPCore setups, the two Cortex-A9 processors share a unified L2 cache configurable up to 8 MB via the PL310 controller, which provides low-latency access and supports speculative linefills to optimize bandwidth. The Snoop Control Unit (SCU) keeps the cores' L1 caches coherent using a snoop-based mechanism that broadcasts cache operations across the cluster. The SCU also arbitrates L2 cache accesses and handles evictions, integrating with the cores' AXI interfaces for memory transactions.

Cache coherency in multi-core Cortex-A9 systems follows a MESI-like protocol for intra-cluster L1 interactions, extended by the AMBA AXI Coherency Extensions (ACE) to support the AXI interconnect and enable coherent external accesses. The integrated Generic Interrupt Controller (GIC) version 1.0 distributes interrupts across cores, supporting up to 224 shared peripheral interrupts (SPIs) with per-core private interrupts for timers and watchdogs.

Performance scaling in dual-core configurations shows near-linear gains in threaded applications, with representative implementations achieving almost 2x the single-core throughput while consuming only about 40% more power.
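The dual-core scaling figures above (almost 2x throughput for about 40% more power) imply an efficiency gain that is easy to work out. The Python sketch below does that arithmetic; it treats the quoted figures as exact inputs, which is an assumption made only for illustration, and the function name is hypothetical.

```python
def perf_per_watt_gain(speedup, power_ratio):
    """Relative performance-per-watt of a configuration versus its baseline.

    speedup: throughput relative to the baseline (e.g. 2.0 for 2x)
    power_ratio: power relative to the baseline (e.g. 1.4 for +40%)
    """
    return speedup / power_ratio


# Dual core versus single core, using the figures quoted in the text
dual_core_gain = perf_per_watt_gain(2.0, 1.4)  # roughly 1.43
```

On those inputs a dual-core cluster delivers roughly 43% more work per watt than a single core on a fully threaded workload; the gain shrinks toward or below 1.0 as the achieved speedup falls on poorly parallelized code.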

Integration in SoCs

The ARM Cortex-A9 core was widely integrated into systems-on-chip (SoCs) for mobile and embedded applications during the early 2010s, using its ARMv7-A compatibility to enable multi-core processing in power-constrained devices.

NVIDIA's Tegra 2, released in 2010, featured a dual-core Cortex-A9 configuration clocked at 1 GHz, making it one of the first mobile SoCs with symmetric multiprocessing support. It powered early Android tablets, combining the CPU with an integrated GPU for multimedia applications.

Samsung's Exynos 4210, introduced in 2011 and manufactured on a 45 nm process, incorporated a dual-core Cortex-A9 setup operating at 1.4 GHz, paired with a Mali-400 MP4 GPU for improved graphics rendering in smartphones, where it supported media playback and multitasking.

Apple's A5 SoC, also launched in 2011 on a 45 nm process (later revised to 32 nm), used a dual-core Cortex-A9 design clocked at 800 MHz in its iPhone 4S variant and at 1 GHz in the iPad 2 configuration; this implementation included custom optimizations for power efficiency alongside a PowerVR SGX543MP2 GPU.

Texas Instruments' OMAP 4 series, spanning models like the OMAP4430 and OMAP4460 from 2011 onward, employed dual-core Cortex-A9 processors scalable up to 1.5 GHz, targeted at both consumer mobile devices and industrial embedded systems. These SoCs included dedicated hardware accelerators for video, making them suitable for smartphones like the Motorola Droid RAZR and for automotive systems.

An example of a quad-core implementation is the NXP i.MX 6Quad, released in 2012 on a 40 nm process, featuring four Cortex-A9 cores at 1.0 GHz with integrated 2D/3D graphics acceleration. It has been widely adopted in industrial, automotive, and embedded systems for applications requiring higher parallelism.

Other notable integrations included low-cost SoCs for budget tablets, such as Rockchip's RK3066 from 2012, which featured a dual-core Cortex-A9 at up to 1.6 GHz with a Mali-400 GPU for affordable Android media devices. Some early entrants in the same market, like Allwinner's A10, used a single Cortex-A8 core instead, which highlights the Cortex-A9's role in bridging performance and cost.

Applications and Performance

Device Adoption

The ARM Cortex-A9 processor powered several first-generation 4G smartphones, including models built on the Tegra 2. These devices marked early adoption in high-speed mobile connectivity, enabling advanced multimedia and multitasking in the Android ecosystem.

In the tablet market, the Cortex-A9 saw significant uptake through the iPad 2, which used the custom A5 SoC with a dual-core Cortex-A9 configuration, contributing to over 30 million units sold during its lifecycle and helping establish tablets as mainstream consumer devices. Similarly, early Android tablets employed the Tegra 2 SoC with its dual-core Cortex-A9.

The processor also appeared in set-top boxes and early smart televisions, notably powering Google TV platforms such as LG's L9 chipset-based models, which integrated a dual-core Cortex-A9 for streaming and app integration. LG's early Google TV devices, like the 47LM6700 series, brought internet-connected features to home entertainment systems.

In automotive and embedded applications, the Freescale (now NXP) i.MX6 series, based on single- to quad-core Cortex-A9 configurations, was widely used in in-vehicle systems for features like media playback and connectivity. The i.MX6 supported rugged environments and powered vehicle dashboards.

The Cortex-A9 reached its market peak as the dominant processor in the 2011-2013 Android ecosystem, with widespread shipments across licensees enabling billions of devices in smartphones, tablets, and embedded systems.

Benchmark Comparisons

The Cortex-A9 processor shows substantial performance gains over the Cortex-A8, delivering more than 50% higher overall performance in single-core setups due to its out-of-order execution and dual-issue pipeline. In some workloads it achieves roughly twice the performance of the Cortex-A8 at equivalent clock speeds, while tasks using the NEON SIMD extensions show up to three times the throughput, benefiting from enhanced vector processing and reduced pipeline stalls. Benchmark results from Geekbench 2 indicate dual-core Cortex-A9 configurations scoring approximately 800-1000 points, placing them on par with the Intel Atom N450 in contemporary applications. Compared to the later Cortex-A15, the A9 is 30-50% slower per clock cycle in CPU-intensive tasks but consumes less power, making it suitable for efficiency-focused designs. Power efficiency is around 1000 DMIPS per watt in 28 nm processes, as evaluated via Dhrystone metrics, with the core rated at 2.5 DMIPS/MHz.

Benchmark | Cortex-A9 (Single-Core, ~1 GHz) | Comparison Context
Dhrystone | 2.5 DMIPS/MHz | Baseline for power-normalized efficiency in 28 nm.
Geekbench 2 (Dual-Core) | ~800-1000 | Comparable to Intel Atom N450 multi-threaded loads.

NEON acceleration further boosts multimedia benchmarks, contributing to the A9's edge in vector-heavy workloads over in-order designs like the A8.
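The 2.5 DMIPS/MHz rating converts to an absolute Dhrystone figure once a clock speed is chosen. The Python sketch below shows that conversion; the function name is hypothetical, and multiplying by the core count assumes ideal scaling across cores, which real workloads rarely achieve.

```python
def dhrystone_mips(dmips_per_mhz, clock_mhz, cores=1):
    """Estimated aggregate DMIPS, assuming ideal scaling across cores."""
    return dmips_per_mhz * clock_mhz * cores


# One core at 1 GHz with the 2.5 DMIPS/MHz rating from the text
single_core = dhrystone_mips(2.5, 1000)         # 2500.0 DMIPS
# Two cores at 1 GHz under the (optimistic) ideal-scaling assumption
dual_core = dhrystone_mips(2.5, 1000, cores=2)  # 5000.0 DMIPS
```

Dhrystone is a small synthetic integer benchmark, so these numbers are best read as a way to normalize cores against each other rather than as a prediction of application performance.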

Legacy and Modern Relevance

The ARM Cortex-A9 processor contributed significantly to ARM's dominance in the mobile market by introducing scalable multi-core configurations that balanced performance and power efficiency for battery-constrained devices. Its MPCore variant, supporting up to four cache-coherent cores, enabled high-performance applications in early smartphones and tablets and set the stage for heterogeneous architectures like big.LITTLE. This multi-core design helped ARM capture a substantial share of the growing mobile processor market and influenced the shift toward clustered processing in portable devices.

As of 2025, the Cortex-A9 continues to find use in legacy embedded and industrial applications, particularly where cost and long-term stability outweigh the need for cutting-edge performance. For instance, NXP's i.MX 6DualPlus processor, featuring dual Cortex-A9 cores, remains available for multimedia-enabled products, industrial IoT devices, and automotive systems like e-cockpits. Similarly, Artila's Matrix-770 serves as an Ubuntu Core-based IIoT gateway for industrial networking. These uses show its persistence in sectors such as IoT gateways and alternatives to higher-end single-board computers, where mature ecosystems keep it viable.

ARM has not declared the Cortex-A9 end-of-life and maintains support through long-term maintenance agreements, with implementations like NXP's i.MX 6 series projected to receive updates until at least 2035. While new licensing for the A9 has diminished since the mid-2010s in favor of ARMv8-based designs, existing deployments benefit from sustained vendor support, including compatibility fixes and patches for embedded systems. The Cortex-A9 also shaped subsequent multi-core ARM designs by pioneering cache-coherent multiprocessing in the high-performance segment.
Its adherence to the ARMv7-A architecture provides backward code compatibility with ARMv8 processors through the AArch32 execution state, allowing legacy A9 software to run on modern 64-bit ARM systems without major rewrites. However, it has been outpaced by ARMv8 cores in power efficiency; for example, the Cortex-A53 delivers comparable single-threaded performance to the A9 while consuming approximately 40% less area and energy, making newer cores preferable for demanding applications. Despite this, the A9 retains cost-effectiveness for low-end embedded tasks, where its proven integration and lower licensing overhead justify continued use over more advanced alternatives. Successors like the Cortex-A53 have built upon this foundation, emphasizing efficiency in entry-level multi-core scenarios.

References

  1. https://en.wikichip.org/wiki/samsung/exynos/4210