ARM Cortex-A76
The ARM Cortex-A76 is a high-performance, 64-bit CPU core developed by Arm, implementing the Armv8.2-A architecture with support for extensions including Armv8.1-A, Armv8.3-A (LDAPR only), Armv8.4-A (SDOT/UDOT), Armv8.5-A (PSTATE SSBS), the Cryptographic Extension, and the RAS Extension. Announced on May 31, 2018, it features a superscalar, out-of-order microarchitecture built on DynamIQ technology, designed to deliver laptop-class single-threaded performance while maintaining smartphone-level power efficiency for demanding tasks in mobile and edge devices. The core supports AArch64 execution at all exception levels (EL0-EL3) and AArch32 at EL0 only, with ISA compatibility for the A64, A32, and T32 instruction sets. Key architectural elements include a non-blocking, high-throughput L1 cache system with 64 KB instruction and 64 KB data caches, a private L2 cache configurable from 128 KB to 512 KB per core, and an optional shared L3 cache of up to 4 MB. It incorporates advanced features such as decoupled branch prediction, a 4-wide decode unit, fourth-generation hardware prefetchers for instructions and data, and dual-issue 128-bit NEON and floating-point units that double the throughput of prior Arm CPUs. The core supports up to four CPUs per DynamIQ cluster, 40-bit physical addressing with LPAE support, ECC for reliability, and AMBA ACE or CHI interfaces for system coherence, along with GICv4 interrupts, Armv8-A generic timers, CoreSight v3 debug, and ETMv4.2 trace capabilities. In terms of performance, the Cortex-A76 provides a 35% uplift in single-threaded performance over its predecessor, the Cortex-A75, and up to 40% improved power efficiency at equivalent performance levels, enabling extended battery life for complex workloads such as machine learning (with a 4x improvement in low-precision ML tasks) and productivity applications. Optimized for 7 nm and more advanced process nodes, it targets premium smartphones, laptops, automotive systems (including ASIL-D safety compliance via the Cortex-A76AE variant), and other edge-to-cloud devices requiring high efficiency and compute density.

Development

Announcement

The ARM Cortex-A76 CPU core was announced by Arm on May 31, 2018, marking the introduction of its latest premium processor design for high-performance mobile and embedded applications. The unveiling occurred alongside Computex 2018 in Taipei, where Arm emphasized the core's role in enabling "laptop-class performance with mobile efficiency" through advancements in the DynamIQ architecture. Internally codenamed "Enyo," the Cortex-A76 implements the ARMv8.2-A instruction set and is optimized for 7 nm manufacturing processes, supporting clock speeds up to 3.0 GHz. During the announcement, Arm positioned the Cortex-A76 as the successor to the Cortex-A75, highlighting its 64-bit-only kernel-mode execution for enhanced security and efficiency in modern software stacks. Key performance claims included a 35% uplift in single-threaded performance over the A75 at the same power envelope, or up to 40% improved power efficiency at equivalent performance levels, based on internal evaluations using TSMC's 7 nm process. Additionally, Arm touted 4x faster machine learning inference and improvements in complex workloads like web browsing compared to the previous generation, underscoring its focus on AI and sustained performance for mobile devices. Arm indicated that the Cortex-A76 would enter production availability in the second half of 2018, with commercial silicon integration expected in devices launching in the second half of 2019, enabling broader adoption in smartphones, tablets, and high-end routers. The announcement also coincided with reveals of complementary IP, including the Mali-G76 GPU and Mali-V76 video processor, intended to form a cohesive IP suite for next-generation SoCs.

Design Objectives

Development of the Cortex-A76 began in 2013. The ARM Cortex-A76 was developed with the primary objective of bridging the performance gap between mobile and laptop-class computing, delivering high-end computational capabilities while maintaining the power efficiency essential for battery-constrained devices. Announced on May 31, 2018, as part of ARM's client CPU roadmap, the core aimed to support the transition to 7 nm process nodes and enable always-connected experiences in the era of 5G connectivity. This design philosophy addressed the slowing pace of Moore's law by focusing on architectural innovations that provide substantial single-threaded performance gains without proportionally increasing power consumption. Key targets included achieving a 35% uplift in instructions per clock (IPC) compared to the preceding Cortex-A75, emphasizing superscalar out-of-order execution and advanced branch prediction to handle complex workloads more effectively. The microarchitecture was re-engineered to prioritize energy efficiency for sustained tasks, such as productivity applications and emerging machine learning at the edge, while extending battery life in mobile scenarios. ARM emphasized that these improvements would allow the Cortex-A76 to run desktop-like applications seamlessly on smartphones and laptops, fostering a unified computing experience across devices. In terms of applications, the Cortex-A76 was optimized for premium mobile SoCs targeting smartphones, Windows on ARM laptops, and automotive systems, with variants like the Cortex-A76AE incorporating safety features for autonomous driving. The design sought to balance raw performance with the thermal and power envelopes typical of mobile platforms, enabling features like always-on AI processing and high-fidelity graphics without compromising responsiveness. Overall, these objectives positioned the core as a foundational element for next-generation client devices, where efficiency and scalability are paramount.

Architecture

Microarchitecture Overview

The ARM Cortex-A76 is a high-performance, 64-bit CPU core implementing the Armv8.2-A architecture, featuring a ground-up redesigned out-of-order superscalar microarchitecture optimized for sustained high performance in mobile and laptop-class applications. It is designed for integration with Arm's DynamIQ technology, allowing flexible multi-core configurations in DynamIQ Shared Units (DSUs) with up to four Cortex-A76 cores per cluster. The core supports 40-bit physical addressing for up to 1 TB of memory and includes separate 64 KB instruction and 64 KB data L1 caches, each 4-way set-associative and virtually indexed, physically tagged. A private L2 cache per core, configurable from 128 KB to 512 KB, provides low-latency access with a 9-cycle load-to-use latency, while an optional shared L3 cache in the DSU ranges from 512 KB to 4 MB.

The front-end of the pipeline emphasizes high instruction throughput and efficient branch handling through a prediction unit that operates independently of the instruction fetch stage, enabling the predictor to run at double the fetch bandwidth to mask misprediction penalties. The fetch unit delivers 4 to 8 instructions per cycle, supported by multi-level branch target caches and a hybrid indirect predictor to maximize accuracy and throughput in complex code paths. Following fetch, the pipeline includes Arm's first 4-wide decode stage, capable of renaming and dispatching up to 8 micro-operations per cycle to the out-of-order engine, which features a deep reorder buffer for handling dependencies and speculation. This design contributes to a 35% increase in single-threaded performance compared to the predecessor Cortex-A75.

In the execution backend, the Cortex-A76 employs quad-issue execution with three simple arithmetic logic units (ALUs) for basic operations and one complex ALU handling multi-cycle instructions like division and multiplication, enabling high throughput for scalar workloads. Floating-point and Advanced SIMD (NEON) processing is powered by dual 128-bit pipelines, doubling the vector/FP bandwidth over prior designs and delivering up to 4x performance for low-precision machine learning inference tasks. The load/store unit supports deep memory-level parallelism with a sophisticated fourth-generation hardware prefetcher, optimizing for bandwidth-intensive applications while maintaining a 4-cycle L1 load-to-use latency; it interfaces via AMBA ACE or CHI protocols for system-level coherence. Security features include Arm TrustZone, optional cryptography extensions (AES, SHA, PMULL), and RAS (Reliability, Availability, Serviceability) support, with ECC protection available for caches and interconnects. Overall, these elements enable a 35% performance uplift and 40% power-efficiency improvement over the Cortex-A75 at iso-process and frequency.
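As an illustration of how software can tell which cores in such a DynamIQ cluster are Cortex-A76 parts, the following minimal sketch reads the MIDR-derived "CPU part" field that Linux exposes in /proc/cpuinfo; the part number 0xd0b used here for the Cortex-A76 is an assumption to verify against Arm's documentation for a given platform.

```c
/* Minimal sketch: count Cortex-A76 cores on a Linux/AArch64 system by
 * parsing /proc/cpuinfo. Assumes the kernel reports "CPU part : 0xd0b"
 * for Cortex-A76 cores (verify against the Arm TRM for your SoC). */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    int cpu = -1, a76_count = 0;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "processor : %d", &cpu) == 1)
            continue;                               /* remember current logical CPU */
        unsigned part;
        if (sscanf(line, "CPU part : 0x%x", &part) == 1 && part == 0xd0b) {
            printf("cpu%d reports part 0x%03x (Cortex-A76)\n", cpu, part);
            a76_count++;
        }
    }
    fclose(f);
    printf("%d Cortex-A76 core(s) detected\n", a76_count);
    return 0;
}
```

On a typical big.LITTLE design this would report the A76 cores alongside the Cortex-A55 cores, which carry a different part number.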

Pipeline Design

The ARM Cortex-A76 implements a high-performance, superscalar, out-of-order pipeline optimized for both integer and floating-point workloads in power-constrained environments. This design supports the ARMv8.2-A architecture and integrates with DynamIQ technology for flexible multi-core configurations. The pipeline emphasizes sustained performance through advanced speculation and parallelism, targeting applications from smartphones to edge servers.

At its core, the pipeline spans 13 stages, balancing depth for high clock frequencies (up to 3.0 GHz on 7 nm processes) with latency management for efficient instruction throughput. The front-end operates as a 4-wide superscalar unit, with the fetch stage delivering up to 8 instructions per cycle from a 64 KiB L1 instruction cache and decoding ARM instructions into micro-operations (uops). This includes macro-op fusion to reduce decode pressure and improve uop density for common instruction sequences. Once decoded, uops enter a register rename stage before dispatch to a reorder buffer supporting a 128-entry instruction window, enabling dynamic reordering to tolerate dependencies and hide latencies.

The back-end features an 8-wide dispatch to specialized execution pipelines, including three simple integer ALUs, one complex integer ALU, two load/store units, a branch execution unit, and two vector/floating-point units for floating-point and Advanced SIMD (NEON) operations. The load/store units connect to a 64 KiB L1 data cache, exploiting memory-level parallelism with dual ports for concurrent accesses. Retirement occurs in-order at up to 4 instructions per cycle, ensuring architectural state consistency while the out-of-order engine maximizes utilization. This structure allows the core to sustain high IPC, with reported uplifts of 35% in single-threaded performance over the Cortex-A75.

Branch prediction plays a critical role in maintaining momentum, employing a predictor decoupled from the fetch unit to precompute targets and directions ahead of time. It incorporates a multilevel branch target buffer (BTB) with twice the bandwidth of the fetch unit, supporting indirect branches and improving accuracy on complex control flow, reducing misprediction rates compared to prior generations. Mispredictions incur an 11-cycle penalty, mitigated by the deep out-of-order window. Overall, these elements enable the Cortex-A76 to deliver desktop-like responsiveness with mobile efficiency, as evidenced in implementations like Qualcomm's Snapdragon 855.
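The practical importance of branch prediction can be illustrated with a small, self-contained experiment that is not A76-specific: summing only the large elements of an array is markedly slower when the data, and therefore the branch outcome, is random than when it is sorted, because the predictor cannot learn a random pattern. This is a generic sketch; depending on the compiler and optimization level, the branch may be converted into a conditional select, which hides the effect.

```c
/* Minimal sketch: predictable vs. unpredictable data-dependent branches.
 * The same loop runs over random data (branch mispredicts often) and then
 * over sorted data (predictor learns the pattern). Timings are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static void run(const int *v, const char *label) {
    struct timespec t0, t1;
    long long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int rep = 0; rep < 50; rep++)
        for (int i = 0; i < N; i++)
            if (v[i] >= 128)            /* hard to predict on random data */
                sum += v[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("%-8s sum=%lld  %.1f ms\n", label, sum, ms);
}

static int cmp(const void *a, const void *b) { return *(const int *)a - *(const int *)b; }

int main(void) {
    int *v = malloc(N * sizeof *v);
    for (int i = 0; i < N; i++) v[i] = rand() % 256;
    run(v, "random");                   /* mispredicts roughly half the time */
    qsort(v, N, sizeof *v, cmp);
    run(v, "sorted");                   /* predictor learns the pattern */
    free(v);
    return 0;
}
```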

Memory Hierarchy

The ARM Cortex-A76 employs a multi-level cache hierarchy optimized for low latency and high bandwidth in high-performance mobile and embedded systems, featuring private per-core L1 and L2 caches alongside an optional shared L3 cache to support efficient access patterns in multi-core configurations. This design balances the demands of sustained performance with power efficiency, enabling the core to handle complex workloads while minimizing stalls from memory dependencies.

At the first level, each Cortex-A76 core includes a private 64 KB instruction cache (L1I) and a 64 KB data cache (L1D), both implemented as 4-way set-associative structures with 64-byte cache lines to facilitate rapid access and prefetching of instructions and data. The L1 caches support write-back and write-allocate policies, with a load-to-use latency of 4 cycles, allowing the out-of-order engine to overlap memory operations effectively and reduce pipeline bubbles. Additionally, the L1 data cache incorporates hardware prefetching mechanisms that detect common access patterns, such as sequential or stride-based loads, to proactively fetch data into the cache and further mitigate latency impacts on performance-critical applications.

The second-level cache (L2) is private to each core and configurable in size from 128 KB to 512 KB, operating as a 16-way set-associative, inclusive unified cache that backs the L1 caches with a latency of approximately 9 cycles for load-to-use operations. This L2 structure provides a 256-bit read interface from the cache and a matching write interface, supporting up to two 128-bit loads or stores per cycle to sustain the core's dual-issue load/store capabilities while ensuring coherence through AMBA CHI or ACE protocols in multi-core systems. The inclusive design simplifies coherence management by automatically invalidating L1 lines upon L2 eviction, contributing to predictable behavior in cache-coherent environments.

An optional shared L3 cache, ranging from 512 KB to 4 MB, can be implemented at the cluster level to serve multiple Cortex-A76 cores, offering a latency of 26 to 31 cycles and enhancing bandwidth for shared data access in scenarios like multi-threaded applications. This level integrates with the system's interconnect fabric to maintain coherence and supports ECC for reliability in enterprise-grade deployments.

The memory management unit (MMU) complements the cache hierarchy with dedicated translation lookaside buffers (TLBs) to accelerate virtual-to-physical address translations. The L1 instruction TLB (ITLB) and data TLB (DTLB) are each 48-entry fully associative arrays, natively supporting page sizes of 4 KB, 16 KB, 64 KB, 2 MB, 32 MB, and 512 MB for efficient handling of large memory mappings common in 64-bit ARMv8.2-A environments. These are backed by a unified L2 TLB with 1280 entries organized as 5-way set-associative, which handles misses from the L1 TLBs and interfaces with the page table walker to minimize translation overhead during cache fills or direct accesses. The TLB design incorporates support for large physical address extensions (LPAE) up to 40 bits, ensuring scalability for systems with expansive memory footprints.
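The latency tiers described above can be observed from software with a pointer-chasing microbenchmark: each load depends on the previous one, so the average time per hop approximates the load-to-use latency of whichever level the working set fits into. The sketch below is a minimal illustration; the working-set sizes are only rough stand-ins for L1, L2, L3, and DRAM, and the results vary by SoC rather than being authoritative A76 figures.

```c
/* Minimal sketch: dependent-load (pointer-chasing) latency vs. working-set size.
 * A random cyclic permutation defeats simple stride prefetching, so each hop
 * pays roughly the latency of the cache level holding the working set. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(size_t bytes) {
    size_t n = bytes / sizeof(size_t);
    size_t *perm = malloc(n * sizeof *perm);
    size_t *next = malloc(n * sizeof *next);

    /* Build a random permutation, then link it into a single cycle. */
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++) next[perm[i]] = perm[(i + 1) % n];

    struct timespec t0, t1;
    size_t hops = 50u * 1000 * 1000, cur = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < hops; i++) cur = next[cur];   /* serialized loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile size_t sink = cur; (void)sink;              /* keep the chain live */
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    free(perm); free(next);
    return ns / hops;
}

int main(void) {
    size_t sizes[] = { 32 << 10, 256 << 10, 2 << 20, 16 << 20 }; /* ~L1, L2, L3, DRAM */
    for (size_t i = 0; i < sizeof sizes / sizeof *sizes; i++)
        printf("%8zu KiB working set: %.1f ns per dependent load\n",
               sizes[i] >> 10, chase(sizes[i]));
    return 0;
}
```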

Key Features

Instruction Set Extensions

The ARM Cortex-A76 core implements the ARMv8-A instruction set architecture, supporting the 64-bit AArch64 execution state with the fixed-length 32-bit A64 instruction set, as well as the 32-bit AArch32 execution state using the A32 (ARM) and T32 (Thumb) instruction sets. AArch32 support is limited to the EL0 (user mode) exception level. These base instruction sets provide the foundation for general-purpose computing, including scalar integer operations, Advanced SIMD (NEON) for vector processing, and floating-point arithmetic via the VFPv4 architecture.

The core incorporates several extensions to the ARMv8-A base, enhancing performance in areas such as atomic operations, cryptography, reliability, and memory consistency. The ARMv8.1-A extension adds atomic access instructions under the Large System Extensions (LSE) feature, including load-add (LDADD), load-clear (LDCLR), load-set (LDSET), and swap (SWP) variants for byte, halfword, word, and doubleword sizes in AArch64. These instructions enable lock-free programming and improve scalability in multi-core environments by providing single-copy atomicity without requiring exclusive monitors. Additionally, ARMv8.1-A introduces advanced SIMD instructions for half-precision (FP16) floating-point operations and support for 4 KB descriptors in AArch32.

Building on ARMv8.1-A, the ARMv8.2-A extension includes mandatory support for half-precision floating-point in the scalar and Advanced SIMD units, with instructions like FCVT (convert between FP16 and other formats) and FMUL (multiply FP16). It also adds enhancements for large systems, such as improved virtualization and memory management, though the Cortex-A76 does not implement optional components like the Scalable Vector Extension (SVE). The ARMv8.4-A extension adds dot-product instructions to Advanced SIMD (UDOT and SDOT for unsigned and signed 8-bit dot products), which accelerate matrix multiplications and are particularly beneficial for machine learning workloads. The ARMv8.5-A extension provides support for the PSTATE Speculative Store Bypass Safe (SSBS) bit, which helps mitigate speculative store bypass vulnerabilities.

An optional Cryptographic Extension, based on the ARMv8-A Cryptography feature, integrates hardware acceleration directly into the Advanced SIMD unit with new A64, A32, and T32 instructions. These include AES instructions (AESE for encrypt, AESD for decrypt, AESMC for mix columns), SHA-1 instructions (SHA1C, SHA1M, SHA1H, SHA1SU0, SHA1SU1), SHA-256 instructions (SHA256H, SHA256H2, SHA256SU0, SHA256SU1), polynomial multiplication (PMULL and PMULL2 for carryless multiply used in GCM mode), and CRC-32 computation (CRC32B, CRC32H, CRC32W, CRC32X, CRC32CB, CRC32CH, CRC32CW, CRC32CX). Optional sub-features add SHA-3 (EOR3, RAX1, XAR, BCAX) and the Chinese SM3/SM4 algorithms. This extension significantly boosts throughput for encryption and hashing in security-critical applications.

The Reliability, Availability, and Serviceability (RAS) extension, introduced in ARMv8.2-A, adds the Error Synchronization Barrier (ESB) instruction across A32, T32, and A64 to ensure error records are visible before proceeding, along with new system registers (e.g., ERRIDR_EL1 and ERXFR_EL1 for error record identification and access). These facilitate hardware error detection, reporting, and recovery, enhancing system robustness in server and high-reliability environments.

Finally, the core provides partial support for ARMv8.3-A through the Load-Acquire RCpc (Release Consistent processor consistent) instructions, specifically LDAPR, LDAPRB, LDAPRH, and LDAPRX.
These load-acquire operations offer weaker ordering guarantees than full acquire semantics, allowing reordering with preceding store-release operations to different addresses for improved performance in concurrent programming while maintaining compatibility with C++ memory models. Full ARMv8.3-A features, such as pointer authentication, are not supported.
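As a concrete example of the dot-product extension described above, the sketch below uses the ACLE intrinsic vdotq_u32, which is available when the compiler defines __ARM_FEATURE_DOTPROD (for example, when building with -march=armv8.2-a+dotprod), to accumulate an 8-bit dot product, with a plain-C fallback for cores lacking the extension. It is a minimal illustration rather than tuned code.

```c
/* Minimal sketch: 8-bit dot product using the UDOT instruction via ACLE
 * intrinsics. Each vdotq_u32 call accumulates 16 byte products into four
 * 32-bit lanes. Build example: gcc -O2 -march=armv8.2-a+dotprod dot.c */
#include <stdint.h>
#include <stdio.h>

#if defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>

uint32_t dot_u8(const uint8_t *a, const uint8_t *b, size_t n) {
    uint32x4_t acc = vdupq_n_u32(0);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        acc = vdotq_u32(acc, va, vb);      /* 16 x (u8 * u8) accumulated into 4 x u32 */
    }
    uint32_t sum = vaddvq_u32(acc);        /* horizontal add of the four lanes */
    for (; i < n; i++) sum += (uint32_t)a[i] * b[i];
    return sum;
}
#else
/* Portable fallback when the dot-product extension is unavailable. */
uint32_t dot_u8(const uint8_t *a, const uint8_t *b, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += (uint32_t)a[i] * b[i];
    return sum;
}
#endif

int main(void) {
    uint8_t a[64], b[64];
    for (int i = 0; i < 64; i++) { a[i] = (uint8_t)i; b[i] = 2; }
    printf("dot = %u\n", dot_u8(a, b, 64));   /* 2 * (0 + 1 + ... + 63) = 4032 */
    return 0;
}
```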

Security and Virtualization

The ARM Cortex-A76 core, based on the ARMv8-A architecture, provides robust hardware support for security through TrustZone technology, which enforces isolation between secure and non-secure execution environments, with the Secure Monitor running at exception level EL3. This enables the implementation of a trusted execution environment (TEE) for protecting sensitive data and operations, such as cryptographic keys and secure boot processes, from untrusted software in the normal world. TrustZone extends to peripherals, interrupts, and memory, allowing system-wide partitioning configurable by the secure monitor. Additionally, the optional Cryptographic Extension accelerates common security algorithms, including AES encryption/decryption in modes like ECB, CBC, and GCM, as well as SHA-1, SHA-256, and SHA-512 hashing, enabling efficient handling of secure communications and integrity checks.

For virtualization, the Cortex-A76 implements the full ARMv8-A virtualization extensions, supporting EL2 (hypervisor) mode to manage multiple guest operating systems with isolated virtual address spaces and resources. The memory management unit (MMU) facilitates this through stage-2 address translations, enabling efficient memory virtualization while maintaining protection against guest-to-guest interference. The core also includes the Virtualization Host Extensions (VHE) from ARMv8.1-A, which allow the host OS to execute at EL2 with near-native performance by reducing unnecessary traps and context switches for host operations such as system calls. This VHE support, combined with Address Space ID (ASID) management at EL2, reduces virtualization overhead in multi-tenant environments like cloud or server applications. These features integrate seamlessly in DynamIQ Shared Unit (DSU) configurations, where multiple Cortex-A76 cores can share cache and coherency resources across secure and virtualized contexts, supporting scalable deployments in devices requiring both isolation and efficiency, such as smartphones and edge servers.
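Because the Cryptographic Extension is optional per implementation, software typically probes for it at run time before selecting accelerated code paths. The minimal sketch below applies only to Linux on AArch64, where the kernel reports feature bits through the auxiliary vector; the HWCAP bit names come from the kernel's hwcap interface.

```c
/* Minimal sketch: runtime detection of optional Cortex-A76 features on
 * Linux/AArch64 via the auxiliary vector (AT_HWCAP). Only compiles on
 * AArch64, where <asm/hwcap.h> defines these bits. */
#include <stdio.h>
#include <sys/auxv.h>      /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>     /* HWCAP_AES, HWCAP_SHA2, HWCAP_ATOMICS, ... */

int main(void) {
    unsigned long caps = getauxval(AT_HWCAP);
    printf("AES          : %s\n", (caps & HWCAP_AES)     ? "yes" : "no");
    printf("PMULL        : %s\n", (caps & HWCAP_PMULL)   ? "yes" : "no");
    printf("SHA1         : %s\n", (caps & HWCAP_SHA1)    ? "yes" : "no");
    printf("SHA2         : %s\n", (caps & HWCAP_SHA2)    ? "yes" : "no");
    printf("CRC32        : %s\n", (caps & HWCAP_CRC32)   ? "yes" : "no");
    printf("LSE atomics  : %s\n", (caps & HWCAP_ATOMICS) ? "yes" : "no");
    return 0;
}
```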

Performance and Efficiency

Benchmark Results

The ARM Cortex-A76 demonstrated significant performance advancements over its predecessor, the Cortex-A75, particularly in integer and floating-point workloads. In SPECint2006 benchmarks, the A76 achieved a 25% improvement in integer performance compared to the A75 when evaluated at the same process node and frequency. Similarly, SPECfp2006 results showed a 35% uplift in floating-point operations under identical conditions. These gains were validated through early implementations such as Huawei's Kirin 980 SoC, where the A76-based cores delivered 1.89 times the integer performance and 2.04 times the floating-point performance of the Cortex-A73 in the Snapdragon 835, at 2.6 GHz versus 2.45 GHz. Efficiency metrics further highlighted the A76's design strengths, with ARM reporting up to 40% better power efficiency at performance levels equivalent to the A75, enabling sustained operation in mobile and laptop scenarios without excessive thermal constraints. Memory subsystem enhancements contributed substantially, as LMBench tests indicated a 90% increase in memory bandwidth over the A75, reducing bottlenecks in data-intensive tasks. In real-world SoC integrations like Qualcomm's Snapdragon 855, which clocked A76-derived cores at up to 2.84 GHz, single-threaded Geekbench 4 scores reached approximately 3,500, representing a 45% leap over the Snapdragon 845's A75-based configuration, while multi-threaded scores approached 11,000.
Benchmark | Cortex-A76 result | Implementation example
SPECint2006 (integer, vs. A75) | +25% | Iso-process/iso-frequency
SPECfp2006 (floating-point, vs. A75) | +35% | Iso-process/iso-frequency
Memory bandwidth (LMBench, vs. A75) | +90% | N/A
SPECint2006 (vs. A73) | 1.89x | Kirin 980 @ 2.6 GHz
SPECfp2006 (vs. A73) | 2.04x | Kirin 980 @ 2.6 GHz
Geekbench 4 single-core (vs. A75) | +45% | Snapdragon 855 @ 2.84 GHz
Overall, these results positioned the A76 as a foundational core for mobile devices in 2018 and 2019, balancing high throughput with the thermal and power constraints typical of battery-powered systems. ARM's internal modeling projected significant uplifts in SPEC suites across early adopters.

Power Consumption

The ARM Cortex-A76 core is engineered for high performance within the constrained power envelopes typical of mobile and embedded systems, achieving significant efficiency gains through microarchitectural optimizations such as improved branch prediction, wider execution pipelines, and enhanced prefetching mechanisms that reduce wasted work from stalls and cache misses. These design choices enable the core to deliver laptop-class computational throughput while adhering to smartphone-level power budgets, supporting extended battery life in devices like premium mobiles and always-connected PCs.

Compared to its predecessor, the Cortex-A75, the A76 provides a 40% improvement in power efficiency at equivalent performance levels, or up to 40% higher performance within the same power allocation. This uplift stems from targeted reductions in area and power overheads in the out-of-order engine and memory subsystem, alongside integration with ARM's DynamIQ technology, which facilitates heterogeneous clustering with low-power cores like the Cortex-A55 for workload-specific frequency and voltage scaling. In practice, such efficiencies contribute to over 20 hours of battery life in ARM-based devices running productivity applications.

The core's power profile benefits from advanced features including fine-grained power domains for the SIMD and floating-point units, as well as support for ARM's Maximum Power Mitigation Mechanism (MPMM), which uses activity monitors to dynamically cap power draw during thermal events without full throttling. When implemented on 7 nm process nodes at frequencies up to 3 GHz, these elements ensure the A76 maintains competitive energy-per-instruction metrics, particularly for machine learning inference tasks, where it achieves 4x the performance of prior generations at iso-power. Overall, the design prioritizes sustainable efficiency for sustained workloads, balancing peak performance with low leakage and active power dissipation.
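One practical way to observe the frequency scaling behavior described above is to read the Linux kernel's cpufreq interface, which reports each core's current operating point; on a typical A76+A55 design the big and LITTLE clusters scale independently. The following minimal sketch assumes the standard cpufreq sysfs layout and values reported in kHz.

```c
/* Minimal sketch: print the current cpufreq operating point of each CPU on
 * Linux, e.g. to watch big (Cortex-A76) and LITTLE (Cortex-A55) clusters
 * scale independently under load. Assumes the standard sysfs paths exist. */
#include <stdio.h>

int main(void) {
    for (int cpu = 0; ; cpu++) {
        char path[128];
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        FILE *f = fopen(path, "r");
        if (!f) break;                  /* stop at the first CPU without cpufreq */
        long khz = 0;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu%d: %ld MHz\n", cpu, khz / 1000);
        fclose(f);
    }
    return 0;
}
```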

Implementations and Usage

Licensing Model

The ARM Cortex-A76 core is licensed by Arm as semiconductor intellectual property (IP) to semiconductor manufacturers, fabless design companies, and system integrators for incorporation into custom system-on-chip (SoC) designs. This licensing enables licensees to configure the core within Arm's DynamIQ Shared Unit (DSU) for scalable multi-core clusters, supporting integration with other Arm IP such as GPUs, interconnects, and memory controllers via standard AMBA interfaces.

One licensing pathway for the Cortex-A76 is Arm Flexible Access, a subscription-based model that provides broad, low-barrier entry to Arm's IP portfolio, including the Cortex-A series. Under this program, eligible parties, ranging from startups and research institutions to established enterprises, gain broad access to IP, models, and tools without large upfront fees, with costs deferred until tape-out or production. Qualifying startups and academic users receive zero-cost access for prototyping and evaluation, while commercial use incurs per-project fees or royalties scaled to volume, promoting innovation in mobile, automotive, and IoT applications.

Arm also supports traditional licensing options, such as perpetual or time-bound subscriptions, which involve negotiated upfront payments for IP rights followed by per-unit royalties upon commercialization. These models allow for customized configurations and are tailored to high-volume producers, ensuring compliance with Arm's specifications while permitting limited modifications under separate agreements. All licenses emphasize royalty-based revenue to align with Arm's ecosystem-driven business strategy.

Adopted SoCs and Devices

The ARM Cortex-A76 core saw widespread adoption in high-end mobile system-on-chips (SoCs) starting in late 2018, primarily for premium smartphones seeking improved performance and efficiency over previous generations. Early implementations focused on DynamIQ-compatible configurations combining A76 performance cores with Cortex-A55 efficiency cores, enabling balanced big.LITTLE-style architectures for demanding tasks like gaming and AI processing. These SoCs marked a shift toward laptop-class CPU capabilities in mobile devices while maintaining power constraints suitable for battery-powered platforms.

HiSilicon's Kirin 980 was the first commercial SoC to integrate the Cortex-A76, announced in September 2018 and fabricated on a 7 nm process. It features a tri-cluster setup with two high-performance A76 cores at 2.6 GHz, two mid-performance A76 cores at 1.92 GHz, and four A55 cores at 1.8 GHz, delivering up to 75% better single-threaded performance compared to the prior Kirin 970. This SoC powered Huawei flagship devices, including the Mate 20, Mate 20 Pro, and Honor View 20, emphasizing advancements in on-device AI via its dual-NPU design.

Qualcomm's Snapdragon 855, also on 7 nm and launched in December 2018, adopted a similar tri-cluster approach with one prime A76-based core at 2.84 GHz, three performance A76-based cores at 2.42 GHz, and four A55-based cores at 1.8 GHz under the Kryo 485 branding. This configuration provided a 45% CPU uplift over the Snapdragon 845, supporting 4K HDR video and enhanced on-device AI processing. It was integrated into numerous Android flagships, such as the Samsung Galaxy S10 series and OnePlus 7, driving widespread availability in global markets.

Samsung's Exynos 9820, introduced in February 2019 on an 8 nm process, blended custom Exynos M4 cores with A76 cores for its premium lineup, using two M4 cores at 2.73 GHz, two A76 cores at 2.2 GHz, and four A55 cores at 1.95 GHz. This hybrid design aimed for optimized multimedia and gaming performance, appearing in regional variants of the Galaxy S10 and Note 10 series, particularly in European and other international markets.

Subsequent iterations extended A76 usage to mid-range and 5G SoCs. For instance, the HiSilicon Kirin 990 (2019, 7 nm+ EUV) upgraded to two A76 cores at 2.86 GHz and two at 2.09 GHz alongside four A55 cores, incorporating an integrated modem; it drove Huawei's Mate 30 Pro and P40 series with superior ISP capabilities for photography. Qualcomm's Snapdragon 720G (2020, 8 nm) targeted affordable devices with two A76 cores at 2.3 GHz and six A55 cores at 1.8 GHz, featured in phones like the Realme 6 Pro and Redmi Note 9S. MediaTek's Helio G99, announced in May 2022 on a 6 nm process, features two A76 cores at 2.2 GHz and six A55 cores at 2.0 GHz with a Mali-G57 MC2 GPU, aimed at budget gaming smartphones; it powers devices such as the Poco M5 and Realme Narzo 50 series.

Beyond smartphones, the A76 found applications in embedded and development platforms. Rockchip's RK3588 (2022, 8 nm) includes four A76 cores at up to 2.4 GHz and four A55 cores, optimized for AI and multimedia with a 6 TOPS NPU and 8K video support; it powers single-board computers (SBCs) such as the Radxa Rock 5B, Orange Pi 5, and Banana Pi BPI-M7, used in edge computing, media players, and prototyping. The Broadcom BCM2712 SoC, used in the Raspberry Pi 5 released in October 2023 and fabricated on a 16 nm process, integrates four A76 cores at 2.4 GHz with a VideoCore VII GPU, targeting hobbyist, educational, and general-purpose computing applications.
Allwinner's A733, launched in late 2024 on a 12 nm process, combines two A76 cores at 2.0 GHz and six A55 cores at 1.8 GHz with an optional 3 TOPS NPU and a RISC-V E902 management core, supporting up to 16 GB of RAM for AI tasks in Android tablets and laptops such as the Teclast P50Ai. In programmable hardware, Intel's Agilex 5 D-Series FPGAs (2023) incorporate two A76 cores in their hard processor system (HPS) alongside two A55 cores, enabling customizable SoC designs for industrial and embedded applications.
SoC | Manufacturer | Core configuration | Process node | Launch year | Example devices/platforms
Kirin 980 | HiSilicon | 2×A76 @ 2.6 GHz + 2×A76 @ 1.92 GHz + 4×A55 | 7 nm | 2018 | Huawei Mate 20 Pro, Honor View 20
Snapdragon 855 | Qualcomm | 1×A76 @ 2.84 GHz + 3×A76 @ 2.42 GHz + 4×A55 | 7 nm | 2018 | Samsung Galaxy S10, OnePlus 7
Exynos 9820 | Samsung | 2×M4 @ 2.73 GHz + 2×A76 @ 2.2 GHz + 4×A55 | 8 nm | 2019 | Samsung Galaxy S10 (Exynos variant)
Kirin 990 | HiSilicon | 2×A76 @ 2.86 GHz + 2×A76 @ 2.09 GHz + 4×A55 | 7 nm+ | 2019 | Huawei Mate 30 Pro, P40 Pro
Snapdragon 720G | Qualcomm | 2×A76 @ 2.3 GHz + 6×A55 @ 1.8 GHz | 8 nm | 2020 | Realme 6 Pro, Xiaomi Redmi Note 9S
Helio G99 | MediaTek | 2×A76 @ 2.2 GHz + 6×A55 @ 2.0 GHz | 6 nm | 2022 | Xiaomi Poco M5, Realme Narzo 50
RK3588 | Rockchip | 4×A76 @ 2.4 GHz + 4×A55 | 8 nm | 2022 | Radxa Rock 5B, Orange Pi 5
BCM2712 | Broadcom | 4×A76 @ 2.4 GHz | 16 nm | 2023 | Raspberry Pi 5
A733 | Allwinner | 2×A76 @ 2.0 GHz + 6×A55 @ 1.8 GHz | 12 nm | 2024 | Teclast P50Ai
Agilex 5 HPS | Intel | 2×A76 + 2×A55 | N/A (FPGA) | 2023 | Agilex 5 D-Series FPGA development kits
