ARM Cortex-X1
from Wikipedia
ARM Cortex-X1
General information
Launched: 2020
Designed by: ARM Ltd.
Performance
Max. CPU clock rate: up to 3.0 GHz in phones, 3.3 GHz in tablets/laptops
Address width: 40-bit
Cache
L1 cache: 128 KiB (64 KiB I-cache with parity, 64 KiB D-cache) per core
L2 cache: 512–1024 KiB per core
L3 cache: 512 KiB – 8 MiB (optional)
Architecture and classification
Microarchitecture: ARM Cortex-X1
Instruction set: ARMv8-A (A64, plus A32 and T32 at EL0 only)
Physical specifications
Cores: 1–4 per cluster
Products, models, variants
Product code name: Hera
Variant: Cortex-X1C
History
Successor: ARM Cortex-X2

The ARM Cortex-X1 is a central processing unit implementing the ARMv8.2-A 64-bit instruction set designed by ARM Holdings' Austin design centre as part of ARM's Cortex-X Custom (CXC) program.[1][2]

Design


The Cortex-X1 design is based on the ARM Cortex-A78, but redesigned purely for performance rather than a balance of performance, power, and area (PPA).[1]

The Cortex-X1 is a 5-wide decode out-of-order superscalar design with a 3K macro-OP (MOP) cache. It can fetch 5 instructions or 8 MOPs per cycle, and rename and dispatch 8 MOPs and 16 μOPs per cycle. The out-of-order window size has been increased to 224 entries. The backend has 15 execution ports, a pipeline depth of 13 stages, and execution latencies of 10 stages. It also features 4x128b SIMD units.[3][4][5][6]

ARM claims the Cortex-X1 offers 30% faster integer performance and 100% faster machine learning performance than the ARM Cortex-A77.[3][4][5][6]

The Cortex-X1 supports ARM's DynamIQ technology and is expected to serve as the high-performance core when used in combination with ARM Cortex-A78 mid cores and ARM Cortex-A55 little cores.[1][2]
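
On a DynamIQ system like this, the OS scheduler normally chooses which core runs a thread, but latency-critical code can be pinned to the high-performance core explicitly. The following is a minimal Linux/C sketch, not taken from the cited sources; it assumes the Cortex-X1 prime core is exposed as logical CPU 7, which is common on 1+3+4 SoCs but ultimately SoC-specific.

    /* Pin the calling thread to an assumed Cortex-X1 "prime" core. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        int prime_cpu = 7;   /* assumption: index of the Cortex-X1 core on this SoC */
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(prime_cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = calling thread */
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to CPU %d\n", prime_cpu);
        /* ... run the latency-critical workload here ... */
        return 0;
    }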

Architecture changes in comparison with ARM Cortex-A78

  • Around 20% performance improvement (+30% from A77)[7]
    • 30% faster integer
    • 100% faster machine learning performance
  • Out-of-order window size has been increased to 224 entries (from 160 entries)
  • Up to 4x128b SIMD units (from 2x128b)
  • 15% more silicon area
  • 5-way decode (from 4-way)
  • 8 MOPs/cycle decoded cache bandwidth (from 6 MOPs/cycle)
  • 64 KB L1D + 64 KB L1I (from 32/64 KB L1)
  • Up to 1 MB/core L2 cache (from 512 KB/core max)
  • Up to 8 MB L3 cache (from 4 MB max)

Licensing


The Cortex-X1 is available as a SIP core to partners in ARM's Cortex-X Custom (CXC) program, and its design makes it suitable for integration with other SIP cores (e.g. GPU, display controller, DSP, image processor) into one die constituting a system on a chip (SoC).[1][2]

from Grokipedia
The ARM Cortex-X1 is a high-performance CPU core developed by Arm as part of its Cortex-X series, implementing the Armv8.2-A 64-bit architecture with support for extensions including v8.3-A (load-acquire/store-release), v8.4-A (dot product instructions), and v8.5-A (speculative store bypass safeguards and related traps). It features a superscalar, variable-length, out-of-order pipeline optimized for demanding workloads, integrated within Arm's DynamIQ Shared Unit (DSU) for flexible multi-core configurations, such as clusters combining one to four Cortex-X1 cores with companion cores like the Cortex-A78 and Cortex-A55. Announced on May 26, 2020, the Cortex-X1 was introduced through Arm's Cortex-X Custom Program to enable partners to tailor high-performance solutions for smartphones, laptops, and other devices, marking a shift toward modular "big" cores focused on peak performance rather than broad efficiency. Key architectural enhancements include a 25% increase in decode bandwidth (up to 5 instructions per cycle, or 8 macro-ops via the macro-op cache), 33% higher macro-op cache throughput, and doubled capacity in the SIMD engine for improved machine-learning and vector workloads. These upgrades deliver up to 30% higher peak performance compared to the Cortex-A77 and a 22% single-thread uplift over the Cortex-A78, while supporting larger caches: 64 KiB L1 instruction and 64 KiB L1 data caches per core, up to 1 MiB of private L2, and 8 MiB of shared L3 in a DSU cluster. The core also incorporates advanced features for reliability and profiling, such as the Reliability, Availability, and Serviceability (RAS) extension, the Statistical Profiling Extension (SPE), and a level-1 memory system backed by a private L2 cache, enabling its use in safety-critical and high-throughput applications. It has been licensed for integration into flagship system-on-chips (SoCs), powering devices that demand exceptional single-threaded performance in tasks like gaming and AI processing.

Introduction

Overview

The ARM Cortex-X1 is a high-performance 64-bit CPU core implementing the ARMv8.2-A instruction set, along with extensions such as ARMv8.3-A (LDAPR), ARMv8.4-A (dot product instructions), ARMv8.5-A (SSBS and speculation barriers), the Reliability, Availability, and Serviceability (RAS) extension, and the Statistical Profiling Extension (SPE). Designed at ARM's Austin design center and launched on May 26, 2020, it represents the first implementation in the Cortex-X Custom (CXC) program, which focuses on customizable high-performance cores for premium mobile devices, laptops, and other computing platforms requiring peak single-thread execution. The Cortex-X1 prioritizes maximum throughput for demanding workloads, delivering up to 30% higher peak performance compared to the Cortex-A77 in high-end applications. It also achieves up to 100% improvement in machine learning performance over the same predecessor, enabling advanced AI processing on mobile SoCs. These gains stem from architectural optimizations tailored for single-threaded peak performance rather than broad efficiency. Key specifications include support for clock speeds up to 3.0 GHz in smartphones and 3.3 GHz in tablets and laptops, 40-bit physical addressing, and configurations of 1 to 4 cores per DynamIQ cluster for flexible heterogeneous integration.

Development and announcement

In 2016, ARM introduced the "Built on Cortex" licensing model, which extended the standard Cortex architecture license to enable partners to create customized high-performance CPU designs while maintaining compatibility with the broader ARM ecosystem. This initiative laid the groundwork for more flexible IP offerings, allowing licensees to optimize for specific performance targets beyond the conventional Cortex-A series roadmap. The Cortex-X Custom program, announced in 2020, built directly on this foundation, focusing on synthesizable IP blocks that facilitate easier integration into system-on-chip designs for premium devices. Development of the Cortex-X1, the inaugural core under the Cortex-X Custom program, began around 2018–2019 at ARM's design centre in Austin, Texas, with the aim of pushing the boundaries of mobile CPU performance. The core was officially announced on May 26, 2020, alongside the Cortex-A78, as part of ARM's 2020 mobile IP portfolio. This reveal emphasized the Cortex-X1's role in delivering peak performance gains, targeting up to 30% improvement over the Cortex-A77 for demanding tasks. The primary motivation behind the Cortex-X1 was to address escalating requirements for smartphones and large-screen devices, including advanced AI processing, immersive gaming, and enhanced productivity applications, without disrupting Arm's ecosystem-wide efficiency and scalability. By providing a customizable, high-end option through the Cortex-X program, ARM enabled partners to tailor performance envelopes while leveraging proven ARMv8.2-A architecture compatibility. The design process prioritized synthesizable IP to streamline adoption, ensuring faster time-to-market for licensees. The Cortex-X1 was ready for partners in late 2020, with the first commercial products incorporating it appearing in early 2021, such as Samsung's Exynos 2100 SoC. This timeline reflected ARM's strategy to accelerate deployment of next-generation mobile compute solutions amid rising demands for digital immersion.

Microarchitecture

Pipeline and execution units

The ARM Cortex-X1 employs a superscalar, out-of-order design featuring a 5-wide decode stage, which expands to 8-wide operation through its macro-OP (MOP) cache mechanism. This front-end configuration enables the core to fetch up to 5 instructions or 8 MOPs per cycle from the instruction cache or MOP cache, respectively, optimizing front-end throughput. The core includes a 3K-entry MOP cache to store pre-decoded operations, reducing decode pressure and improving efficiency for frequently executed code sequences. Following decode, the rename and dispatch stages support up to 8 MOPs or 16 micro-operations (μOPs) per cycle, with limitations on specific instruction types to maintain balance across the backend. The reorder buffer provides 224 entries, enabling a significantly widened out-of-order window compared to prior generations, which allows for greater instruction-level parallelism. The execution pipeline spans 13 stages overall, with 15 issue ports distributing operations to specialized units for low-latency processing; mispredictions incur a 10-stage penalty to flush the pipeline. Integer execution leverages multiple arithmetic logic units (ALUs) for address generation, shifts, and basic operations, while floating-point and vector processing utilize dedicated units supporting scalar and vector instructions. Load/store operations are handled via multiple ports, with up to three loads and two stores dispatched per cycle to minimize memory access bottlenecks. For SIMD workloads, particularly machine-learning acceleration, the core incorporates four 128-bit NEON units, doubling the vector throughput relative to the Cortex-A78 and enabling efficient parallel processing of 128-bit wide vectors.
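
To show how software can exploit the four 128-bit SIMD pipelines described above, the following C sketch (illustrative, not from the cited sources) uses NEON intrinsics with four independent accumulators so that successive fused multiply-adds carry no dependency chain between them; it assumes the vector length n is a multiple of 16.

    /* Dot product with four independent 128-bit accumulators, so the
     * fused multiply-adds can issue in parallel across the SIMD pipes. */
    #include <arm_neon.h>

    float dot_f32(const float *a, const float *b, int n)
    {
        float32x4_t acc0 = vdupq_n_f32(0.0f), acc1 = vdupq_n_f32(0.0f);
        float32x4_t acc2 = vdupq_n_f32(0.0f), acc3 = vdupq_n_f32(0.0f);

        for (int i = 0; i < n; i += 16) {
            acc0 = vfmaq_f32(acc0, vld1q_f32(a + i),      vld1q_f32(b + i));
            acc1 = vfmaq_f32(acc1, vld1q_f32(a + i + 4),  vld1q_f32(b + i + 4));
            acc2 = vfmaq_f32(acc2, vld1q_f32(a + i + 8),  vld1q_f32(b + i + 8));
            acc3 = vfmaq_f32(acc3, vld1q_f32(a + i + 12), vld1q_f32(b + i + 12));
        }
        float32x4_t acc = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
        return vaddvq_f32(acc);   /* horizontal sum of the four lanes */
    }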

Cache hierarchy and memory subsystem

The ARM Cortex-X1 employs a multi-level cache hierarchy optimized for high-performance workloads. At the first level, each core features a 64 KiB instruction cache with optional parity protection for error detection and a 64 KiB data cache, providing a total of 128 KiB of L1 cache per core. The L1 caches are 4-way set associative with 64-byte line sizes, enabling efficient instruction fetch and data access while maintaining low latency. The second level consists of a private L2 cache per core, configurable between 512 KiB and 1024 KiB, which serves as a unified cache for both instructions and data. This L2 cache is 8-way set associative and includes bandwidth optimizations, such as doubled throughput compared to prior generations, to support sustained data flow in demanding applications. An optional shared L3 cache, up to 8 MiB per DynamIQ cluster, further extends the hierarchy by providing a larger pool for inter-core data sharing and reducing external memory accesses. The memory subsystem uses 40-bit physical addressing to access up to 1 TiB of memory space and supports interfaces compatible with DDR4, LPDDR4, and LPDDR5 DRAM types via the system's interconnect protocols. Bandwidth enhancements in the subsystem, including increased L1 data and L2 cache throughput, target high-throughput workloads by minimizing stalls during memory-intensive operations. Cache coherency is maintained through full ARMv8 compliance, incorporating a snoop filter within the DynamIQ Shared Unit (DSU) to handle multi-core snoop and invalidate operations efficiently. This ensures data consistency across cores without software intervention in cluster-based configurations. Power management in the memory subsystem integrates dynamic voltage and frequency scaling (DVFS), which adjusts clocks and voltages based on memory access patterns and cache miss rates to balance performance and energy efficiency. These mechanisms, tied to activity monitors, allow fine-grained control over power states during varying workloads. The expanded cache sizes in the Cortex-X1 contribute to improved single-threaded performance over earlier Cortex-A cores by reducing average memory latency.
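
On Linux-based devices this hierarchy can be observed through the standard sysfs cacheinfo interface. The C sketch below is illustrative rather than definitive: it assumes eight logical CPUs and at most four cache entries per CPU, and whether the shared L3 appears depends on the kernel and device tree of the particular SoC.

    /* Print cache level, size and sharing for each CPU from sysfs. */
    #include <stdio.h>

    int main(void)
    {
        char path[128], level[16], size[16], shared[64];

        for (int cpu = 0; cpu < 8; cpu++) {          /* assumed CPU count */
            for (int idx = 0; idx < 4; idx++) {      /* L1I, L1D, L2, L3 slots */
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%d/cache/index%d/level", cpu, idx);
                if (!(f = fopen(path, "r"))) continue;
                fscanf(f, "%15s", level); fclose(f);

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%d/cache/index%d/size", cpu, idx);
                if (!(f = fopen(path, "r"))) continue;
                fscanf(f, "%15s", size); fclose(f);

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%d/cache/index%d/shared_cpu_list", cpu, idx);
                if (!(f = fopen(path, "r"))) continue;
                fscanf(f, "%63s", shared); fclose(f);

                printf("cpu%d L%s %s shared with CPUs %s\n", cpu, level, size, shared);
            }
        }
        return 0;
    }

A private L1 or L2 lists a single CPU in shared_cpu_list, while the DSU's shared L3 lists every core in the cluster.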

Architectural enhancements

Innovations over prior cores

The ARM Cortex-X1 features an enlarged macro-OP (MOP) cache designed to alleviate decode bottlenecks in complex code paths by fusing multiple instructions into larger operations prior to caching, enabling the core to dispatch up to 8 MOPs per cycle compared to 6 in prior cores like the Cortex-A77. This enhancement doubles the MOP cache capacity to 3,000 entries, improving instruction throughput and reducing front-end pressure in workloads with intricate dependencies. To better support AI and machine learning tasks such as neural network inference, the Cortex-X1 expanded its SIMD capabilities by doubling the NEON execution pipelines to 4x128-bit units from 2x128-bit in the Cortex-A77, thereby increasing vector processing bandwidth for parallel computations. This upgrade facilitates higher throughput in floating-point and integer vector operations, contributing to up to 100% faster machine learning performance over previous generations. Branch prediction in the Cortex-X1 was enhanced with a Branch Target Buffer (BTB) expanded by 50% to 96 entries and integration of a branch predictor with extended history tables, improving accuracy for irregular patterns in real-world applications. These modifications reduce misprediction penalties by capturing longer branch histories, leading to more reliable speculation. Despite its emphasis on peak performance through wider execution resources, the Cortex-X1 incorporates power efficiency measures such as fine-grained clock gating across pipeline stages and multiple voltage domains to minimize active power in underutilized units. This balances the core's aggressive design with targeted energy savings for mobile use cases, though with higher overall power consumption than efficiency-focused cores. The core supports the ARMv8.2-A instruction set, including dot product instructions from the v8.4-A extension, which accelerate matrix multiplications essential for ML inference by enabling efficient accumulation of vector products in a single cycle. These extensions provide foundational enhancements for emerging workloads without requiring custom ISA modifications.
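
As a concrete illustration of the dot-product instructions mentioned above, the C sketch below accumulates signed 8-bit dot products with the vdotq_s32 NEON intrinsic, the pattern used in quantized ML inference; the build flags and the assumption that n is a multiple of 16 are illustrative.

    /* int8 dot product using the Armv8.2-A dot-product extension.
     * Build with, e.g.:  gcc -O2 -march=armv8.2-a+dotprod dot8.c */
    #include <arm_neon.h>
    #include <stdint.h>

    int32_t dot_s8(const int8_t *a, const int8_t *b, int n)
    {
        int32x4_t acc = vdupq_n_s32(0);
        for (int i = 0; i < n; i += 16)
            /* each vdotq_s32 folds 16 int8 products into 4 int32 accumulators */
            acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
        return vaddvq_s32(acc);   /* horizontal sum */
    }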

Differences from Cortex-A78

The ARM Cortex-X1 and Cortex-A78 both implement the ARMv8.2-A architecture, but the X1 incorporates several microarchitectural enhancements targeted at peak performance, contrasting with the A78's emphasis on balanced efficiency. A key difference lies in the front-end decode width, where the Cortex-X1 supports a 5-wide decode, compared to the 4-wide decode in the Cortex-A78, allowing the X1 to process more instructions in parallel and achieve higher instructions per cycle (IPC) in performance-critical workloads. In the execution backend, the Cortex-X1 features a larger out-of-order window of 224 entries, versus 160 entries in the Cortex-A78, which enables greater instruction-level parallelism by tracking and reordering more operations simultaneously. Additionally, the X1 doubles the SIMD throughput with four 128-bit units, compared to two in the A78, resulting in up to twice the machine-learning inference performance for vectorized tasks. These changes contribute to a 30% uplift in peak performance relative to prior designs, with the X1 delivering approximately 22% higher single-thread performance than the A78 under comparable conditions. Cache configurations also differ to support sustained high-throughput workloads in the X1, with mandatory 64 KB L1 instruction and data caches, and scalable L2 up to 1 MB per core, in contrast to the A78's flexible 32/64 KB L1 options and smaller balanced L2 sizing up to 512 KB. This scaling aids the X1 in maintaining throughput during prolonged compute-intensive operations. Overall, while the Cortex-X1 prioritizes peak throughput (up to 22% faster single-thread performance than the A78), it does so at the cost of higher power consumption, whereas the A78 optimizes for efficiency in sustained scenarios with lower area and energy use.

System integration

DynamIQ compatibility

The ARM Cortex-X1 is designed for integration within ARM's DynamIQ architecture, which utilizes the DynamIQ Shared Unit (DSU) to form flexible CPU clusters that support heterogeneous combinations of high-performance and efficiency cores. The DSU enables the Cortex-X1 to be mixed with Cortex-A78 performance cores and Cortex-A55 efficiency cores in big.LITTLE configurations, allowing system designers to tailor multi-core setups for optimal balance between peak performance and power efficiency. This compatibility extends the traditional big.LITTLE paradigm by permitting greater flexibility in core placement across clusters, facilitated by the DSU's management of shared resources and interfaces. Cluster configurations for the Cortex-X1 support up to four X1 cores per DSU-managed cluster, sharing a unified L3 cache configurable up to 8 MiB in size. This setup provides low-latency access to the shared L3 for coherence and bandwidth optimization within the cluster, while the DSU handles snoop control and filtering to maintain data consistency among cores. For larger systems involving multiple DynamIQ clusters, the CoreLink CMN-600 coherent mesh interconnect ensures scalable connectivity, supporting high-bandwidth communication in expansive big.LITTLE arrangements without compromising coherence. The benefits of this DynamIQ integration are particularly evident in tri-cluster designs, such as one comprising a single Cortex-X1 core for bursty workloads, three Cortex-A78 cores for sustained tasks, and four Cortex-A55 cores for background efficiency, delivering overall performance improvements while adapting to varying computational demands. Such configurations leverage the DSU's resource sharing to enhance system-level efficiency without requiring rigid homogeneous groupings. Security features like ARM TrustZone and pointer authentication are seamlessly integrated at the cluster level through the DSU, which provides secure monitoring, interrupt routing, and memory partitioning to isolate secure and non-secure worlds across mixed-core environments. TrustZone ensures hardware-enforced separation of execution environments, while pointer authentication, supported natively in the Cortex-X1's Armv8.3-A implementation, protects return addresses with cryptographic signing of pointers, with the DSU facilitating secure propagation of these mechanisms throughout the cluster.
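
In practice, return-address signing on a PAC-capable core like the Cortex-X1 is enabled by the compiler rather than written by hand. The C sketch below is illustrative: the build command shows the usual GCC/Clang flags, and the run-time check reads the Linux HWCAP bit for address authentication; the fallback HWCAP_PACA value is an assumption taken from the Linux uapi headers.

    /* Build with:  gcc -march=armv8.3-a -mbranch-protection=pac-ret pac.c
     * The compiler then signs/authenticates return addresses (PACIASP/AUTIASP)
     * in function prologues and epilogues. */
    #include <stdio.h>
    #include <sys/auxv.h>

    #ifndef HWCAP_PACA
    #define HWCAP_PACA (1UL << 30)   /* assumed value from asm/hwcap.h */
    #endif

    __attribute__((noinline))
    static long worker(long x)
    {
        /* With pac-ret, this frame's saved return address is signed on entry
         * and authenticated before the return, hindering simple ROP chains. */
        return x * 2 + 1;
    }

    int main(void)
    {
        unsigned long caps = getauxval(AT_HWCAP);
        printf("PAC address keys supported: %s\n", (caps & HWCAP_PACA) ? "yes" : "no");
        printf("worker(20) = %ld\n", worker(20));
        return 0;
    }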

Variants and configurations

The Cortex-X1 core has one primary derivative variant, the Cortex-X1C, announced in November 2021 and optimized for high-performance applications in laptops and desktops. This variant builds on the base Cortex-X1 microarchitecture while incorporating enhancements for scalability and security, including support for Pointer Authentication Codes (PAC) as defined in the Armv8.3-A and Armv8.6-A extensions, which mitigate common exploitation techniques such as return-oriented programming (ROP) by over 60% and jump-oriented programming (JOP) by over 50%. The Cortex-X1C enables configurations with up to eight high-performance cores in a single DynamIQ cluster, paired with an updated DynamIQ Shared Unit (DSU) that supports up to 8 MB of shared L3 cache, making it suitable for always-connected devices targeting multi-day battery life. Configuration options for the Cortex-X1 and its X1C variant emphasize flexibility within the DynamIQ framework, allowing scalable core counts from one to eight per cluster to balance performance and power efficiency. Partners can select optional shared L3 cache sizes up to 8 MB, with the base core featuring a private L2 cache of up to 1 MB, while the design supports advanced process nodes at 5 nm and below for improved density and efficiency, as demonstrated in implementations like the Samsung Exynos 2100 and Qualcomm Snapdragon 888. Clock speeds are tunable up to 3.3 GHz, particularly in laptop-oriented configurations like the X1C, to achieve peak single-threaded performance while staying within thermal limits. Power and thermal tuning parameters are provided during IP delivery to enable trade-offs between area, performance, and efficiency, influencing overall die area and manufacturing yield. For instance, adjustments to cache sizes and other configuration parameters allow licensees to prioritize either maximum throughput or reduced power consumption, with the X1C variant offering 22% higher performance than the comparable Cortex-A78C under similar thermal envelopes. No other major variants of the Cortex-X1 exist beyond the X1C, so implementations focus on these configurable aspects to suit diverse system requirements.

Commercial aspects

Licensing model

The ARM Cortex-X1 is offered under ARM's architectural license through the Cortex-X Custom (CXC) program, an extension that permits partners to make semi-custom modifications to the core design for specific performance optimizations while mandating retention of the Arm Cortex-X1 branding. This licensing framework builds on the 2016 "Built on Cortex" program, which introduced options for performance-oriented customizations beyond standard off-the-shelf cores. The pricing model consists of upfront licensing fees and per-unit royalties, with terms varying by agreement, scope of access, and production volume; rates are typically lower for high-volume mobile deployments than for low-volume applications. Availability began in 2020, with the core provided as synthesizable register-transfer level (RTL) code, often under non-disclosure agreements for qualified partners. Key restrictions stipulate that products incorporating the Cortex-X1 must use the official "Arm Cortex-X1" designation in marketing and documentation, and full redesigns of the core are not permitted without obtaining a more advanced architectural license. Within the DynamIQ ecosystem, this model facilitates configurable big.LITTLE cluster integrations.

Customization and availability

The ARM Cortex-X1 is delivered to licensees as synthesizable intellectual property (IP) in register-transfer level (RTL) format, including comprehensive simulation models and integration guides optimized for advanced manufacturing process nodes from leading foundries. This delivery mechanism enables partners to incorporate the core into custom system-on-chip (SoC) designs with relative ease, supporting rapid prototyping and verification workflows. Customization of the Cortex-X1 occurs at multiple levels through the Cortex-X Custom (CXC) program, which extends beyond standard parameterizable options, such as adjustable L2 cache sizes (up to 1 MiB) and clock domain configurations, to permit deeper microarchitectural modifications tailored to specific workload demands, such as enhanced branch prediction. These options allow partners to balance peak performance against power and area constraints while maintaining compatibility with the Armv8.2-A architecture. The CXC program facilitates this differentiation by providing access to Arm's design expertise for co-optimization, ensuring implementations meet unique application requirements without deviating from core reliability standards. Supporting tools for Cortex-X1 development include Arm Fast Models, which offer simulations for early software bring-up and validation prior to hardware availability, and the Arm Development Studio suite, featuring Streamline for performance analysis and debugging. These tools integrate with popular EDA environments from major partners, accelerating verification and optimization cycles. General availability of the Cortex-X1 IP followed its announcement on May 26, 2020, with initial tape-outs enabling commercial SoC shipments later that year; subsequent revisions have included optimizations for advanced nodes such as 4 nm and below in implementations as of 2022, to sustain relevance in high-performance mobile and edge applications. The support ecosystem encompasses reference designs for DynamIQ clusters, which demonstrate heterogeneous integration of the Cortex-X1 with cores like the Cortex-A78 or Cortex-A55, complete with interconnect configurations via the DynamIQ Shared Unit (DSU) for streamlined cluster-level deployment.

Adoption

System-on-chip implementations

The ARM Cortex-X1 core has been integrated into several flagship system-on-chip (SoC) designs as the high-performance "prime" core in heterogeneous big.LITTLE configurations, typically arranged in a 1+3+4 cluster setup to balance peak performance and efficiency. This places the single Cortex-X1 core at the highest clock speeds for demanding tasks, paired with mid-tier performance cores and efficiency cores for lighter workloads. Qualcomm's Snapdragon 888, announced in December 2020, features a custom Kryo 680 Prime core based on the Cortex-X1, clocked at up to 2.84 GHz, alongside three Cortex-A78 cores at 2.42 GHz and four Cortex-A55 cores at 1.8 GHz, with an Adreno 660 GPU for graphics processing. Samsung's Exynos 2100, unveiled in January 2021, incorporates a single Cortex-X1 core clocked at up to 2.91 GHz, combined with three Cortex-A78 cores at 2.81 GHz and four Cortex-A55 cores at 2.2 GHz, integrated with a Mali-G78 MP14 GPU; this SoC powers devices like the Galaxy S21 series. Google's Tensor G1, introduced in October 2021 for the Pixel 6 series, deviates slightly from the standard configuration by using two Cortex-X1 cores at 2.8 GHz, paired with two Cortex-A76 cores at 2.25 GHz and four Cortex-A55 cores at 1.8 GHz, alongside a Mali-G78 MP20 GPU and a custom tensor processing unit (TPU) for AI acceleration. Qualcomm's Snapdragon G3x Gen 1, announced in December 2021 for handheld gaming platforms, features a Kryo 680 Prime core based on the Cortex-X1 architecture clocked at up to 3.0 GHz, with three Cortex-A78 cores and four Cortex-A55 cores, paired with an Adreno 660 GPU optimized for gaming.
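
On these SoCs the Cortex-X1 appears as one (or, on Tensor, two) of the logical CPUs and can be identified from the MIDR part number that Linux reports in /proc/cpuinfo. The C sketch below assumes the commonly listed part number 0xd44 for the Cortex-X1; treat that value as an assumption and verify it against Arm's documentation for the SoC in question.

    /* List logical CPUs whose "CPU part" field matches the assumed Cortex-X1 ID. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) { perror("/proc/cpuinfo"); return 1; }

        char line[256];
        int cpu = -1;
        unsigned part;

        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "processor : %d", &cpu) == 1)
                continue;                              /* remember current CPU index */
            if (sscanf(line, "CPU part : 0x%x", &part) == 1 && part == 0xd44)
                printf("cpu%d reports part 0x%03x (assumed Cortex-X1)\n", cpu, part);
        }
        fclose(f);
        return 0;
    }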

End-user devices

The ARM Cortex-X1 core found its primary application in flagship smartphones launched between 2021 and 2022, powering system-on-chips (SoCs) from major vendors and enabling high-performance computing in the Android ecosystem. Notable examples include Google's Pixel 6 and Pixel 6 Pro, which utilized the custom Tensor G1 SoC featuring two Cortex-X1 cores clocked at up to 2.8 GHz for demanding tasks like AI processing and photography. Similarly, Samsung's Galaxy S21 series incorporated the Exynos 2100 SoC with a single Cortex-X1 core at 2.9 GHz in select regions, enhancing single-threaded performance for applications such as video editing and multitasking. Other devices powered by Qualcomm's Snapdragon 888 SoC (with one Cortex-X1 core at 2.84 GHz) brought this architecture to more affordable premium segments, broadening access to advanced mobile capabilities. Beyond smartphones, adoption of the Cortex-X1 in end-user devices like tablets and laptops remained limited, primarily through the power-optimized Cortex-X1C variant designed for such form factors. While Arm positioned the X1C for potential use in Windows on ARM laptops and tablets to deliver efficient sustained performance, actual implementations were sparse, with no major commercial releases identified as of 2025. This constrained footprint contrasted with the core's smartphone success, where it contributed to devices competing directly with Apple's A-series chips in raw processing power. The Cortex-X1 also saw limited adoption in handheld gaming devices, such as the Razer Edge released in January 2023, which uses the Snapdragon G3x Gen 1 SoC with a single Cortex-X1 core at up to 3.0 GHz to power demanding Android gaming titles. In the 2021-2022 Android market, the Cortex-X1 significantly elevated flagship device performance, particularly in single-threaded workloads that benefited from its roughly 30% performance uplift over prior Cortex-A77 cores, allowing smoother gaming and improved emulation of console titles. Real-world benchmarks, such as single-core scores exceeding 1,100 on Snapdragon 888 devices, underscored its edge in tasks like running console emulators at higher frame rates compared to predecessors. This helped Android flagships close the gap with Apple devices in CPU-intensive scenarios, driving market enthusiasm for Arm's performance-focused shift. However, adoption waned post-2022 as manufacturers transitioned to successors like the Cortex-X2 and X3 for even greater efficiency gains. As of 2025, the Cortex-X1 persists as a legacy component in older flagship smartphones still in use, such as the Pixel 6 series and Galaxy S21 models, as well as gaming handhelds, but no significant new device integrations have occurred since 2023, reflecting the rapid evolution toward newer architectures in mobile SoCs.
