ARM architecture family

ARM
Designer: Arm Holdings
Bits: 32-bit, 64-bit
Introduced: 1985; 40 years ago
Design: RISC
Type: Load–store
Branching: Condition code, compare and branch
Open: No; proprietary

ARM AArch64 (64/32-bit)
Introduced: 2011; 14 years ago
Version: ARMv8-R, ARMv8-A, ARMv8.1-A, ARMv8.2-A, ARMv8.3-A, ARMv8.4-A, ARMv8.5-A, ARMv8.6-A, ARMv8.7-A, ARMv8.8-A, ARMv8.9-A, ARMv9.0-A, ARMv9.1-A, ARMv9.2-A, ARMv9.3-A, ARMv9.4-A, ARMv9.5-A, ARMv9.6-A
Encoding: AArch64/A64 and AArch32/A32 use 32-bit instructions; AArch32/T32 (Thumb-2) uses mixed 16- and 32-bit instructions[1]
Endianness: Bi (little as default)
Extensions: SVE, SVE2, SME, AES, SM3, SM4, SHA, CRC32, RNDR, TME; all mandatory: Thumb-2, Neon, VFPv4-D16, VFPv4; obsolete: Thumb and Jazelle
Registers:
  General-purpose: 31 × 64-bit integer registers[1]
  Floating-point: 32 × 128-bit registers[1] for scalar 32- and 64-bit FP, SIMD FP or integer, or cryptography

ARM AArch32 (32-bit)
Version: ARMv9-R, ARMv9-M, ARMv8-R, ARMv8-M, ARMv7-A, ARMv7-R, ARMv7E-M, ARMv7-M
Encoding: 32-bit, except Thumb-2 extensions use mixed 16- and 32-bit instructions
Endianness: Bi (little as default)
Extensions: Thumb, Thumb-2, Neon, Jazelle, AES, SM3, SM4, SHA, CRC32, RNDR, DSP, Saturated, FPv4-SP, FPv5, Helium; obsolete since ARMv8: Thumb and Jazelle
Registers:
  General-purpose: 15 × 32-bit integer registers, including R14 (link register) but not R15 (PC)
  Floating-point: Up to 32 × 64-bit registers,[2] SIMD/floating-point (optional)

ARM 32-bit (legacy)
Version: ARMv6, ARMv5, ARMv4T, ARMv3, ARMv2
Encoding: 32-bit, except Thumb extension uses mixed 16- and 32-bit instructions
Endianness: Bi (little as default) in ARMv3 and above
Extensions: Thumb, Jazelle
Registers:
  General-purpose: 15 × 32-bit integer registers, including R14 (link register) but not R15 (PC; 26-bit addressing in older versions)
  Floating-point: None

ARM (stylised in lowercase as arm, formerly an acronym for Advanced RISC Machines and originally Acorn RISC Machine) is a family of RISC instruction set architectures for computer processors. Arm Holdings develops the instruction set architectures and licenses them to other companies, which build the physical devices that use them. It also designs and licenses cores that implement these instruction set architectures.

Due to their low cost, low power consumption, and low heat generation, ARM processors are useful for light, portable, battery-powered devices, including smartphones, laptops, and tablet computers, as well as embedded systems.[3][4][5] However, ARM processors are also used for desktops and servers, including Fugaku, the world's fastest supercomputer from 2020[6] to 2022. With over 230 billion ARM chips produced,[7][8] ARM has been the most widely used family of instruction set architectures since at least 2003, and its dominance increases every year.[9][4][10][11][12]

There have been several generations of the ARM design. The original ARM1 used a 32-bit internal structure but had a 26-bit address space that limited it to 64 MB of main memory. This limitation was removed in the ARMv3 series, which has a 32-bit address space, and several additional generations up to ARMv7 remained 32-bit. Released in 2011, the ARMv8-A architecture added support for a 64-bit address space and 64-bit arithmetic with its new 32-bit fixed-length instruction set.[13] Arm Holdings has also released a series of additional instruction sets for different roles: the "Thumb" extensions add both 32- and 16-bit instructions for improved code density, while Jazelle added instructions for directly handling Java bytecode. More recent changes include the addition of simultaneous multithreading (SMT) for improved performance or fault tolerance.[14]

History

BBC Micro

Acorn Computers' first widely successful design was the BBC Micro, introduced in December 1981. This was a relatively conventional machine based on the MOS Technology 6502 CPU but ran at roughly double the performance of competing designs like the Apple II due to its use of faster dynamic random-access memory (DRAM). Typical DRAM of the era ran at about 2 MHz; Acorn arranged a deal with Hitachi for a supply of faster 4 MHz parts.[15]

Machines of the era generally shared memory between the processor and the framebuffer, which allowed the processor to quickly update the contents of the screen without having to perform separate input/output (I/O). As the timing of the video display is exacting, the video hardware had to have priority access to that memory. Due to a quirk of the 6502's design, the CPU left the memory untouched for half of the time. Thus by running the CPU at 1 MHz, the video system could read data during those down times, taking up the total 2 MHz bandwidth of the RAM. In the BBC Micro, the use of 4 MHz RAM allowed the same technique to be used, but running at twice the speed. This allowed it to outperform any similar machine on the market.[16]

Acorn Business Computer

1981 was also the year that the IBM Personal Computer was introduced. Using the recently introduced Intel 8088, a 16-bit CPU compared to the 6502's 8-bit design, it offered higher overall performance. Its introduction changed the desktop computer market radically: what had been largely a hobby and gaming market emerging over the prior five years began to change to a must-have business tool where the earlier 8-bit designs simply could not compete. Even newer 32-bit designs were also coming to market, such as the Motorola 68000[17] and National Semiconductor NS32016.[18]

Acorn began considering how to compete in this market and produced a new paper design named the Acorn Business Computer. They set themselves the goal of producing a machine with ten times the performance of the BBC Micro, but at the same price.[19] This would outperform and underprice the PC. At the same time, the recent introduction of the Apple Lisa brought the graphical user interface (GUI) concept to a wider audience and suggested the future belonged to machines with a GUI.[20] The Lisa, however, cost $9,995, as it was packed with support chips, large amounts of memory, and a hard disk drive, all very expensive then.[21]

The engineers then began studying all of the CPU designs available. Their conclusion about the existing 16-bit designs was that they were a lot more expensive and were still "a bit crap",[22] offering only slightly higher performance than their BBC Micro design. They also almost always demanded a large number of support chips to operate even at that level, which drove up the cost of the computer as a whole. These systems would simply not hit the design goal.[22] They also considered the new 32-bit designs, but these cost even more and had the same issues with support chips.[23] According to Sophie Wilson, all the processors tested at that time performed about the same, with about a 4 Mbit/s bandwidth.[24][a]

Two key events led Acorn down the path to ARM. One was the publication of a series of reports from the University of California, Berkeley, which suggested that a simple chip design could nevertheless have extremely high performance, much higher than the latest 32-bit designs on the market.[25] The second was a visit by Steve Furber and Sophie Wilson to the Western Design Center, a company run by Bill Mensch and his sister Kathryn,[26] which had become the logical successor to the MOS team and was offering new versions like the WDC 65C02. The Acorn team saw high school students producing chip layouts on Apple II machines, which suggested that anyone could do it.[27][28] In contrast, a visit to another design firm working on a modern 32-bit CPU revealed a team with over a dozen members who were already on revision H of their design and yet it still contained bugs.[b] This cemented their late 1983 decision to begin their own CPU design, the Acorn RISC Machine.[29]

Design concepts

The original Berkeley RISC designs were in some sense teaching systems, not designed specifically for outright performance. To the RISC's basic register-heavy and load/store concepts, ARM added a number of the well-received design notes of the 6502. Primary among them was the ability to quickly service interrupts, which allowed the machines to offer reasonable input/output performance with no added external hardware. To offer interrupts with performance similar to the 6502's, the ARM design limited its physical address space to 64 MB of total addressable space, requiring 26 bits of address. As instructions were 4 bytes (32 bits) long, and required to be aligned on 4-byte boundaries, the lower 2 bits of an instruction address were always zero. This meant the program counter (PC) only needed to be 24 bits, allowing it to be stored along with eight bits of processor flags in a single 32-bit register. Thus, upon receiving an interrupt, the entire machine state could be saved in a single operation, whereas had the PC been a full 32-bit value, it would have required separate operations to store the PC and the status flags. This decision halved the interrupt overhead.[30]
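
The combined register layout described above can be sketched in C. This is an illustration of the idea rather than Arm's published definition; it assumes the classic 26-bit layout, with the flags and interrupt masks in the top six bits, the processor mode in the bottom two, and the word-aligned PC in bits 25–2.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the combined PC/PSR register (R15) of the 26-bit ARMs.
 * Assumed layout: bits 31-28 hold the N/Z/C/V flags, bits 27-26 the
 * I/F interrupt masks, bits 25-2 the word-aligned program counter,
 * and bits 1-0 the processor mode. */
#define PC_MASK  0x03FFFFFCu  /* bits 25-2: 24 bits of word address */
#define PSR_MASK 0xFC000003u  /* bits 31-26 and 1-0: flags and mode */

static uint32_t pack_r15(uint32_t pc, uint32_t psr_bits) {
    assert((pc & 0x3u) == 0);   /* instructions are word-aligned     */
    assert(pc < (1u << 26));    /* 26-bit (64 MB) address space      */
    return (psr_bits & PSR_MASK) | (pc & PC_MASK);
}

static uint32_t r15_pc(uint32_t r15)  { return r15 & PC_MASK;  }
static uint32_t r15_psr(uint32_t r15) { return r15 & PSR_MASK; }
```

Because the flags and the PC share one register, an interrupt handler can save the complete state with a single store of R15, and the PC is recovered afterwards with one mask operation.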

Another change, and among the most important in terms of practical real-world performance, was the modification of the instruction set to take advantage of page mode DRAM. Recently introduced, page mode allowed subsequent accesses of memory to run twice as fast if they were roughly in the same location, or "page", in the DRAM chip. Berkeley's design did not consider page mode and treated all memory equally. The ARM design added special vector-like memory access instructions, the "S-cycles", that could be used to fill or save multiple registers in a single page using page mode. This doubled memory performance when they could be used, and was especially important for graphics performance.[31]
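
The benefit of the S-cycles can be shown with a toy cost model. The cycle counts here are illustrative assumptions, not the real ARM bus timings: suppose the first access to a DRAM page (an N-cycle) costs two ticks and each subsequent sequential access within the page (an S-cycle) costs one.

```c
/* Toy timing model (illustrative numbers, not real ARM bus timings):
 * a non-sequential access costs 2 ticks (N-cycle); a sequential
 * access within the same DRAM page costs 1 tick (S-cycle). */
static unsigned cost_block_transfer(unsigned nregs) {
    /* one N-cycle to open the page, then S-cycles for the rest */
    return nregs ? 2u + (nregs - 1u) : 0u;
}

static unsigned cost_single_transfers(unsigned nregs) {
    /* each standalone load pays the full N-cycle price */
    return 2u * nregs;
}
```

Under these assumptions, filling eight registers with one block transfer takes 9 ticks instead of 16, approaching the doubling of memory bandwidth described above.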

The Berkeley RISC designs used register windows to reduce the number of register saves and restores performed in procedure calls; the ARM design did not adopt this.

Wilson developed the instruction set, writing a simulation of the processor in BBC BASIC that ran on a BBC Micro with a second 6502 processor.[32][33] This convinced Acorn engineers they were on the right track. Wilson approached Acorn's CEO, Hermann Hauser, and requested more resources. Hauser gave his approval and assembled a small team to design the actual processor based on Wilson's instruction set architecture.[34] The official Acorn RISC Machine project started in October 1983.

ARM1

ARM1 2nd processor for the BBC Micro

Acorn chose VLSI Technology as the "silicon partner", as they were a source of ROMs and custom chips for Acorn. Acorn provided the design and VLSI provided the layout and production. The first samples of ARM silicon worked properly when first received and tested on 26 April 1985.[3] Known as ARM1, these versions ran at 6 MHz.[35]

The first ARM application was as a second processor for the BBC Micro, where it helped in developing simulation software to finish development of the support chips (VIDC, IOC, MEMC), and sped up the CAD software used in ARM2 development. Wilson subsequently rewrote BBC BASIC in ARM assembly language. The in-depth knowledge gained from designing the instruction set enabled the code to be very dense, making ARM BBC BASIC an extremely good test for any ARM emulator.

ARM Evaluation Systems, featuring ARM1 CPUs and supplied as second processors for BBC Micro and Master machines, were made available from July 1986[36] under the Acorn OEM Products brand to developers and researchers.[37]

The A500 Second Processor, another ARM1-based second processor for the BBC Micro and Master, featured the ARM support chipset (VIDC, IOC, MEMC), was capable of producing video output,[38] and could operate nearly independently of the host BBC Micro.

ARM2

The result of the simulations on the ARM1 boards led to the late 1986 introduction of the ARM2 design running at 8 MHz, and the early 1987 speed-bumped version at 10 to 12 MHz.[c] A significant change in the underlying architecture was the addition of a Booth multiplier, whereas formerly multiplication had to be carried out in software.[40] Further, a new Fast Interrupt reQuest mode, FIQ for short, allowed registers 8 to 14 to be replaced as part of the interrupt itself. This meant FIQ requests did not have to save out their registers, further speeding interrupts.[41]
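
The effect of FIQ banking can be sketched as follows. This is a simplified model for illustration only: real ARMs also bank R13–R14 in the other exception modes, while this sketch shows just the FIQ bank for R8–R14.

```c
#include <stdint.h>

/* Simplified model of FIQ register banking: R8-R14 have a shadow
 * copy selected by the current mode, so a fast-interrupt handler can
 * use them immediately without first saving the interrupted
 * program's registers to memory. */
typedef struct {
    uint32_t lo[8];     /* R0-R7: shared by all modes */
    uint32_t hi[7];     /* R8-R14: normal bank        */
    uint32_t hi_fiq[7]; /* R8-R14: FIQ shadow bank    */
    int in_fiq;         /* nonzero while in FIQ mode  */
} regfile;

/* Return a pointer to register n as seen in the current mode. */
static uint32_t *reg(regfile *rf, int n) {
    if (n < 8) return &rf->lo[n];
    return rf->in_fiq ? &rf->hi_fiq[n - 8] : &rf->hi[n - 8];
}
```

Writing R10 while in FIQ mode leaves the normal bank's R10 untouched, which is exactly what removes the save-and-restore overhead on interrupt entry.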

The first use of the ARM2 was in internal Acorn A500 development machines,[42] and in the Acorn Archimedes personal computer models A305, A310, and A440, launched on 6 June 1987.

According to the Dhrystone benchmark, the ARM2 was roughly seven times the performance of a typical 7 MHz 68000-based system like the Amiga or Macintosh SE. It was twice as fast as an Intel 80386 running at 16 MHz, and about the same speed as a multi-processor VAX-11/784 superminicomputer. The only systems that beat it were the Sun SPARC and MIPS R2000 RISC-based workstations.[43] Further, as the CPU was designed for high-speed I/O, it dispensed with many of the support chips seen in these machines; notably, it lacked any dedicated direct memory access (DMA) controller which was often found on workstations. The graphics system was also simplified based on the same set of underlying assumptions about memory and timing. The result was a dramatically simplified design, offering performance on par with expensive workstations but at a price point similar to contemporary desktops.[43]

The ARM2 featured a 32-bit data bus, a 26-bit address space, and 27 32-bit registers, of which 16 are accessible at any one time (including the PC).[44] The ARM2 had a transistor count of just 30,000,[45] compared to around 68,000 in Motorola's six-year-older 68000. Much of this simplicity came from the lack of microcode, which accounts for about one-quarter to one-third of the 68000's transistors, and from the lack of a cache, as in most CPUs of the day. This simplicity gave the ARM2 low power consumption and simpler thermal packaging, since fewer transistors were powered. Nevertheless, the ARM2 offered better performance than the contemporary 1987 IBM PS/2 Model 50, which initially used an Intel 80286 delivering 1.8 MIPS at 10 MHz, and later in 1987 the PS/2 70, whose Intel 386 DX at 16 MHz delivered 2 MIPS.[46][47]

A successor, ARM3, was produced with a 4 KB cache, which further improved performance.[48] The address bus was extended to 32 bits in the ARM6, but program code still had to lie within the first 64 MB of memory in 26-bit compatibility mode, due to the reserved bits for the status flags.[49]

Advanced RISC Machines Ltd. – ARM6

Microprocessor-based system on a chip
Die shot of an ARM610 microprocessor

In the late 1980s, Apple Computer and VLSI Technology started working with Acorn on newer versions of the ARM core. In 1990, Acorn spun off the design team into a new company named Advanced RISC Machines Ltd.,[50][51][52] which became ARM Ltd. when its parent company, Arm Holdings plc, floated on the London Stock Exchange and Nasdaq in 1998.[53] The new Apple–ARM work would eventually evolve into the ARM6, first released in early 1992. Apple used the ARM6-based ARM610 as the basis for their Apple Newton PDA.

Early licensees

In 1994, Acorn used the ARM610 as the main central processing unit (CPU) in their RiscPC computers. DEC licensed the ARMv4 architecture and produced the StrongARM.[54] At 233 MHz, this CPU drew only one watt (newer versions draw far less). This work was later passed to Intel as part of a lawsuit settlement, and Intel took the opportunity to supplement their i960 line with the StrongARM. Intel later developed its own high performance implementation named XScale, which it has since sold to Marvell. Transistor count of the ARM core remained essentially the same throughout these changes; ARM2 had 30,000 transistors,[55] while ARM6 grew only to 35,000.[56]

Market share

In 2005, about 98% of all mobile phones sold used at least one ARM processor.[57] In 2010, producers of chips based on ARM architectures reported shipments of 6.1 billion ARM-based processors, representing 95% of smartphones, 35% of digital televisions and set-top boxes, and 10% of mobile computers. In 2011, the 32-bit ARM architecture was the most widely used architecture in mobile devices and the most popular 32-bit one in embedded systems.[58] In 2013, 10 billion were produced[59] and "ARM-based chips are found in nearly 60 percent of the world's mobile devices".[60]

Licensing

Die shot of a STM32­F103VGT6 ARM Cortex-M3 microcontroller with 1 MB flash memory by STMicroelectronics

Core licence

Arm Holdings's primary business is selling IP cores, which licensees use to create microcontrollers (MCUs), CPUs, and systems-on-chips based on those cores. The original design manufacturer combines the ARM core with other parts to produce a complete device, typically one that can be built in existing semiconductor fabrication plants (fabs) at low cost while still delivering substantial performance. The most successful implementation has been the ARM7TDMI, with hundreds of millions sold. Atmel was an early design centre for ARM7TDMI-based embedded systems.

The ARM architectures used in smartphones, PDAs and other mobile devices range from ARMv5 to ARMv8-A.

In 2009, some manufacturers introduced netbooks based on ARM architecture CPUs, in direct competition with netbooks based on Intel Atom.[61]

Arm Holdings offers a variety of licensing terms, varying in cost and deliverables. Arm Holdings provides to all licensees an integratable hardware description of the ARM core, as well as a complete software development toolset (compiler, debugger, software development kit) and the right to sell manufactured silicon containing the ARM CPU.

SoC packages integrating ARM's core designs include Nvidia Tegra's first three generations, CSR plc's Quatro family, ST-Ericsson's Nova and NovaThor, Silicon Labs's Precision32 MCU, Texas Instruments's OMAP products, Samsung's Hummingbird and Exynos products, Apple's A4, A5, and A5X, and NXP's i.MX.

Fabless licensees, who wish to integrate an ARM core into their own chip design, are usually only interested in acquiring a ready-to-manufacture verified semiconductor intellectual property core. For these customers, Arm Holdings delivers a gate netlist description of the chosen ARM core, along with an abstracted simulation model and test programs to aid design integration and verification. More ambitious customers, including integrated device manufacturers (IDM) and foundry operators, choose to acquire the processor IP in synthesizable RTL (Verilog) form. With the synthesizable RTL, the customer has the ability to perform architectural level optimisations and extensions. This allows the designer to achieve exotic design goals not otherwise possible with an unmodified netlist (high clock speed, very low power consumption, instruction set extensions, etc.). While Arm Holdings does not grant the licensee the right to resell the ARM architecture itself, licensees may freely sell manufactured products such as chip devices, evaluation boards and complete systems. Merchant foundries can be a special case; not only are they allowed to sell finished silicon containing ARM cores, they generally hold the right to re-manufacture ARM cores for other customers.

Arm Holdings prices its IP based on perceived value. Lower performing ARM cores typically have lower licence costs than higher performing cores. In implementation terms, a synthesisable core costs more than a hard macro (blackbox) core. Complicating price matters, a merchant foundry that holds an ARM licence, such as Samsung or Fujitsu, can offer fab customers reduced licensing costs. In exchange for acquiring the ARM core through the foundry's in-house design services, the customer can reduce or eliminate payment of ARM's upfront licence fee.

Compared to dedicated semiconductor foundries (such as TSMC and UMC) without in-house design services, Fujitsu/Samsung charge two- to three-times more per manufactured wafer.[citation needed] For low to mid volume applications, a design service foundry offers lower overall pricing (through subsidisation of the licence fee). For high volume mass-produced parts, the long term cost reduction achievable through lower wafer pricing reduces the impact of ARM's NRE (non-recurring engineering) costs, making the dedicated foundry a better choice.

Companies that have developed chips with cores designed by Arm include Amazon.com's Annapurna Labs subsidiary,[62] Analog Devices, Apple, AppliedMicro (now: MACOM Technology Solutions[63]), Atmel, Broadcom, Cavium, Cypress Semiconductor, Freescale Semiconductor (now NXP Semiconductors), Huawei, Intel, Maxim Integrated, Nvidia, NXP, Qualcomm, Renesas, Samsung Electronics, ST Microelectronics, Texas Instruments, and Xilinx.

Built on ARM Cortex Technology licence

In February 2016, ARM announced the Built on ARM Cortex Technology licence, often shortened to Built on Cortex (BoC) licence. This licence allows companies to partner with ARM and make modifications to ARM Cortex designs. These design modifications will not be shared with other companies. These semi-custom core designs also have brand freedom, for example Kryo 280.

Companies that are current licensees of Built on ARM Cortex Technology include Qualcomm.[64]

Architectural licence

Companies can also obtain an ARM architectural licence for designing their own CPU cores using the ARM instruction sets. These cores must comply fully with the ARM architecture. Companies that have designed cores that implement an ARM architecture include Apple, AppliedMicro (now: Ampere Computing), Broadcom, Cavium (now: Marvell), Digital Equipment Corporation, Intel, Nvidia, Qualcomm, Samsung Electronics, Fujitsu, and NUVIA Inc. (acquired by Qualcomm in 2021).

ARM Flexible Access

On 16 July 2019, ARM announced ARM Flexible Access, which provides unlimited access to the included ARM intellectual property (IP) for development. Per-product licence fees are required once a customer reaches foundry tapeout or prototyping.[65][66]

75% of the IP that ARM released over the preceding two years is included in ARM Flexible Access. As of October 2019:

  • CPUs: Cortex-A5, Cortex-A7, Cortex-A32, Cortex-A34, Cortex-A35, Cortex-A53, Cortex-R5, Cortex-R8, Cortex-R52, Cortex-M0, Cortex-M0+, Cortex-M3, Cortex-M4, Cortex-M7, Cortex-M23, Cortex-M33
  • GPUs: Mali-G52, Mali-G31. Includes Mali Driver Development Kits (DDK).
  • Interconnect: CoreLink NIC-400, CoreLink NIC-450, CoreLink CCI-400, CoreLink CCI-500, CoreLink CCI-550, ADB-400 AMBA, XHB-400 AXI-AHB
  • System Controllers: CoreLink GIC-400, CoreLink GIC-500, PL192 VIC, BP141 TrustZone Memory Wrapper, CoreLink TZC-400, CoreLink L2C-310, CoreLink MMU-500, BP140 Memory Interface
  • Security IP: CryptoCell-312, CryptoCell-712, TrustZone True Random Number Generator
  • Peripheral Controllers: PL011 UART, PL022 SPI, PL031 RTC
  • Debug & Trace: CoreSight SoC-400, CoreSight SDC-600, CoreSight STM-500, CoreSight System Trace Macrocell, CoreSight Trace Memory Controller
  • Design Kits: Corstone-101, Corstone-201
  • Physical IP: Artisan PIK for Cortex-M33 TSMC 22ULL including memory compilers, logic libraries, GPIOs and documentation
  • Tools & Materials: Socrates IP Tooling, ARM Design Studio, Virtual System Models
  • Support: Standard ARM Technical support, ARM online training, maintenance updates, credits toward onsite training and design reviews

Cores

Architecture | Bit width | Arm Ltd. cores | Third-party cores | Profile
ARMv1 | 32 | ARM1 | | Classic
ARMv2 | 32 | ARM2, ARM250, ARM3 | Amber, STORM Open Soft Core[67] | Classic
ARMv3 | 32 | ARM6, ARM7 | | Classic
ARMv4 | 32 | ARM8 | StrongARM, FA526, ZAP Open Source Processor Core | Classic
ARMv4T | 32 | ARM7TDMI, ARM9TDMI, SecurCore SC100 | | Classic
ARMv5TE | 32 | ARM7EJ, ARM9E, ARM10E | XScale, FA626TE, Feroceon, PJ1/Mohawk | Classic
ARMv6 | 32 | ARM11 | | Classic
ARMv6-M | 32 | ARM Cortex-M0, ARM Cortex-M0+, ARM Cortex-M1, SecurCore SC000 | | Microcontroller
ARMv7-M | 32 | ARM Cortex-M3, SecurCore SC300 | Apple M7 motion coprocessor | Microcontroller
ARMv7E-M | 32 | ARM Cortex-M4, ARM Cortex-M7 | | Microcontroller
ARMv8-M | 32 | ARM Cortex-M23,[69] ARM Cortex-M33[70] | | Microcontroller
ARMv8.1-M | 32 | ARM Cortex-M55, ARM Cortex-M85 | | Microcontroller
ARMv7-R | 32 | ARM Cortex-R4, ARM Cortex-R5, ARM Cortex-R7, ARM Cortex-R8 | | Real-time
ARMv8-R | 32 | ARM Cortex-R52 | | Real-time
ARMv8-R | 64 | ARM Cortex-R82 | | Real-time
ARMv7-A | 32 | ARM Cortex-A5, ARM Cortex-A7, ARM Cortex-A8, ARM Cortex-A9, ARM Cortex-A12, ARM Cortex-A15, ARM Cortex-A17 | Qualcomm Scorpion/Krait, PJ4/Sheeva, Apple Swift (A6, A6X) | Application
ARMv8-A | 32 | ARM Cortex-A32[76] | | Application
ARMv8-A | 64/32 | ARM Cortex-A35,[77] ARM Cortex-A53, ARM Cortex-A57,[78] ARM Cortex-A72,[79] ARM Cortex-A73[80] | X-Gene, Nvidia Denver 1/2, Cavium ThunderX, AMD K12, Apple Cyclone (A7)/Typhoon (A8, A8X)/Twister (A9, A9X)/Hurricane+Zephyr (A10, A10X), Qualcomm Kryo, Samsung M1/M2 ("Mongoose")/M3 ("Meerkat") | Application
ARMv8-A | 64 | ARM Cortex-A34[86] | | Application
ARMv8.1-A | 64/32 | | Cavium ThunderX2 | Application
ARMv8.2-A | 64/32 | ARM Cortex-A55,[88] ARM Cortex-A75,[89] ARM Cortex-A76,[90] ARM Cortex-A77, ARM Cortex-A78, ARM Cortex-X1, ARM Neoverse N1 | Nvidia Carmel, Samsung M4 ("Cheetah"), Fujitsu A64FX (ARMv8 SVE 512-bit) | Application
ARMv8.2-A | 64 | ARM Cortex-A65, ARM Neoverse E1 with simultaneous multithreading (SMT), ARM Cortex-A65AE[94] (also having e.g. ARMv8.4 Dot Product; made for safety-critical tasks such as advanced driver-assistance systems (ADAS)) | Apple Monsoon+Mistral (A11) (September 2017) | Application
ARMv8.3-A | 64/32 | | | Application
ARMv8.3-A | 64 | | Apple Vortex+Tempest (A12, A12X, A12Z), Marvell ThunderX3 (v8.3+)[95] | Application
ARMv8.4-A | 64/32 | | | Application
ARMv8.4-A | 64 | ARM Neoverse V1 | Apple Lightning+Thunder (A13), Apple Firestorm+Icestorm (A14, M1) | Application
ARMv8.5-A | 64/32 | | | Application
ARMv8.5-A | 64 | | | Application
ARMv8.6-A | 64 | | Apple Avalanche+Blizzard (A15, M2), Apple Everest+Sawtooth (A16),[96] Apple Coll (A17), Apple Ibiza/Lobos/Palma (M3) | Application
ARMv8.7-A | 64 | | | Application
ARMv8.8-A | 64 | | | Application
ARMv8.9-A | 64 | | | Application
ARMv9.0-A | 64 | ARM Cortex-A510, ARM Cortex-A710, ARM Cortex-A715, ARM Cortex-X2, ARM Cortex-X3, ARM Neoverse E2, ARM Neoverse N2, ARM Neoverse V2 | | Application
ARMv9.1-A | 64 | | | Application
ARMv9.2-A | 64 | ARM Cortex-A520, ARM Cortex-A720, ARM Cortex-X4, ARM Neoverse V3,[100] ARM Cortex-X925,[101] ARM Cortex-A320[102] | Apple Donan/BravaChop/Brava (Apple M4),[103] Apple Tupai/Tahiti (A18) | Application
ARMv9.3-A | 64 | TBA | | Application
ARMv9.4-A | 64 | TBA | | Application
ARMv9.5-A | 64 | TBA | | Application
ARMv9.6-A | 64 | TBA | | Application
  1. ^ a b Although most datapaths and CPU registers in the early ARM processors were 32-bit, addressable memory was limited to 26 bits; the upper bits of the program counter register were then used for status flags.
  2. ^ a b c ARMv3 included a compatibility mode to support the 26-bit addresses of earlier versions of the architecture. This compatibility mode was optional in ARMv4 and was removed entirely in ARMv5.

Arm provides a list of vendors who implement ARM cores in their designs (application-specific standard products (ASSP), microprocessors and microcontrollers).[108]

Example applications of ARM cores

Tronsmart MK908, a Rockchip-based quad-core Android "mini PC", with a microSD card next to it for a size comparison

ARM cores are used in a number of products, particularly PDAs and smartphones. Some computing examples are Microsoft's first-generation Surface, Surface 2, and Pocket PC devices (following 2002), Apple's iPads, Asus's Eee Pad Transformer tablet computers, and several Chromebook laptops. Others include Apple's iPhone smartphones and iPod portable media players, Canon PowerShot digital cameras, the Nintendo Switch hybrid console, the Wii security processor, the 3DS handheld game console, and TomTom turn-by-turn navigation systems.

In 2005, Arm took part in the development of Manchester University's computer SpiNNaker, which used ARM cores to simulate the human brain.[109]

ARM chips are also used in Raspberry Pi, BeagleBoard, BeagleBone, PandaBoard, and other single-board computers, because they are very small, inexpensive, and consume very little power.

32-bit architecture

An ARMv7 was used to power older versions of the popular Raspberry Pi single-board computers like this Raspberry Pi 2 from 2015.
An ARMv7 is also used to power the CuBox family of single-board computers.

The 32-bit ARM architecture (ARM32), such as ARMv7-A (implementing AArch32; see section on Armv8-A for more on it), was the most widely used architecture in mobile devices as of 2011.[58]

Since 1995, various versions of the ARM Architecture Reference Manual (see § External links) have been the primary source of documentation on the ARM processor architecture and instruction set, distinguishing interfaces that all ARM processors are required to support (such as instruction semantics) from implementation details that may vary. The architecture has evolved over time, and version seven of the architecture, ARMv7, defines three architecture "profiles":

  • A-profile, the "Application" profile, implemented by 32-bit cores in the Cortex-A series and by some non-ARM cores
  • R-profile, the "Real-time" profile, implemented by cores in the Cortex-R series
  • M-profile, the "Microcontroller" profile, implemented by most cores in the Cortex-M series

Although the architecture profiles were first defined for ARMv7, ARM subsequently defined the ARMv6-M architecture (used by the Cortex M0/M0+/M1) as a subset of the ARMv7-M profile with fewer instructions.

Architecture versions

ARMv1: 26-bit addressing; obsolete as of June 2000[110]
ARMv2: Multiply and multiply-accumulate instructions; coprocessor support; obsolete as of June 2000[110]
ARMv2a: Atomic load-and-store instructions; obsolete as of June 2000[110]
ARMv3: 32-bit addressing;[110] obsolete as of July 2005[111]
ARMv3G: No 26-bit addressing support;[110] obsolete as of July 2005[111]
ARMv3M: Long and signed multiplies;[110] obsolete as of July 2005[111]
ARMv4: Halfword load and store instructions; sign-extending byte and halfword load instructions; 26-bit addressing support removed[110]
ARMv4xM: ARMv4, but without long multiply;[110] obsolete as of July 2005[111]
ARMv4T: ARMv4 plus version 1 of the Thumb instruction set[110]
ARMv4TxM: ARMv4T, but without long multiply;[110] obsolete as of July 2005[111]
ARMv5: Count leading zeros instruction;[110] obsolete as of July 2005[111]
ARMv5xM: ARMv5, but without long multiply;[110] obsolete as of July 2005[111]
ARMv5T: ARMv5 plus version 2 of Thumb[110]
ARMv5TxM: ARMv5T, but without long multiply;[110] obsolete as of July 2005[111]
ARMv5TE: ARMv5T plus enhanced DSP instructions[110]
ARMv5TExP: ARMv5TE, but without the LDRD, MCRR, MRRC, PLD, and STRD enhanced DSP instructions[110]
ARMv5TEJ: ARMv5TE plus Jazelle[111]
ARMv6: Full ARMv5TEJ; byte reversal instructions; exclusive-access load and store instructions; byte and halfword sign-extend and zero-extend instructions; SIMD media instructions; unaligned access support[111]
ARMv6K: ARMv6 plus instructions to support multiprocessor systems[112]
ARMv6T2: ARMv6 plus the Thumb-2 instruction set[112]
ARMv7-A, ARMv7-R: Optional signed and unsigned divide; memory and synchronization barrier instructions; preload hint instruction[112]
ARMv7-M: Thumb-2 only[113]
ARMv8: Introduces two execution states, AArch32 and AArch64. AArch32 supports the 32-bit ARM instruction set (A32) and the Thumb-2 instruction set (T32); AArch64 supports a new instruction set with 32 64-bit registers (A64).
ARMv8-A AArch32, ARMv8-R AArch32: Load-acquire and store-release instructions, crypto instructions, data barrier instruction extensions, Send Event Locally instruction[114]
ARMv8-M: Variant of Thumb-2 only[115]

CPU modes

Except in the M-profile, the 32-bit ARM architecture specifies several CPU modes, depending on the implemented architecture features. At any moment in time, the CPU can be in only one mode, but it can switch modes due to external events (interrupts) or programmatically.[116]

  • User mode: The only non-privileged mode.
  • FIQ mode: A privileged mode that is entered whenever the processor accepts a fast interrupt request.
  • IRQ mode: A privileged mode that is entered whenever the processor accepts an interrupt.
  • Supervisor (svc) mode: A privileged mode entered whenever the CPU is reset or when an SVC instruction is executed.
  • Abort mode: A privileged mode that is entered whenever a prefetch abort or data abort exception occurs.
  • Undefined mode: A privileged mode that is entered whenever an undefined instruction exception occurs.
  • System mode (ARMv4 and above): The only privileged mode that is not entered by an exception. It can only be entered by executing an instruction that explicitly writes to the mode bits of the Current Program Status Register (CPSR) from another privileged mode (not from user mode).
  • Monitor mode (ARMv6 and ARMv7 Security Extensions, ARMv8 EL3): A monitor mode is introduced to support TrustZone extension in ARM cores.
  • Hyp mode (ARMv7 Virtualization Extensions, ARMv8 EL2): A hypervisor mode that supports Popek and Goldberg virtualization requirements for the non-secure operation of the CPU.[117][118]
  • Thread mode (ARMv6-M, ARMv7-M, ARMv8-M): A mode that can be specified as either privileged or unprivileged. Whether the Main Stack Pointer (MSP) or Process Stack Pointer (PSP) is used can also be specified in the CONTROL register with privileged access. This mode is designed for user tasks in an RTOS environment, but it is also typically used for the main super-loop in bare-metal programs.
  • Handler mode (ARMv6-M, ARMv7-M, ARMv8-M): A mode dedicated to exception handling (except RESET, which is handled in Thread mode). Handler mode always uses the MSP and runs at the privileged level.
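
The modes listed above are selected by the M field, bits 4–0 of the CPSR. As a hedged sketch (the enum and helper below use illustrative names, not ARM-defined C identifiers; the encodings themselves follow the ARM Architecture Reference Manual):

```c
#include <stdint.h>

/* CPSR M[4:0] mode encodings for the 32-bit ARM architecture.
   Monitor and Hyp exist only when the Security and Virtualization
   Extensions, respectively, are implemented. */
enum arm_mode {
    ARM_MODE_USR = 0x10,  /* User       */
    ARM_MODE_FIQ = 0x11,  /* FIQ        */
    ARM_MODE_IRQ = 0x12,  /* IRQ        */
    ARM_MODE_SVC = 0x13,  /* Supervisor */
    ARM_MODE_MON = 0x16,  /* Monitor    */
    ARM_MODE_ABT = 0x17,  /* Abort      */
    ARM_MODE_HYP = 0x1A,  /* Hyp        */
    ARM_MODE_UND = 0x1B,  /* Undefined  */
    ARM_MODE_SYS = 0x1F   /* System     */
};

/* Extract the mode field from a saved CPSR image. */
static inline unsigned cpsr_mode(uint32_t cpsr) { return cpsr & 0x1Fu; }
```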

Instruction set

The original (and subsequent) ARM implementations were hardwired without microcode, like the much simpler 8-bit 6502 processor used in prior Acorn microcomputers.

The 32-bit ARM architecture (and the 64-bit architecture for the most part) includes the following RISC features:

  • Load–store architecture.
  • No support for unaligned memory accesses in the original version of the architecture. ARMv6 and later, except some microcontroller versions, support unaligned accesses for half-word and single-word load/store instructions with some limitations, such as no guaranteed atomicity.[119][120]
  • Uniform 16 × 32-bit register file (including the program counter, stack pointer and the link register).
  • Fixed instruction width of 32 bits to ease decoding and pipelining, at the cost of decreased code density. Later, the Thumb instruction set added 16-bit instructions and increased code density.
  • Mostly single clock-cycle execution.

To compensate for the simpler design, compared with processors like the Intel 80286 and Motorola 68020, some additional design features were used:

  • Conditional execution of most instructions reduces branch overhead and compensates for the lack of a branch predictor in early chips.
  • Arithmetic instructions alter condition codes only when desired.
  • 32-bit barrel shifter can be used without performance penalty with most arithmetic instructions and address calculations.
  • Powerful indexed addressing modes.
  • A link register supports fast leaf function calls.
  • A simple but fast two-priority-level interrupt subsystem with switched register banks.

Arithmetic instructions

ARM includes integer arithmetic operations for add, subtract, and multiply; some versions of the architecture also support divide operations.

ARM supports 32-bit × 32-bit multiplies with either a 32-bit result or 64-bit result, though Cortex-M0 / M0+ / M1 cores do not support 64-bit results.[121] Some ARM cores also support 16-bit × 16-bit and 32-bit × 16-bit multiplies.
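
Concretely, the long-multiply forms (UMULL/SMULL in assembly) return the full 64-bit product split across two 32-bit registers. A minimal C model of the unsigned case (the function name is illustrative):

```c
#include <stdint.h>

/* Model of a 32x32 -> 64-bit unsigned multiply, as UMULL computes it:
   the full product is split across two 32-bit destination registers. */
static void umull_model(uint32_t a, uint32_t b,
                        uint32_t *rd_lo, uint32_t *rd_hi) {
    uint64_t product = (uint64_t)a * (uint64_t)b;
    *rd_lo = (uint32_t)product;          /* low  half of the result */
    *rd_hi = (uint32_t)(product >> 32);  /* high half of the result */
}
```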

The divide instructions are only included in the following ARM architectures:

  • Armv7-M and Armv7E-M architectures always include divide instructions.[122]
  • Armv7-R architecture always includes divide instructions in the Thumb instruction set, but optionally in its 32-bit instruction set.[123]
  • Armv7-A architecture optionally includes the divide instructions. The instructions might not be implemented, or implemented only in the Thumb instruction set, or implemented in both the Thumb and ARM instruction sets, or implemented if the Virtualization Extensions are included.[123]

Registers

Registers across CPU modes
usr sys svc abt und irq fiq
R0
R1
R2
R3
R4
R5
R6
R7
R8 R8_fiq
R9 R9_fiq
R10 R10_fiq
R11 R11_fiq
R12 R12_fiq
R13 R13_svc R13_abt R13_und R13_irq R13_fiq
R14 R14_svc R14_abt R14_und R14_irq R14_fiq
R15
CPSR
SPSR_svc SPSR_abt SPSR_und SPSR_irq SPSR_fiq

Registers R0 through R7 are the same across all CPU modes; they are never banked.

Registers R8 through R12 are the same across all CPU modes except FIQ mode. FIQ mode has its own distinct R8 through R12 registers.

R13 and R14 are banked across all privileged CPU modes except system mode. That is, each mode that can be entered because of an exception has its own R13 and R14. These registers generally contain the stack pointer and the return address from function calls, respectively.

Aliases: R13 is also known as SP, the stack pointer; R14 as LR, the link register; and R15 as PC, the program counter.

The Current Program Status Register (CPSR) has the following 32 bits.[124]

  • M (bits 0–4) are the processor mode bits.
  • T (bit 5) is the Thumb state bit.
  • F (bit 6) is the FIQ disable bit.
  • I (bit 7) is the IRQ disable bit.
  • A (bit 8) is the imprecise data abort disable bit.
  • E (bit 9) is the data endianness bit.
  • IT (bits 10–15 and 25–26) are the if-then state bits.
  • GE (bits 16–19) are the greater-than-or-equal-to bits.
  • DNM (bits 20–23) are the do-not-modify bits.
  • J (bit 24) is the Java state bit.
  • Q (bit 27) is the sticky overflow bit.
  • V (bit 28) is the overflow bit.
  • C (bit 29) is the carry/borrow/extend bit.
  • Z (bit 30) is the zero bit.
  • N (bit 31) is the negative/less-than bit.
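
The layout above can be decoded with ordinary shifts and masks; a small illustrative C sketch (field selection only, not an architectural definition):

```c
#include <stdint.h>

/* Pick apart the main CPSR fields by bit position. */
struct cpsr_fields {
    unsigned n, z, c, v;  /* condition flags, bits 31-28 */
    unsigned q;           /* sticky overflow, bit 27     */
    unsigned i, f;        /* IRQ/FIQ disable, bits 7, 6  */
    unsigned t;           /* Thumb state, bit 5          */
    unsigned mode;        /* mode field, bits 4-0        */
};

static struct cpsr_fields cpsr_decode(uint32_t cpsr) {
    struct cpsr_fields fl;
    fl.n    = (cpsr >> 31) & 1u;
    fl.z    = (cpsr >> 30) & 1u;
    fl.c    = (cpsr >> 29) & 1u;
    fl.v    = (cpsr >> 28) & 1u;
    fl.q    = (cpsr >> 27) & 1u;
    fl.i    = (cpsr >> 7) & 1u;
    fl.f    = (cpsr >> 6) & 1u;
    fl.t    = (cpsr >> 5) & 1u;
    fl.mode = cpsr & 0x1Fu;
    return fl;
}
```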

Conditional execution

Almost every ARM instruction has a conditional execution feature called predication, which is implemented with a 4-bit condition code selector (the predicate). To allow for unconditional execution, one of the sixteen codes (AL, "always") causes the instruction to always be executed. Most other CPU architectures have condition codes only on branch instructions.[125]

Though the predicate takes up four of the 32 bits in an instruction code, and thus cuts down significantly on the encoding bits available for displacements in memory access instructions, it avoids branch instructions when generating code for small if statements. Apart from eliminating the branch instructions themselves, this preserves the fetch/decode/execute pipeline at the cost of only one cycle per skipped instruction.

An algorithm that provides a good example of conditional execution is the subtraction-based Euclidean algorithm for computing the greatest common divisor. In the C programming language, the algorithm can be written as:

int gcd(int a, int b) {
  while (a != b)  // We enter the loop when a < b or a > b, but not when a == b
    if (a > b)   // When a > b we do this
      a -= b;
    else         // When a < b we do that (no "if (a < b)" needed since a != b is checked in while condition)
      b -= a;
  return a;
}

The same algorithm can be rewritten in a way closer to target ARM instructions as:

loop:
    // Compare a and b
    GT = a > b;
    LT = a < b;
    NE = a != b;

    // Perform operations based on flag results
    if (GT) a -= b;    // Subtract *only* if greater-than
    if (LT) b -= a;    // Subtract *only* if less-than
    if (NE) goto loop; // Loop *only* if compared values were not equal
    return a;

and coded in assembly language as:

; assign a to register r0, b to r1
loop:   CMP    r0, r1       ; set condition "NE" if (a ≠ b),
                            ;               "GT" if (a > b),
                            ;            or "LT" if (a < b)
        SUBGT  r0, r0, r1   ; if "GT" (Greater Than), then a = a − b
        SUBLT  r1, r1, r0   ; if "LT" (Less    Than), then b = b − a
        BNE    loop         ; if "NE" (Not Equal), then loop
        BX     lr           ; return

which avoids the branches around the then and else clauses. If r0 and r1 are equal then neither of the SUB instructions is executed, eliminating the need for a conditional branch to implement the while check at the top of the loop, which would have been needed had, for example, SUBLE (less than or equal) been used.

One of the ways that Thumb code provides a more dense encoding is to remove the four-bit selector from non-branch instructions.

Other features

Another feature of the instruction set is the ability to fold shifts and rotates into the data processing (arithmetic, logical, and register-register move) instructions, so that, for example, the statement in C language:

a += (j << 2);

could be rendered as a one-word, one-cycle instruction:[126]

ADD  Ra, Ra, Rj, LSL #2

This results in the typical ARM program being denser than expected with fewer memory accesses; thus the pipeline is used more efficiently.

The ARM processor also has features rarely seen in other RISC architectures, such as PC-relative addressing (indeed, on the 32-bit[1] ARM the PC is one of its 16 registers) and pre- and post-increment addressing modes.

The ARM instruction set has increased over time. Some early ARM processors (before ARM7TDMI), for example, have no instruction to store a two-byte quantity.

Pipelines and other implementation issues

The ARM7 and earlier implementations have a three-stage pipeline; the stages being fetch, decode, and execute. Higher-performance designs, such as the ARM9, have deeper pipelines: Cortex-A8 has thirteen stages. Additional implementation changes for higher performance include a faster adder and more extensive branch prediction logic. The difference between the ARM7DI and ARM7DMI cores, for example, was an improved multiplier; hence the added "M".

Coprocessors

The ARM architecture (pre-Armv8) provides a non-intrusive way of extending the instruction set using "coprocessors" that can be addressed using MCR, MRC, MRRC, MCRR, and similar instructions. The coprocessor space is divided logically into 16 coprocessors with numbers from 0 to 15, coprocessor 15 (cp15) being reserved for some typical control functions like managing the caches and MMU operation on processors that have one.

In ARM-based machines, peripheral devices are usually attached to the processor by mapping their physical registers into ARM memory space, into the coprocessor space, or by connecting to another device (a bus) that in turn attaches to the processor. Coprocessor accesses have lower latency, so some peripherals—for example, an XScale interrupt controller—are accessible in both ways: through memory and through coprocessors.

In other cases, chip designers only integrate hardware using the coprocessor mechanism. For example, an image processing engine might be a small ARM7TDMI core combined with a coprocessor that has specialised operations to support a specific set of HDTV transcoding primitives.
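
As a concrete example, reading the cp15 Main ID register, which identifies the processor, is done with an MRC instruction (a standard privileged pre-Armv8 sequence; the choice of r0 as the destination here is arbitrary):

```
MRC p15, 0, r0, c0, c0, 0   ; read the cp15 Main ID Register into r0
```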

Debugging

All modern ARM processors include hardware debugging facilities, allowing software debuggers to perform operations such as halting, stepping, and breakpointing of code starting from reset. These facilities are built using JTAG support, though some newer cores optionally support ARM's own two-wire "SWD" protocol. In ARM7TDMI cores, the "D" represented JTAG debug support, and the "I" represented presence of an "EmbeddedICE" debug module. For ARM7 and ARM9 core generations, EmbeddedICE over JTAG was a de facto debug standard, though not architecturally guaranteed.

The ARMv7 architecture defines basic debug facilities at an architectural level. These include breakpoints, watchpoints and instruction execution in a "Debug Mode"; similar facilities were also available with EmbeddedICE. Both "halt mode" and "monitor" mode debugging are supported. The actual transport mechanism used to access the debug facilities is not architecturally specified, but implementations generally include JTAG support.

There is a separate ARM "CoreSight" debug architecture, which is not architecturally required by ARMv7 processors.

Debug Access Port

The Debug Access Port (DAP) is an implementation of an ARM Debug Interface.[127] There are two different supported implementations, the Serial Wire JTAG Debug Port (SWJ-DP) and the Serial Wire Debug Port (SW-DP).[128] CMSIS-DAP is a standard interface that describes how various debugging software on a host PC can communicate over USB to firmware running on a hardware debugger, which in turn talks over SWD or JTAG to a CoreSight-enabled ARM Cortex CPU.[129][130][131]

DSP enhancement instructions

To improve the ARM architecture for digital signal processing and multimedia applications, DSP instructions were added to the instruction set.[132] These are signified by an "E" in the name of the ARMv5TE and ARMv5TEJ architectures. E-variants also imply T, D, M, and I.

The new instructions are common in digital signal processor (DSP) architectures. They include variations on signed multiply–accumulate, saturated add and subtract, and count leading zeros.
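
Saturated arithmetic differs from ordinary wrap-around arithmetic in that an out-of-range result is clamped to the nearest representable value; this is what the QADD instruction of the E-variants does (and what sets the sticky Q flag). A hedged scalar C model, purely illustrative:

```c
#include <stdint.h>

/* Model of a signed 32-bit saturating add (QADD-style semantics):
   results outside the int32_t range are clamped rather than wrapped. */
static int32_t qadd_model(int32_t a, int32_t b) {
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```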

First introduced in 1999, this extension of the core instruction set contrasted with ARM's earlier DSP coprocessor known as Piccolo, which employed a distinct, incompatible instruction set whose execution involved a separate program counter.[133] Piccolo instructions employed a distinct register file of sixteen 32-bit registers, with some instructions combining registers for use as 48-bit accumulators and other instructions addressing 16-bit half-registers. Some instructions were able to operate on two such 16-bit values in parallel. Communication with the Piccolo register file involved load to Piccolo and store from Piccolo coprocessor instructions via two buffers of eight 32-bit entries. Described as reminiscent of other approaches, notably Hitachi's SH-DSP and Motorola's 68356, Piccolo did not employ dedicated local memory and relied on the bandwidth of the ARM core for DSP operand retrieval, impacting concurrent performance.[134] Piccolo's distinct instruction set also proved not to be a "good compiler target".[133]

SIMD extensions for multimedia

Introduced in the ARMv6 architecture, this was a precursor to Advanced SIMD, also named Neon.[135]

Jazelle

Jazelle DBX (Direct Bytecode eXecution) is a technique that allows Java bytecode to be executed directly in the ARM architecture as a third execution state (and instruction set) alongside the existing ARM and Thumb-mode. Support for this state is signified by the "J" in the ARMv5TEJ architecture, and in ARM9EJ-S and ARM7EJ-S core names. Support for this state is required starting in ARMv6 (except for the ARMv7-M profile), though newer cores only include a trivial implementation that provides no hardware acceleration.

Thumb

To improve compiled code density, processors since the ARM7TDMI (released in 1994[136]) have featured the Thumb compressed instruction set, which has its own state. (The "T" in "TDMI" indicates the Thumb feature.) When in this state, the processor executes the Thumb instruction set, a compact 16-bit encoding for a subset of the ARM instruction set.[137] Most of the Thumb instructions map directly to normal ARM instructions. The space saving comes from making some of the instruction operands implicit and limiting the number of possibilities compared to the ARM instructions executed in the ARM instruction set state.

In Thumb, the 16-bit opcodes have less functionality. For example, only branches can be conditional, and many opcodes are restricted to accessing only half of all of the CPU's general-purpose registers. The shorter opcodes give improved code density overall, even though some operations require extra instructions. In situations where the memory port or bus width is constrained to less than 32 bits, the shorter Thumb opcodes allow increased performance compared with 32-bit ARM code, as less program code may need to be loaded into the processor over the constrained memory bandwidth.

Unlike processor architectures with variable length (16- or 32-bit) instructions, such as the Cray-1 and Hitachi SuperH, the ARM and Thumb instruction sets exist independently of each other. Embedded hardware, such as the Game Boy Advance, typically have a small amount of RAM accessible with a full 32-bit datapath; the majority is accessed via a 16-bit or narrower secondary datapath. In this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using full 32-bit ARM instructions, placing these wider instructions into the 32-bit bus accessible memory.

The first processor with a Thumb instruction decoder was the ARM7TDMI. All processors supporting 32-bit instruction sets, starting with the ARM9 and including XScale, have included a Thumb instruction decoder. Thumb includes instructions adopted from the Hitachi SuperH (1992), which was licensed by ARM.[138] ARM's smallest processor families (Cortex-M0 and Cortex-M1) implement only the 16-bit Thumb instruction set for maximum performance in lowest-cost applications. ARM processors that don't support 32-bit addressing also omit Thumb.

Thumb-2

Thumb-2 technology was introduced in the ARM1156 core, announced in 2003. Thumb-2 extends the limited 16-bit instruction set of Thumb with additional 32-bit instructions to give the instruction set more breadth, thus producing a variable-length instruction set. A stated aim for Thumb-2 was to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.

Thumb-2 extends the Thumb instruction set with bit-field manipulation, table branches and conditional execution. At the same time, the ARM instruction set was extended to maintain equivalent functionality in both instruction sets. A new "Unified Assembly Language" (UAL) supports generation of either Thumb or ARM instructions from the same source code; versions of Thumb seen on ARMv7 processors are essentially as capable as ARM code (including the ability to write interrupt handlers). This requires a bit of care, and use of a new "IT" (if-then) instruction, which permits up to four successive instructions to execute based on a tested condition, or on its inverse. When compiling into ARM code, this is ignored, but when compiling into Thumb it generates an actual instruction. For example:

; if (r0 == r1)
CMP r0, r1
ITE EQ        ; ARM: no code ... Thumb: IT instruction
; then r0 = r2;
MOVEQ r0, r2  ; ARM: conditional; Thumb: condition via ITE 'T' (then)
; else r0 = r3;
MOVNE r0, r3  ; ARM: conditional; Thumb: condition via ITE 'E' (else)
; recall that the Thumb MOV instruction has no bits to encode "EQ" or "NE".

All ARMv7 chips support the Thumb instruction set. All chips in the Cortex-A series that support ARMv7, all Cortex-R series, and all ARM11 series support both "ARM instruction set state" and "Thumb instruction set state", while chips in the Cortex-M series support only the Thumb instruction set.[139][140][141]

Thumb Execution Environment (ThumbEE)

ThumbEE (erroneously called Thumb-2EE in some ARM documentation), which was marketed as Jazelle RCT[142] (Runtime Compilation Target), was announced in 2005 and deprecated in 2011. It first appeared in the Cortex-A8 processor. ThumbEE is a fourth instruction set state, making small changes to the Thumb-2 extended instruction set. These changes make the instruction set particularly suited to code generated at runtime (e.g. by JIT compilation) in managed Execution Environments. ThumbEE is a target for languages such as Java, C#, Perl, and Python, and allows JIT compilers to output smaller compiled code without reducing performance.[citation needed]

New features provided by ThumbEE include automatic null pointer checks on every load and store instruction, an instruction to perform an array bounds check, and special instructions that call a handler. In addition, because it utilises Thumb-2 technology, ThumbEE provides access to registers r8–r15 (where the Jazelle/DBX Java VM state is held).[143] Handlers are small sections of frequently called code, commonly used to implement high level languages, such as allocating memory for a new object. These changes come from repurposing a handful of opcodes, and knowing the core is in the new ThumbEE state.

On 23 November 2011, Arm deprecated any use of the ThumbEE instruction set,[144] and Armv8 removes support for ThumbEE.

Floating-point (VFP)

VFP (Vector Floating Point) technology is a floating-point unit (FPU) coprocessor extension to the ARM architecture[145] (implemented differently in Armv8 – coprocessors not defined there). It provides low-cost single-precision and double-precision floating-point computation fully compliant with the ANSI/IEEE Std 754-1985 Standard for Binary Floating-Point Arithmetic. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture was intended to support execution of short "vector mode" instructions but these operated on each vector element sequentially and thus did not offer the performance of true single instruction, multiple data (SIMD) vector parallelism. This vector mode was therefore removed shortly after its introduction,[146] to be replaced with the much more powerful Advanced SIMD, also named Neon.

Some devices such as the ARM Cortex-A8 have a cut-down VFPLite module instead of a full VFP module, and require roughly ten times more clock cycles per float operation.[147] Pre-Armv8 architecture implemented floating-point/SIMD with the coprocessor interface. Other floating-point and/or SIMD units found in ARM-based processors using the coprocessor interface include FPA, FPE, iwMMXt, some of which were implemented in software by trapping but could have been implemented in hardware. They provide some of the same functionality as VFP but are not opcode-compatible with it. FPA10 also provides extended precision, but implements correct rounding (required by IEEE 754) only in single precision.[148]

VFPv1
Obsolete
VFPv2
An optional extension to the ARM instruction set in the ARMv5TE, ARMv5TEJ and ARMv6 architectures. VFPv2 has 16 64-bit FPU registers.
VFPv3 or VFPv3-D32
Implemented on most Cortex-A8 and A9 ARMv7 processors. It is backward-compatible with VFPv2, except that it cannot trap floating-point exceptions. VFPv3 has 32 64-bit FPU registers as standard, adds VCVT instructions to convert between scalar, float and double, adds immediate mode to VMOV such that constants can be loaded into FPU registers.
VFPv3-D16
As above, but with only 16 64-bit FPU registers. Implemented on Cortex-R4 and R5 processors and the Tegra 2 (Cortex-A9).
VFPv3-F16
Uncommon; it supports IEEE754-2008 half-precision (16-bit) floating point as a storage format.
VFPv4 or VFPv4-D32
Implemented on Cortex-A12 and A15 ARMv7 processors, Cortex-A7 optionally has VFPv4-D32 in the case of an FPU with Neon.[149] VFPv4 has 32 64-bit FPU registers as standard, adds both half-precision support as a storage format and fused multiply-accumulate instructions to the features of VFPv3.
VFPv4-D16
As above, but it has only 16 64-bit FPU registers. Implemented on Cortex-A5 and A7 processors in the case of an FPU without Neon.[149]
VFPv5-D16-M
Implemented on Cortex-M7 when single and double-precision floating-point core option exists.
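
The half-precision storage format referred to by VFPv3-F16 (and made standard in VFPv4) packs a value into 1 sign bit, a 5-bit exponent with bias 15, and a 10-bit significand. A hedged C sketch of widening such a bit pattern to single precision, in the spirit of the VCVT-style conversions (the function is an illustration, not the hardware algorithm):

```c
#include <math.h>
#include <stdint.h>

/* Decode an IEEE 754 half-precision bit pattern (1s/5e/10m, bias 15)
   into a float. */
static float half_to_float(uint16_t h) {
    unsigned sign = (h >> 15) & 1u;
    unsigned exp  = (h >> 10) & 0x1Fu;
    unsigned mant = h & 0x3FFu;
    float value;
    if (exp == 0)                 /* zero or subnormal: mant * 2^-24  */
        value = ldexpf((float)mant, -24);
    else if (exp == 31)           /* infinity or NaN                  */
        value = mant ? NAN : INFINITY;
    else                          /* normal: (1024 + mant) * 2^(e-25) */
        value = ldexpf((float)(1024 + mant), (int)exp - 25);
    return sign ? -value : value;
}
```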

In Debian Linux and derivatives such as Ubuntu and Linux Mint, armhf (ARM hard float) refers to the ARMv7 architecture including the additional VFPv3-D16 floating-point hardware extension (and Thumb-2) above. Software packages and cross-compiler tools use the armhf vs. arm/armel suffixes to differentiate.[150]

Advanced SIMD (Neon)

The Advanced SIMD extension (also known as Neon or "MPE" Media Processing Engine) is a combined 64- and 128-bit SIMD instruction set that provides standardised acceleration for media and signal processing applications. Neon is included in all Cortex-A8 devices, but is optional in Cortex-A9 devices.[151] Neon can execute MP3 audio decoding on CPUs running at 10 MHz, and can run the GSM adaptive multi-rate (AMR) speech codec at 13 MHz. It features a comprehensive instruction set, separate register files, and independent execution hardware.[152] Neon supports 8-, 16-, 32-, and 64-bit integer and single-precision (32-bit) floating-point data and SIMD operations for handling audio and video processing as well as graphics and gaming processing. In Neon, the SIMD supports up to 16 operations at the same time. The Neon hardware shares the same floating-point registers as used in VFP. Devices such as the ARM Cortex-A8 and Cortex-A9 support 128-bit vectors, but will execute with 64 bits at a time,[147] whereas some more powerful CPUs such as Cortex-A15 can execute 128 bits at a time.[153][154]
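
The lane-parallel flavour of these operations can be illustrated with a scalar C model of an 8-lane, 8-bit integer add across a 64-bit vector register (the behaviour of Neon's VADD.I8 on a D register); note that each lane wraps independently, with no carry between lanes:

```c
#include <stdint.h>

/* Scalar model of an 8x8-bit lanewise add on a 64-bit vector:
   each byte lane is added modulo 256, with no inter-lane carry. */
static uint64_t vadd_i8_model(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t la = (uint8_t)(a >> (8 * lane));
        uint8_t lb = (uint8_t)(b >> (8 * lane));
        result |= (uint64_t)(uint8_t)(la + lb) << (8 * lane);
    }
    return result;
}
```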

A quirk of Neon in Armv7 devices is that it flushes all subnormal numbers to zero, and as a result the GCC compiler will not use it unless -funsafe-math-optimizations, which allows losing denormals, is turned on. "Enhanced" Neon defined since Armv8 does not have this quirk, but as of GCC 8.2 the same flag is still required to enable Neon instructions.[155] On the other hand, GCC does consider Neon safe on AArch64 for Armv8.
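
Flush-to-zero semantics can be modelled in scalar C: any subnormal value is replaced by a zero of the same sign, which is precisely the deviation from IEEE 754 behaviour described above. An illustrative sketch:

```c
#include <math.h>

/* Model of flush-to-zero: subnormal inputs become (signed) zero. */
static float flush_to_zero(float x) {
    if (fpclassify(x) == FP_SUBNORMAL)
        return copysignf(0.0f, x);
    return x;
}
```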

ProjectNe10 is ARM's first project that was open source from its inception (ARM also acquired an older project, now named Mbed TLS). The Ne10 library is a set of common, useful functions written in both Neon and C (for compatibility). The library was created to allow developers to use Neon optimisations without learning Neon, but it also serves as a set of highly optimised Neon intrinsic and assembly code examples for common DSP, arithmetic, and image processing routines. The source code is available on GitHub.[156]

ARM Helium technology

Helium is the M-Profile Vector Extension (MVE). It adds more than 150 scalar and vector instructions.[157]

Security extensions

TrustZone (for Cortex-A profile)

The Security Extensions, marketed as TrustZone Technology, is in ARMv6KZ and later application profile architectures. It provides a low-cost alternative to adding another dedicated security core to an SoC, by providing two virtual processors backed by hardware based access control. This lets the application core switch between two states, referred to as worlds (to reduce confusion with other names for capability domains), to prevent information leaking from the more trusted world (the Secure world) to the less trusted world (the Normal world).[158] This world switch is generally orthogonal to all other capabilities of the processor, thus each world can operate independently of the other while using the same core. Memory and peripherals are then made aware of the operating world of the core and may use this to provide access control to secrets and code on the device.[159]

Typically, a rich operating system is run in the less trusted world, with smaller security-specialised code in the more trusted world, aiming to reduce the attack surface. Typical applications include DRM functionality for controlling the use of media on ARM-based devices,[160] and preventing any unapproved use of the device.

In practice, since the specific implementation details of proprietary TrustZone implementations have not been publicly disclosed for review, it is unclear what level of assurance is provided for a given threat model, but they are not immune from attack.[161][162]

Open Virtualization[163] is an open source implementation of the trusted world architecture for TrustZone.

AMD has licensed and incorporated TrustZone technology into its Secure Processor Technology.[164] AMD's APUs include a Cortex-A5 processor for handling secure processing, which is enabled in some, but not all products.[165][166][167] In fact, the Cortex-A5 TrustZone core had been included in earlier AMD products, but was not enabled due to time constraints.[166]

Samsung Knox uses TrustZone for purposes such as detecting modifications to the kernel, storing certificates and attesting keys.[168]

TrustZone for Armv8-M (for Cortex-M profile)

The Security Extension, marketed as TrustZone for Armv8-M Technology, was introduced in the Armv8-M architecture. While containing similar concepts to TrustZone for Armv8-A, it has a different architectural design, as world switching is performed using branch instructions instead of using exceptions.[169] It also supports safe interleaved interrupt handling from either world regardless of the current security state. Together these features provide low latency calls to the secure world and responsive interrupt handling. ARM provides a reference stack of secure world code in the form of Trusted Firmware for M and PSA Certified.

No-execute page protection

As of ARMv6, the ARM architecture supports no-execute page protection, which is referred to as XN, for eXecute Never.[170]

Large Physical Address Extension (LPAE)

The Large Physical Address Extension (LPAE), which extends the physical address size from 32 bits to 40 bits, was added to the Armv7-A architecture in 2011.[171]

The physical address size may be even larger in processors based on the 64-bit (Armv8-A) architecture. For example, it is 44 bits in Cortex-A75 and Cortex-A65AE.[172]
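
The arithmetic behind these widths: an n-bit physical address reaches 2^n bytes, so 32 bits address 4 GiB, 40 bits address 1 TiB, and 44 bits address 16 TiB. A trivial check:

```c
#include <stdint.h>

/* Addressable bytes for a given physical address width in bits. */
static uint64_t phys_space(unsigned bits) {
    return (uint64_t)1 << bits;
}
```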

Armv8-R and Armv8-M

The Armv8-R and Armv8-M architectures, announced after the Armv8-A architecture, share some features with Armv8-A. However, Armv8-M does not include any 64-bit AArch64 instructions, and Armv8-R originally did not include any AArch64 instructions; those instructions were added to Armv8-R later.

Armv8.1-M

The Armv8.1-M architecture, announced in February 2019, is an enhancement of the Armv8-M architecture. It brings new features including:

  • A new vector instruction set extension. The M-Profile Vector Extension (MVE), or Helium, is for signal processing and machine learning applications.
  • Additional instruction set enhancements for loops and branches (Low Overhead Branch Extension).
  • Instructions for half-precision floating-point support.
  • Instruction set enhancement for TrustZone management for Floating Point Unit (FPU).
  • New memory attribute in the Memory Protection Unit (MPU).
  • Enhancements in debug including Performance Monitoring Unit (PMU), Unprivileged Debug Extension, and additional debug support focus on signal processing application developments.
  • Reliability, Availability and Serviceability (RAS) extension.

64/32-bit architecture

Armv8-A Platform with Cortex A57/A53 MPCore big.LITTLE CPU chip

Armv8

Armv8-A

Announced in October 2011,[13] Armv8-A (often called simply ARMv8, although Armv8-R also exists) represents a fundamental change to the ARM architecture. It supports two Execution states: a 64-bit state named AArch64 and a 32-bit state named AArch32. In the AArch64 state, a new 64-bit A64 instruction set is supported; in the AArch32 state, two instruction sets are supported: the original 32-bit instruction set, named A32, and the 32-bit Thumb-2 instruction set, named T32. AArch32 provides user-space compatibility with Armv7-A. The processor state can change on an Exception level change; this allows 32-bit applications to be executed in AArch32 state under a 64-bit OS whose kernel executes in AArch64 state, and allows a 32-bit OS to run in AArch32 state under the control of a 64-bit hypervisor running in AArch64 state.[1] ARM announced their Cortex-A53 and Cortex-A57 cores on 30 October 2012.[78] Apple was the first to release an Armv8-A compatible core in a consumer product (the Apple A7 in the iPhone 5S), while AppliedMicro, using an FPGA, was the first to demo Armv8-A.[173] The first Armv8-A SoC from Samsung is the Exynos 5433 used in the Galaxy Note 4, which features two clusters of four Cortex-A57 and Cortex-A53 cores in a big.LITTLE configuration, but it runs only in AArch32 mode.[174]

Armv8-A makes VFPv3/v4 and Advanced SIMD (Neon) standard in both AArch32 and AArch64. It also adds cryptography instructions supporting AES, SHA-1/SHA-256 and finite field arithmetic.[175] AArch64 was introduced in Armv8-A and is present in its subsequent revisions; it is not included in the 32-bit Armv8-R and Armv8-M architectures.

An ARMv8-A processor can support one or both of AArch32 and AArch64; it may support AArch32 and AArch64 at lower Exception levels and only AArch64 at higher Exception levels.[176] For example, the ARM Cortex-A32 supports only AArch32,[177] the ARM Cortex-A34 supports only AArch64,[178] and the ARM Cortex-A72 supports both AArch64 and AArch32.[179] An ARMv9-A processor must support AArch64 at all Exception levels, and may support AArch32 at EL0.[176]
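Whether a given chip actually implements the optional Armv8-A extensions (such as the cryptography instructions) is exposed to software at runtime. As a minimal sketch, on ARM Linux the kernel lists feature flags on a "Features" line in /proc/cpuinfo; the parser below extracts those flags. The sample text is a hypothetical excerpt, not taken from any specific device.

```python
# Sketch: parse the "Features" line that Linux exposes in /proc/cpuinfo on
# ARM systems, then check for the optional crypto extensions (aes, sha1,
# sha2). SAMPLE is an invented example of what such a file can contain.

def parse_arm_features(cpuinfo_text: str) -> set[str]:
    """Collect the flags listed on any 'Features' line of /proc/cpuinfo."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("features"):
            _, _, value = line.partition(":")
            flags.update(value.split())
    return flags

SAMPLE = """\
processor\t: 0
model name\t: ARMv8 Processor rev 3 (v8l)
Features\t: fp asimd evtstrm aes pmull sha1 sha2 crc32
"""

flags = parse_arm_features(SAMPLE)
print("crypto extensions present:", {"aes", "sha1", "sha2"} <= flags)
```

On a real system one would read the text from /proc/cpuinfo (or query the arm64 ELF hwcaps via getauxval) instead of using the sample string.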

Armv8-R


Optional AArch64 support was added to the Armv8-R profile, with the first ARM core implementing it being the Cortex-R82.[180] It adds the A64 instruction set.

Armv9


Armv9-A


Announced in March 2021, the updated architecture places a focus on secure execution and compartmentalisation.[181][182] The first ARMv9-A processors were released later that year, including the Cortex-A510, Cortex-A710 and Cortex-X2.

Arm SystemReady


Arm SystemReady is a compliance program that helps ensure the interoperability of an operating system on Arm-based hardware from datacenter servers to industrial edge and IoT devices. The key building blocks of the program are the specifications for minimum hardware and firmware requirements that the operating systems and hypervisors can rely upon. These specifications are:[183]

  • Base System Architecture (BSA)[184] and the market segment specific supplements (e.g., Server BSA supplement)[185]
  • Base Boot Requirements (BBR)[186] and Base Boot Security Requirements (BBSR)[187]

These specifications are co-developed by Arm and its partners in the System Architecture Advisory Committee (SystemArchAC).

The Architecture Compliance Suite (ACS) is a set of test tools that help check compliance with these specifications. The Arm SystemReady Requirements Specification documents the requirements of the certifications.[188]

This program was introduced by Arm in 2020 at the first DevSummit event. Its predecessor Arm ServerReady was introduced in 2018 at the Arm TechCon event. This program currently includes two bands:

  • SystemReady Band: this band focuses on operating system interoperability for Advanced Configuration and Power Interface (ACPI) environments, where generic operating systems can be installed on either new or old hardware without modification. This band is relevant for systems using Windows, Linux, VMware, and BSD environments.[189]
  • SystemReady Devicetree Band: this band optimizes install and boot for embedded systems where devicetree is the preferred method of describing hardware, with a focus on forward compatibility. This applies to Linux distributions and BSD environments specifically.[190]

PSA Certified


PSA Certified, formerly named Platform Security Architecture, is an architecture-agnostic security framework and evaluation scheme. It is intended to help secure Internet of things (IoT) devices built on system-on-a-chip (SoC) processors.[191] It was introduced to increase security where a full trusted execution environment is too large or complex.[192]

The architecture was introduced by Arm in 2017 at the annual TechCon event.[192][193] Although the scheme is architecture agnostic, it was first implemented on Arm Cortex-M processor cores intended for microcontroller use. PSA Certified includes freely available threat models and security analyses that demonstrate the process for deciding on security features in common IoT products.[194] It also provides freely downloadable application programming interface (API) packages, architectural specifications, open-source firmware implementations, and related test suites.[195]

Following the development of the architecture security framework in 2017, the PSA Certified assurance scheme launched two years later at Embedded World in 2019.[196] PSA Certified offers a multi-level security evaluation scheme for chip vendors, OS providers and IoT device makers.[197] The Embedded World presentation introduced chip vendors to Level 1 Certification. A draft of Level 2 protection was presented at the same time.[198] Level 2 certification became a usable standard in February 2020.[199]

The certification was created by PSA Joint Stakeholders to enable a security-by-design approach for a diverse set of IoT products. PSA Certified specifications are implementation and architecture agnostic; as a result, they can be applied to any chip, software or device.[200][198] The certification also reduces industry fragmentation for IoT product manufacturers and developers.[201]

Operating system support


32-bit operating systems


Historical operating systems


The first 32-bit ARM-based personal computer, the Acorn Archimedes, was originally intended to run an ambitious operating system called ARX. The machines shipped with RISC OS, which was also used on later ARM-based systems from Acorn and other vendors. Some early Acorn machines were also able to run a Unix port called RISC iX. (Neither is to be confused with RISC/os, a contemporary Unix variant for the MIPS architecture.)

Embedded operating systems


The 32-bit ARM architecture is supported by a large number of embedded and real-time operating systems, including:

Mobile device operating systems


The 32-bit ARM architecture was long the primary hardware environment for most mobile device operating systems, such as the following, though as of March 2024 many of these platforms, including Android and Apple iOS, have moved to the 64-bit ARM architecture:

Formerly, but now discontinued:

Desktop and server operating systems


The 32-bit ARM architecture is supported by RISC OS and by multiple Unix-like operating systems including:

64-bit operating systems


Embedded operating systems


Mobile device operating systems


Desktop and server operating systems


Porting to 32- or 64-bit ARM operating systems


Windows applications recompiled for ARM and linked with Winelib, from the Wine project, can run on 32-bit or 64-bit ARM in Linux, FreeBSD, or other compatible operating systems.[231][232] x86 binaries not specially compiled for ARM have been demonstrated running on ARM using QEMU with Wine (on Linux and other systems),[citation needed] but they do not run at full speed or with the same capability as with Winelib.

Notes


See also


References


Further reading

from Grokipedia
The ARM architecture family is a reduced instruction set computing (RISC) instruction set architecture (ISA) designed for efficient, low-power processors, defining the rules for how software interacts with hardware to ensure compatibility across billions of devices worldwide. Originating from designs at Acorn Computers in the 1980s, it emphasizes energy efficiency, scalability, and versatility, powering everything from smartphones and embedded systems to servers and automotive controllers through a licensing model where ARM provides intellectual property (IP) cores rather than fabricating chips. The architecture has evolved through multiple versions, from the initial Armv1 in 1985 to the current Armv9, incorporating advancements in performance, security, and AI acceleration while maintaining backward compatibility. ARM's CPU architecture is divided into three main profiles tailored to distinct applications:
  • The A-profile, the most prominent, for high-performance, general-purpose computing in devices like smartphones, PCs, and servers. It supports rich operating systems and has progressed from Armv8-A (introduced in 2011 with 64-bit AArch64 execution) to Armv9-A (launched in 2021), which adds scalable vector extensions for AI workloads and enhanced security features like confidential computing.
  • The R-profile (up to Armv8-R) for real-time, deterministic operations in safety-critical systems such as automotive braking and medical equipment, prioritizing low-latency responses.
  • The M-profile (up to Armv8-M) for low-power microcontrollers in IoT sensors, wearables, and smart home devices, focusing on minimal code size and power consumption with optional TrustZone security.
Key milestones in the architecture's development include the founding of ARM Ltd in 1990 as a joint venture between Acorn, Apple, and VLSI Technology, shifting to an IP licensing business that enabled widespread adoption. Early versions like Armv4 (1990s) introduced the compact Thumb instruction set for embedded efficiency, while Armv6 (2004) added SIMD capabilities and multi-core support; Armv7 (2006) mandated Thumb-2 for better code density and debuted the Cortex processor family. By Armv8, the architecture achieved full 64-bit support, and Armv9 further integrates matrix extensions for AI workloads, with over 325 billion ARM-based chips shipped to date, underscoring its dominance in mobile (e.g., 99% of smartphones) and emerging AI ecosystems. This evolution reflects ARM's focus on balancing power, performance, and security across diverse markets.

History

Origins in Acorn Computers

The development of the ARM architecture began in 1983 at Acorn Computers, a British firm known for its BBC Micro home computer, which relied on the 8-bit MOS Technology 6502 processor. As Acorn sought a successor to enable a shift to 32-bit processing for future systems, engineers Sophie Wilson and Steve Furber led the effort, with Wilson designing the instruction set and Furber handling the overall chip architecture. The project was motivated by the need for a low-cost, high-performance CPU amid intensifying competition from 16- and 32-bit rivals such as the Motorola 68000. Drawing on emerging RISC principles from academic research at institutions such as the University of California, Berkeley, the team prioritized simplicity to minimize transistor count and power consumption. The design incorporated a load–store architecture, a three-stage pipeline, and just 45 instructions, targeting under 1 W of power (ultimately achieving about 0.1 W) to suit battery-powered and embedded applications while integrating seamlessly with Acorn's existing ecosystem. Named the Acorn RISC Machine (ARM), the initial prototype, the ARM1, was fabricated on a 3 µm process by VLSI Technology, Inc. and powered up on April 26, 1985, after just 18 months of development using rudimentary tools such as BBC BASIC for simulation. The ARM1 featured approximately 25,000 transistors on a compact 7 mm × 7 mm die and operated at a clock speed of 6 MHz, delivering around 4 million instructions per second (MIPS). It served as a proof of concept, tested in internal development boards, and paved the way for production variants. Its architecture debuted commercially in the Acorn Archimedes personal computers launched in 1987, marking Acorn's transition from 8-bit to 32-bit systems and demonstrating the design's efficiency with a performance edge over contemporaries despite the modest clock speed. This foundational work at Acorn ultimately led to the formation of an independent licensing company in 1990.

Formation of ARM Holdings

In late 1990, Acorn spun off its ARM processor technology into a new entity, Advanced RISC Machines Ltd (ARM Ltd), incorporated in Cambridge, England, as a joint venture with Apple Computer and VLSI Technology. Acorn contributed its processor designs and a team of 12 engineers, Apple invested $3 million in cash to secure a significant ownership stake, driven by its need for a low-power processor for the upcoming Newton personal digital assistant, and VLSI provided design tools and fabrication expertise. This structure gave Acorn and Apple each approximately 43% of the shares, with VLSI holding the remaining 14%. The formation marked a pivotal shift from Acorn's in-house development to a fabless business model focused on licensing rather than manufacturing chips, allowing ARM to commercialize the RISC architecture more broadly. Apple's involvement was crucial, as the Newton project (initiated in 1987) required an efficient, battery-friendly CPU that the ARM design uniquely suited, leading Apple to champion the spin-off and fund its early operations. VLSI's role extended to the first external license in 1990, enabling it to produce and integrate ARM-based chips while supporting the venture's goal of targeting embedded applications like portable devices and peripherals. Early partnerships emphasized ARM's strategy of upfront licensing fees combined with royalties on shipped chips, fostering collaborations beyond the founding trio and positioning the company for global adoption in low-power computing. This approach, rooted in the joint venture's inception on November 27, 1990, laid the foundation for ARM's expansion as an IP provider.

Key Milestones in Development

The commercial evolution of the ARM line continued with the introduction of the ARM2 processor in 1987, which added multiply and multiply-accumulate instructions to the original design, enabling more efficient handling of arithmetic operations in embedded systems. This enhancement was crucial for improving performance in early applications like the Acorn Archimedes personal computer, marking ARM's initial foray into commercial computing beyond its Acorn origins. In 1989, the ARM3 processor was released, incorporating an on-chip cache and support for a floating-point unit (FPU) coprocessor, which significantly boosted processing speeds for graphics and scientific computations in workstations. These advancements solidified ARM's reputation for balancing power efficiency with capability, paving the way for broader adoption in battery-constrained devices. The formation of Advanced RISC Machines Ltd. in November 1990, as a joint venture between Acorn Computers, Apple Computer, and VLSI Technology, represented a pivotal shift toward commercial IP licensing and independent development. This entity released the ARM6 processor in 1992, featuring a memory management unit (MMU) and enhanced 32-bit processing, which facilitated virtual memory support and integration into more complex operating systems. A major collaboration emerged in 1996 with Digital Equipment Corporation, resulting in the StrongARM family of processors, which delivered high performance at low power (up to 185 MIPS at 160 MHz) while maintaining full compatibility with the ARMv4 instruction set. This partnership expanded ARM's reach into networking and portable computing, demonstrating the architecture's scalability for demanding applications. To address code density challenges in memory-limited environments, ARM introduced the Thumb instruction set in 1994 as part of the ARMv4T architecture, compressing common 32-bit instructions into 16-bit formats to reduce program size by approximately 30-40% without sacrificing much performance. This innovation proved essential for embedded systems, allowing developers to fit more functionality into constrained ROM spaces.
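The code-density arithmetic behind Thumb can be sanity-checked with a toy model. The numbers below are illustrative assumptions, not measurements: if a fraction of a program's instructions can use 16-bit Thumb encodings while the rest stay 32-bit, the overall size relative to a pure 32-bit build follows directly.

```python
# Toy model of mixed 16/32-bit code density (assumed figures, for
# illustration only; it ignores that Thumb sometimes needs extra
# instructions to express what one 32-bit ARM instruction does).

def thumb_size_ratio(thumb_fraction: float) -> float:
    """Code size with a given Thumb fraction, relative to all-32-bit code."""
    bytes_per_insn = thumb_fraction * 2 + (1 - thumb_fraction) * 4
    return bytes_per_insn / 4

# With ~70% of instructions in 16-bit Thumb form, size drops by ~35%,
# in line with the 30-40% savings cited above.
print(f"{1 - thumb_size_ratio(0.7):.0%} smaller")
```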
In 2002, ARM launched Jazelle technology, an extension enabling direct hardware execution of Java bytecode, which accelerated Java virtual machine (JVM) performance by up to 5-10 times compared to software interpretation alone. By integrating bytecode handling into the processor pipeline, Jazelle optimized resource usage in mobile and embedded Java applications, anticipating the rise of platform-independent software. Key adoptions underscored these technical strides: the architecture powered Apple's Newton personal digital assistant launched in 1993, utilizing the ARM610 processor to enable handwriting recognition and scheduling features in a portable form factor. Texas Instruments licensed ARM cores in 1993, followed by Nokia's adoption for handsets like the 6110 in 1998, which leveraged the ARM7TDMI for efficient baseband processing and helped establish ARM as a standard in mobile phones.

Market Growth and Adoption

The ARM architecture experienced significant commercial expansion in the 2000s, driven by its adoption in mobile phones due to superior power efficiency compared to competing architectures. Licensees such as Qualcomm with its Snapdragon processors and Texas Instruments with OMAP chips integrated ARM cores into high-volume platforms, establishing ARM as the dominant architecture for mobile devices by the mid-2000s. This surge was fueled by the rapid growth of the smartphone market, where ARM's reduced instruction set computing (RISC) design enabled longer battery life and lower costs, leading to a 95% share in mobile phone processors by 2010. By the 2010s, ARM had solidified its dominance in embedded systems, powering devices from consumer electronics to industrial applications, with cumulative shipments of ARM-based chips exceeding 325 billion units as of 2025. The post-2015 Internet of Things (IoT) boom further accelerated this adoption, as ARM's low-power cores like the Cortex-M series became integral to connected sensors, wearables, and smart home devices, contributing to a projected compound annual growth rate of 19% in IoT installations from 2014 to 2020. ARM's revenue model, centered on upfront licensing fees and per-chip royalties, capitalized on this scale, with licensing revenue surging 56% year-over-year to $515 million in the fiscal second quarter of 2026, reflecting sustained demand across mobile and emerging sectors. ARM's penetration extended to new markets in the late 2010s and 2020s, including servers and personal computers. Amazon Web Services introduced the Graviton processor in November 2018, marking ARM's entry into cloud data centers with energy-efficient instances for scale-out workloads. Apple's transition to its own ARM-based chips for Macs, announced in June 2020 and rolled out starting late that year, accelerated ARM's adoption in high-performance PCs, breaking from Intel's x86 dominance. By 2025, ARM powered over 99% of smartphones worldwide and was projected to capture more than 50% of the data center market, underscoring its broad industry penetration.

Licensing Model

Core and IP Licensing

The primary mechanism for accessing ARM processor cores involves licensing pre-configured designs such as the Cortex family, which are delivered as complete intellectual property (IP) blocks including the processor core, associated caches, and interconnect buses like CoreLink. These licenses enable licensees to integrate the IP directly into system-on-chip (SoC) designs, ensuring compatibility with the ARM ecosystem while minimizing development time. Pricing for core licenses typically follows a hybrid model combining upfront fees with per-unit royalties. As reported in the early 2010s, upfront fees for standard Cortex core implementations ranged from approximately $1 million to $10 million, depending on the core's complexity and the licensee's scale, while royalties were generally 1% to 2% of the selling price per shipped chip; current terms are negotiated individually and not publicly disclosed. For example, licensing a high-performance core like the Cortex-A78 incurs these costs to grant access to its synthesizable design for premium mobile applications. ARM supports customization through two main delivery formats: binary-compatible processor implementations, which are fixed, pre-verified designs for rapid integration, and synthesizable register-transfer level (RTL) code, which allows licensees to modify the core for optimization in power, performance, or area while preserving ARM instruction set compatibility. The RTL format, provided in a hardware description language, facilitates architectural extensions and integration into custom SoCs, particularly for integrated device manufacturers (IDMs). By 2025, ARM had over 350 active licenses across its programs, including 44 Arm Total Access licenses (a subscription-based program providing comprehensive access to Arm's IP portfolio) and 314 Arm Flexible Access licenses, enabling a vast array of partners to develop products.
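The hybrid pricing model described above (an upfront fee plus a per-chip royalty) can be sketched as a back-of-envelope calculation. All figures here are illustrative assumptions drawn from the ranges cited, not actual contract terms.

```python
# Back-of-envelope sketch of ARM's hybrid core-license economics:
# total cost = upfront fee + (royalty rate x chip selling price x units).
# The specific numbers below are invented for illustration.

def license_cost(upfront: float, royalty_rate: float,
                 chip_price: float, units: int) -> float:
    """Total licensee cost under an upfront-fee-plus-royalty model."""
    return upfront + royalty_rate * chip_price * units

# e.g. a $5M upfront fee, a 1.5% royalty on a $20 chip, 100M units shipped
total = license_cost(5e6, 0.015, 20.0, 100_000_000)
print(f"total cost: ${total / 1e6:.0f}M")
```

At high volumes the royalty term dominates the upfront fee, which is why the model scales with licensee shipments.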
This licensing approach plays a pivotal role in the fabless semiconductor ecosystem, allowing companies without fabrication facilities, such as Qualcomm and MediaTek, to design and outsource production of ARM-based chips, driving innovation in mobile, automotive, and IoT markets without the need for in-house architecture development.

Architectural and Flexible Access Licenses

The Architectural License, also known as the Architecture License Agreement (ALA), grants licensees full access to Arm's instruction set architecture (ISA) specifications, enabling the design of custom microarchitectures that remain compliant with Arm standards. This license is particularly suited for companies seeking to optimize performance for specific workloads by developing proprietary processor cores, while ensuring broad software ecosystem compatibility across Arm-based devices. Notable adopters include Apple, which utilizes the license for its M-series processors in Macs and other devices; Qualcomm, for custom CPU designs in Snapdragon SoCs; and Amazon Web Services (AWS), for the Graviton processor family powering cloud infrastructure. Key terms of the Architectural License include coverage of major ISA versions such as Armv8-A and Armv9-A, providing detailed technical documentation for instruction sets, extensions, and system architectures without granting exclusive rights; licensees receive non-exclusive permissions to implement and commercialize compliant designs. Royalties are typically assessed per shipped unit, scaled by volume and application, allowing differentiation through tailored implementations like high-efficiency cores for mobile or server environments. This model benefits licensees by fostering innovation beyond off-the-shelf cores, as seen in Apple's performance-optimized M-series for AI and machine-learning tasks, or AWS Graviton's focus on efficiency, which has delivered up to 20% better price-performance in EC2 instances compared to x86 alternatives. Introduced in 2019, the Arm Flexible Access program serves as an entry-level licensing option, offering startups and small-to-medium enterprises upfront, no-cost or low-cost access to a curated portfolio of Arm IP, including processor cores, tools, and training resources, to prototype system-on-chip (SoC) designs.
Under this program, qualifying startups receive $0 entry-tier membership, enabling unlimited evaluation and design iterations without initial fees, with royalties and manufacturing licenses activating only upon tape-out of a production design. It covers select ISA implementations, such as Armv8-A through Cortex-A series cores, Mali GPUs, and CoreLink interconnects, supporting applications from IoT to edge AI. The Flexible Access model's royalty-based scaling, deferred until commercialization, lowers barriers for emerging companies, allowing them to experiment with Arm technology and achieve market differentiation without prohibitive upfront costs. For instance, it has enabled over 60 partners, including first-time Arm IP users, to accelerate SoC development in high-growth areas like IoT and automotive systems, often reducing time-to-market by providing pre-verified components and ecosystem support. Non-exclusive rights ensure broad applicability, with three membership tiers (DesignStart for free basics, Entry at $0 for startups or $80,000 annually, and Standard at $212,000 annually) tailored to project scale. This approach contrasts with traditional core licensing by emphasizing exploratory access, ultimately facilitating custom designs that leverage Arm's ISA for specialized benefits like power efficiency in startup-led innovations.

Evolution of Licensing Programs

In the early 1990s, ARM's licensing model focused on straightforward intellectual property (IP) agreements for its processor designs, marking the company's initial shift toward a fabless, royalty-based business model. The first such licenses were granted in 1991 to GEC Plessey Semiconductors, enabling the production of ARM-based chips for embedded applications. Shortly thereafter, VLSI Technology and Sharp became licensees, with VLSI integrating ARM cores into its semiconductor offerings and Sharp targeting consumer electronics. These early deals, often involving upfront fees and royalties per shipped unit, laid the foundation for ARM's expansion by allowing partners to manufacture chips without developing the core IP from scratch. During the 2000s, ARM evolved its licensing to support broader market segments through the introduction of the Cortex family of processor cores, launched in 2005 to standardize designs across application, real-time, and microcontroller profiles. The Cortex-A series targeted high-performance devices like smartphones, Cortex-R focused on real-time systems such as automotive controllers, and Cortex-M addressed low-power embedded uses, providing licensees with configurable, scalable options under a unified branding. This multi-profile approach simplified licensing for partners, who could select cores tailored to specific needs while benefiting from ARM's ongoing architectural updates, fostering widespread integration in mobile and IoT products. In the 2010s, ARM responded to rising competition from open-source alternatives like RISC-V by launching the Flexible Access program in 2019, which offered low-barrier entry to its IP portfolio without immediate full licensing commitments. This initiative allowed developers to access over 75% of ARM's designs, including Cortex cores and tools, for a nominal annual fee, deferring royalties until production, thereby attracting startups and reducing upfront costs compared to traditional models.
The program directly addressed RISC-V's no-fee appeal by emphasizing ARM's mature ecosystem and performance optimizations, enabling faster prototyping in emerging markets like IoT. The 2020s saw ARM pivot toward AI-centric licensing, incorporating the Scalable Vector Extension (SVE) and its enhancements in Armv9 to support machine-learning workloads on edge devices. SVE, initially developed for high-performance computing, enables vector lengths up to 2048 bits for efficient AI inference and training, with licensing available through core or architectural agreements that integrate these extensions for AI-optimized processors. In 2025, ARM updated its Flexible Access to include edge AI IP bundles, such as the Armv9 platform with Cortex-A320 and Ethos-U85 NPU, providing zero upfront costs for startups to develop on-device AI solutions and compete in the growing edge-AI sector.

Processor Core Families

Cortex-A Profile Cores

The Cortex-A profile cores form the high-performance segment of ARM's processor family, designed primarily for application processors in devices requiring complex software stacks, such as smartphones, tablets, and embedded systems with rich operating systems like Android or Linux. These cores implement the ARMv7-A architecture for 32-bit processing and extend to the 64-bit ARMv8-A and ARMv9-A architectures, emphasizing scalability, memory management, and support for advanced operating systems. Introduced to address the growing demands of mobile and embedded computing, the Cortex-A series balances power efficiency with computational throughput, enabling seamless multitasking and multimedia processing. Representative examples illustrate the evolution of Cortex-A cores across performance tiers and process nodes. The Cortex-A5, announced in 2009 and entering production in 2010, targets low-end applications like feature phones and ultra-low-cost handsets, featuring an in-order 8-stage pipeline and compatibility with the ARMv7-A instruction set for energy-efficient, compact designs. In contrast, the Cortex-A78, unveiled in 2020 and optimized for 5nm process technology, delivers high-end 64-bit performance under ARMv8.2-A, with out-of-order execution, improved branch prediction, and up to 20% higher single-threaded performance compared to its predecessor, the Cortex-A77, while reducing power consumption by approximately 50% at equivalent speeds on advanced nodes. More recently, the Cortex-A320, introduced in 2025 as the first ultra-efficient ARMv9 core, focuses on AI-optimized processing for IoT devices, offering up to 50% better energy efficiency than the Cortex-A520 through a smaller footprint, enhanced AI acceleration via the Scalable Matrix Extension (SME), and support for on-device models without compromising security features like Arm TrustZone. Key architectural features in Cortex-A cores enhance their suitability for demanding workloads.
High-end variants, such as the Cortex-A78 and later models like the Cortex-A720, incorporate out-of-order pipelines with dynamic scheduling, allowing up to triple-issue throughput and speculative execution to minimize stalls, which contributes to sustained performance in multi-threaded environments. The big.LITTLE heterogeneous architecture, widely adopted in Cortex-A implementations, pairs power-hungry "big" cores (e.g., Cortex-A78) with efficient "LITTLE" cores (e.g., Cortex-A55) to dynamically allocate tasks based on workload intensity, achieving up to 75% better energy efficiency in mixed-use scenarios like mobile browsing and gaming by idling high-performance cores during light loads. For instance, the Cortex-A720, part of the ARMv9.2 lineup, delivers approximately 20% better power efficiency compared to the Cortex-A715, enabling premium efficiency in sustained workloads. Cortex-A cores power a diverse range of applications, from consumer devices to enterprise infrastructure. In smartphones, they underpin flagship platforms like Qualcomm's Snapdragon 8 Gen series, where configurations such as the Snapdragon 8 Gen 3 integrate Cortex-X4 prime cores with A720 and A520 clusters for AI-enhanced photography and on-device processing. For personal computers, custom ARM-architecture implementations such as Apple's M4 chip in MacBooks leverage the 64-bit ISA and its extensions for desktop-class productivity and creative workflows, delivering over 50% faster CPU performance than prior Intel-based equivalents in battery-constrained scenarios. In servers, AWS Graviton4 processors, built on Neoverse V2 cores evolved from A-profile principles, utilize Cortex-A-derived scalability to handle cloud workloads, offering up to 30% better price-performance for web services and data analytics compared to previous generations.
In 2025, ARM rebranded its mobile-oriented Cortex-A derivatives as the Lumex platform for smartphones and tablets, emphasizing AI-specific enhancements like SME2 for matrix computations, while PC-focused variants adopted the Niva branding to target laptop and desktop markets with improved single-threaded and vector processing. Under these platforms, ARM introduced the C1 series of CPU cores in 2025, including the flagship C1-Ultra, which supports Armv9.3-A and delivers up to 25% higher performance than prior high-end designs, with advanced on-device AI capabilities.
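The big.LITTLE idea described above, routing demanding tasks to "big" cores and background work to "LITTLE" cores, can be illustrated with a toy placement function. This is not ARM's actual scheduler; the task names, intensity values, and threshold are all invented for the sketch.

```python
# Toy illustration of big.LITTLE-style task placement: tasks above an
# intensity threshold go to the "big" cluster, the rest to "LITTLE".
# All names and numbers are hypothetical.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    intensity: float  # 0.0 (nearly idle) .. 1.0 (fully compute-bound)

def place(tasks: list[Task], threshold: float = 0.5) -> dict[str, str]:
    """Assign each task to a core cluster by workload intensity."""
    return {
        t.name: "big" if t.intensity >= threshold else "LITTLE"
        for t in tasks
    }

tasks = [Task("game_render", 0.9), Task("email_sync", 0.1),
         Task("video_decode", 0.6), Task("sensor_poll", 0.05)]
print(place(tasks))
```

Real implementations (e.g., Linux energy-aware scheduling) migrate tasks dynamically using measured load rather than a fixed threshold, but the energy win comes from the same idea: keep the high-performance cores idle whenever the LITTLE cores suffice.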

Cortex-R and Cortex-M Profile Cores

The Cortex-R profile of ARM cores is tailored for real-time systems that demand predictable, deterministic performance and minimal latency to ensure reliable operation in safety-critical environments. These cores implement the Armv7-R and Armv8-R instruction set architectures, providing features such as tightly coupled memory (TCM) for low-latency access and deterministic interrupt handling to maintain consistent timing in hard real-time applications. Unlike application-oriented profiles, Cortex-R emphasizes fault tolerance and functional safety, often certified to standards like ISO 26262 for automotive use. A representative example is the Cortex-R52, introduced in 2016 as the first Armv8-R implementation in AArch32 mode, which delivers high-performance 32-bit processing with efficient code density and integrated safety mechanisms, including dual-core lockstep operation for fault detection in redundant configurations. The Cortex-R82, announced in 2020, advances this further as the highest-performance Cortex-R core, supporting 64-bit Armv8-R in AArch64 mode with up to 1TB addressable DRAM and enhanced safety features for real-time embedded systems. Cortex-R cores are commonly deployed in automotive electronic control units (ECUs), where their deterministic execution handles time-sensitive tasks like engine management and braking systems. The Cortex-M profile complements the R series by focusing on ultra-low-power microcontrollers for cost-sensitive, deeply embedded applications, spanning Armv6-M to Armv8-M architectures with scalable performance levels from basic control to digital signal processing. These cores prioritize energy efficiency and simplicity, often featuring a Harvard architecture with separate instruction and data buses to optimize power in battery-operated devices. Key to their design is support for event-driven execution through the Nested Vectored Interrupt Controller (NVIC), which enables low-latency response to external events with deterministic interrupt handling.
The Cortex-M0, launched in 2009, exemplifies the profile's origins in ultra-low-power computing, offering a compact 32-bit core with minimal gate count for simple sensor interfaces and control loops. More recent advancements include the Cortex-M85, introduced in 2022, which provides the highest performance in the series via Helium vector processing and integrates TrustZone-M for hardware-enforced security isolation. Cortex-M cores power IoT sensors and wearables, leveraging their event-driven capabilities for responsive, power-efficient operation in connected ecosystems. Cortex-M processors have contributed significantly to the over 250 billion total Arm-based chips shipped as of 2025, dominating the microcontroller market.

Legacy and Custom Cores

The ARM7 family of processor cores, introduced in 1993, became a cornerstone of early mobile computing due to its low power consumption and efficient 32-bit RISC design, making it ubiquitous in feature phones and embedded devices during the late 1990s and early 2000s. A notable implementation, the ARM7TDMI, powered the Nokia 6110, the first phone to incorporate an ARM core, which achieved massive commercial success and established ARM as the flagship architecture for mobile designs. This core's compact three-stage pipeline, combined with debug and multiply extensions, enabled widespread adoption in battery-constrained applications like early cellular handsets. Succeeding the ARM7, the ARM9 cores, released in the late 1990s, enhanced performance through a five-stage pipeline and a Harvard memory organization with separate instruction and data caches, supporting the ARMv4T and later architectures and targeting more demanding embedded systems such as digital multimedia devices. The ARM11 family, introduced around 2002 and prevalent until the late 2000s, further advanced efficiency with an eight-stage pipeline and the introduction of Thumb-2 technology in ARMv6 implementations like the ARM1156T2F-S, which expanded the Thumb instruction set to include 32-bit instructions for improved code density and performance in resource-limited environments. These pre-2009 designs emphasized scalar in-order execution, prioritizing power efficiency over aggressive parallelism, and were licensed for use in millions of devices before the shift to more scalable profiles. In 2005, ARM transitioned from these classical cores to the Cortex family, starting with the Cortex-A8 as the first implementation of the ARMv7-A architecture, marking a move toward standardized, configurable designs for broader application scalability. Despite this evolution, custom core development persisted through ARM's architectural licenses, which grant licensees the freedom to create proprietary implementations compliant with the ARM instruction set architecture (ISA) while optimizing for specific workloads. 
Prominent examples of such custom cores include Apple's A-series and M-series processors, which build on the Armv8 ISA with tailored microarchitectures featuring wider execution units, advanced branch prediction, and integrated high-performance cores to deliver superior single-threaded performance in mobile and desktop systems. Qualcomm's Kryo series represents semi-custom designs, such as the Kryo 280 in the Snapdragon 835, which modifies ARM Cortex cores like the A53 and A73 with custom tweaks to cache hierarchies and pipeline depths for balanced power and throughput in smartphones. Similarly, Samsung's Mongoose cores, which debuted in the Exynos 8890 with an ARMv8 base, incorporated wider decode stages and custom floating-point units to enhance processing in mobile SoCs, though production of these fully custom variants ceased around 2019 in favor of hybrid approaches. These custom implementations often achieve performance gains through targeted enhancements, such as increased instruction issue widths or specialized accelerators, without altering core ISA compatibility.

Instruction Set Architectures

Early Architectures (Armv1 to Armv3)

The ARMv1 architecture, introduced in 1985, marked the debut of the reduced instruction set computing (RISC) design as a 32-bit processor implemented in the ARM1 core. This initial version featured a compact set of 25 instructions focused on essential operations, including data processing (such as ADD, SUB, and MOV), memory access, branches, and software interrupts, without support for multiplication or coprocessor interfaces. The design emphasized simplicity and efficiency, with a 3-stage pipeline consisting of fetch, decode, and execute stages to enable single-cycle instruction execution in most cases. It utilized 16 general-purpose 32-bit registers labeled R0 through R15, where R15 functioned as the program counter, and carried processor status flags for negative (N), zero (Z), carry (C), and overflow (V) alongside the program counter in R15, though it lacked a Saved Program Status Register (SPSR) and advanced memory management. The architecture supported a 26-bit address space (64 MB) and operated in four processor modes: User, FIQ (Fast Interrupt), IRQ (Interrupt), and Supervisor, prioritizing low power and high performance per watt for embedded applications. Building on ARMv1, the ARMv2 architecture emerged in 1986 (with refinements continuing into 1987) and introduced key enhancements to expand functionality while maintaining backward compatibility, primarily implemented in the ARM2 and later ARM3 cores. Notable additions included multiply instructions (MUL for single-word multiplication and MLA for multiply-accumulate) and the swap instruction (SWP/SWPB) for atomic memory operations, increasing the instruction count to approximately 30-40 and enabling more efficient handling of arithmetic-intensive tasks. Coprocessor support was also integrated, allowing external units to handle tasks like floating-point operations via instructions such as MCR and MRC for data transfer. 
The 3-stage pipeline remained central, now with improved interrupt handling through banked registers in FIQ mode for faster context switching, and the register bank expanded slightly with additional mode-specific registers for preserving state during exceptions. The address space stayed at 26 bits, and the architecture continued to support the same four modes, but with better optimization for real-time systems, as seen in its use in the Acorn Archimedes computer released in 1987. These changes solidified ARMv2 as a more versatile foundation for commercial processors, balancing simplicity with expanded capabilities. The ARMv3 architecture, released around 1990 and reaching notable implementations by 1993, further refined the series with a shift to a full 32-bit address space (4 GB) and enhanced support for protected memory, implemented in cores like the ARM6 and early ARM7 family. It built on prior versions by improving the multiplier with long multiply instructions (such as UMULL for unsigned long multiply and UMLAL for unsigned long multiply-accumulate), alongside signed variants, which proved crucial for signal processing and cryptography applications. Coprocessor support was deepened with better integration for memory management units (MMUs), and new instructions like MRS (Move to Register from Status) and MSR (Move to Status from Register) allowed direct access to the newly separated CPSR and SPSR for mode switching and flag manipulation. The instruction set grew to about 40-50 entries, incorporating enhanced load/store operations (e.g., signed and unsigned byte/halfword loads) and six processor modes—User, FIQ, IRQ, Supervisor, Undefined, and Abort (for data and prefetch aborts)—for robust exception handling. Retaining the 3-stage pipeline, ARMv3 optimized it for higher clock speeds and added features like a 4 KB instruction cache in some implementations, as exemplified by the ARM6 core. 
This version gained prominence in desktop systems, notably powering the Acorn RISC PC released in 1994, which demonstrated its viability for multitasking environments with MMU-enabled operating systems like RISC OS. Across ARMv1 to ARMv3, core concepts emphasized a uniform 3-stage pipeline for streamlined execution, a bank of 16 visible 32-bit registers (R0-R15) with mode-specific banking for efficiency, and a load/store model that separated computation from memory access to reduce complexity and power consumption. These early architectures laid the groundwork for ARM's dominance in low-power embedded computing by prioritizing orthogonal instructions and conditional execution on nearly all operations, enabling compact code with fewer branches.
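The 26-bit address space and combined PC/status word of these early cores can be illustrated with a short sketch. The bit layout below follows the commonly documented 26-bit ARM convention, with condition flags and mode bits packed around a word-aligned program counter in R15; the helper names are illustrative, not Arm definitions.

```python
# Sketch of the combined PC/PSR word used by 26-bit ARM (ARMv1/ARMv2), where
# R15 held the program counter together with status bits.
# Layout (classic 26-bit convention): bits 31-28 = N,Z,C,V flags;
# bits 27-26 = I,F interrupt masks; bits 25-2 = word-aligned PC; bits 1-0 = mode.

ADDRESS_SPACE = 1 << 26          # 26-bit addressing -> 64 MB

def pack_r15(pc, n=0, z=0, c=0, v=0, i=0, f=0, mode=0):
    """Pack a word-aligned 26-bit PC and status bits into one R15 value."""
    assert pc % 4 == 0 and pc < ADDRESS_SPACE
    flags = (n << 31) | (z << 30) | (c << 29) | (v << 28)
    masks = (i << 27) | (f << 26)
    return flags | masks | pc | (mode & 0b11)

def unpack_pc(r15):
    """Recover the program counter from a packed R15 value."""
    return r15 & 0x03FFFFFC

r15 = pack_r15(0x8000, z=1, mode=0b11)   # Supervisor mode, Z flag set
print(hex(unpack_pc(r15)))               # -> 0x8000
print(ADDRESS_SPACE // (1024 * 1024))    # -> 64 (MB)
```

The move to ARMv3's full 32-bit address space eliminated exactly this packing, which is why the CPSR and SPSR became separate registers.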

32-Bit Architectures (Armv4 to Armv7)

The 32-bit ARM architectures from Armv4 to Armv7 represent a period of significant evolution in the instruction set architecture (ISA), focusing on code density, performance enhancements for embedded and mobile applications, and support for diverse processor profiles. These versions built upon the foundational load/store RISC design of earlier architectures, emphasizing low power consumption and scalability for mobile and embedded systems. Key shared features include a set of 16 general-purpose 32-bit registers (R0–R15, where R13 serves as the stack pointer, R14 as the link register, and R15 as the program counter) and extensive conditional execution capabilities, allowing nearly all instructions to be predicated on the application program status register (APSR) flags without branching, which reduces code size and improves branch prediction efficiency in pipelines. Pipeline implementations varied by core, ranging from simple 3-stage designs in early Armv4 processors to deeper 8–13 stage superscalar pipelines in Armv7 for higher performance, enabling higher clock rates and better instruction throughput while maintaining compatibility. Armv4, released in 1996, marked the introduction of the Thumb instruction set in its Armv4T variant, providing 16-bit compressed instructions that offered up to 30–40% better code density compared to the standard 32-bit instructions, ideal for memory-constrained embedded devices. This version was prominently implemented in the ARM7TDMI core, a 3-stage pipelined processor widely used in early mobile phones and PDAs due to its balance of performance and low power. Thumb mode allowed seamless interworking with the full ARM instruction set via branch-and-exchange instructions like BX, while retaining the core's load/store model and conditional execution for efficient control flow. Alignment requirements were strict, mandating natural boundaries for word and halfword accesses to avoid faults. 
Released in 2001, Armv5 enhanced multimedia and signal processing capabilities through its Armv5TE extension, adding DSP-oriented instructions such as enhanced multiply-accumulate operations (e.g., SMULxy for 16-bit signed multiplies) and saturated arithmetic to support fixed-point algorithms with up to 2x performance gains in audio and video processing. The Armv5TEJ variant introduced Jazelle, a hardware extension for Java bytecode execution that directly interpreted common bytecodes, reducing software overhead for Java-enabled devices like early smartphones and set-top boxes by executing up to 80% of bytecodes natively. Additional features included dual-load/store instructions (LDRD/STRD) for 64-bit transfers and improved Thumb-ARM interworking with BLX, all while preserving the 16-register model and conditional predicates for code density. Armv6, introduced in 2004, further optimized for media-rich applications with SIMD extensions for parallel 8/16-bit operations on multimedia data, enabling efficient video decoding and image processing in cores like the ARM11 family. It added support for unaligned memory accesses in load/store instructions (LDR/STR), configurable via system control registers, which eliminated penalties for non-aligned data common in packed structures and improved performance by up to 20% in data-intensive tasks without requiring software alignment fixes. The architecture also integrated the Vector Floating Point (VFP) unit as an optional coprocessor for single- and double-precision floating-point operations with SIMD capabilities, supporting media workloads in devices like digital cameras and portable media players. Multi-processor primitives, such as exclusive load/store pairs (LDREX/STREX), were introduced to facilitate scalable shared-memory systems. 
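The saturating and halfword-multiply behaviour added by Armv5TE can be sketched in a few lines. QADD and SMULBB are real instruction names, but the Python models below are simplified illustrations of their arithmetic, not cycle-accurate semantics.

```python
# Sketch of Armv5TE-style saturating arithmetic: instructions like QADD clamp
# results to the signed 32-bit range instead of wrapping, which keeps
# fixed-point audio samples from overflowing into garbage values.

INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

def qadd(a, b):
    """Saturating signed 32-bit add, modelling the QADD instruction."""
    return max(INT32_MIN, min(INT32_MAX, a + b))

def smulbb(a, b):
    """16x16 -> 32 signed multiply of the bottom halfwords (SMULBB-style)."""
    lo_a = ((a & 0xFFFF) ^ 0x8000) - 0x8000   # sign-extend low 16 bits
    lo_b = ((b & 0xFFFF) ^ 0x8000) - 0x8000
    return lo_a * lo_b

# A wrapping add would overflow to a large negative number; saturation clamps.
print(qadd(INT32_MAX, 100))        # -> 2147483647
print(smulbb(0x0003, 0xFFFE))      # 3 * -2 -> -6
```

Clamping rather than wrapping is what makes these instructions safe for DSP filter accumulators, where a momentary overflow should distort, not corrupt, the signal.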
The Armv7 architecture, launched in 2007, consolidated advancements into three profiles—A for applications (e.g., smartphones with MMU support), R for real-time (e.g., automotive with tightly coupled memories), and M for microcontrollers (e.g., low-power IoT)—each tailored to market needs while sharing the core ISA. Thumb-2 emerged as a major enhancement, mixing 16- and 32-bit instructions for near-ARM performance with Thumb density, including conditional branches and table branches for better loop handling and up to 30% code size reduction. Advanced SIMD was boosted via the NEON extension, a 128-bit vector unit supporting integer and floating-point operations for multimedia acceleration, delivering 4x–8x speedup in tasks like video encoding on Cortex-A8 cores. Hardware virtualization support via the Virtualization Extensions (VE) enabled secure modes with stage-2 address translation, facilitating isolated execution environments in Armv7-A profiles. These features, combined with ThumbEE support for dynamic code generation and deeper pipelines in cores such as the Cortex-A8, positioned Armv7 as the foundation for modern mobile computing.

64-Bit Architectures (Armv8 and Armv9)

The Armv8 architecture, introduced in 2011, marked the transition to 64-bit computing within the Arm family by introducing the AArch64 execution state alongside the legacy AArch32 state for backward compatibility. AArch64 features 31 general-purpose 64-bit registers named X0 through X30, enabling larger address spaces and enhanced integer arithmetic compared to the 32-bit registers of prior architectures. This architecture spans multiple profiles: the A-profile for high-performance applications, the R-profile for real-time systems, and the M-profile for microcontrollers, each tailored to specific use cases, though only the A- and R-profiles offer the 64-bit execution state. For memory addressing in AArch32 mode, Armv8 incorporates the Large Physical Address Extension (LPAE), which expands physical addressing to 40 bits, allowing up to 1 terabyte of addressable memory beyond the traditional 32-bit limit. Backward compatibility with AArch32 ensures that existing 32-bit Arm software can run without modification by switching execution states, facilitating a gradual migration to 64-bit operations. Subsequent refinements to Armv8, starting with Armv8.1 and continuing through later versions, introduced specialized extensions to enhance reliability and computational efficiency. The Reliability, Availability, and Serviceability (RAS) extensions, mandatory from Armv8.2, provide mechanisms for error detection, reporting, and recovery, such as error record registers and memory poisoning support, improving system robustness in server and embedded environments. Additionally, the Armv8.4 dot-product instructions enable efficient vectorized accumulation of 8-bit integer multiplications into 32-bit results, accelerating workloads like neural network inference by optimizing matrix operations. The Armv9 architecture, announced in 2021, builds on Armv8 by integrating advanced vector processing and security features to address emerging demands in AI and data protection. 
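The Armv8.4 dot-product operation mentioned above can be modelled as follows. The function name and the flat-list representation of a 128-bit vector are illustrative conventions, not Arm definitions; only the lane arithmetic mirrors the instruction.

```python
# Sketch of SDOT semantics on a 128-bit vector: each of the four 32-bit
# accumulator lanes gains the dot product of four signed 8-bit elements
# taken from the corresponding lane of each source vector.

def sdot_128(acc, a, b):
    """acc: four 32-bit lanes; a, b: sixteen signed 8-bit elements each.
    Lane i accumulates a[4i..4i+3] . b[4i..4i+3]."""
    assert len(a) == len(b) == 16 and len(acc) == 4
    out = list(acc)
    for lane in range(4):
        out[lane] += sum(a[4 * lane + j] * b[4 * lane + j] for j in range(4))
        # wrap to signed 32-bit, as the hardware register would
        out[lane] = ((out[lane] + 2**31) % 2**32) - 2**31
    return out

a = [1, 2, 3, 4] * 4
b = [1, -1, 1, -1] * 4
print(sdot_128([0, 0, 0, 0], a, b))   # -> [-2, -2, -2, -2]
```

Folding four 8-bit multiplies into each 32-bit lane per instruction is what makes these operations effective for quantized neural-network matrix kernels.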
Central to Armv9 is the Scalable Vector Extension version 2 (SVE2), a superset of the original SVE that supports vector lengths from 128 to 2048 bits in increments of 128 bits, enabling scalable SIMD operations for high-performance computing and machine learning across diverse hardware implementations. SVE2 incorporates functionality from Advanced SIMD (Neon) while adding instructions for digital signal processing and gather-scatter memory access, promoting code portability without vector-length-specific optimizations. For security, Armv9 introduces the Memory Tagging Extension (MTE), which assigns 4-bit tags to memory allocations and pointers, enabling hardware-enforced checks to detect spatial memory errors like buffer overflows at runtime. Complementing MTE is the Confidential Compute Architecture (CCA), a framework for secure enclaves that isolates sensitive workloads from privileged software, including the hypervisor and OS, using realms and attestation for confidential computing scenarios. In 2025, the Armv9.7-A extension further advances A-profile capabilities for AI-driven systems, adding new instructions to SVE and the Scalable Matrix Extension (SME) for handling 6-bit data types in formats like OCP MXFP6, which optimize memory usage and bandwidth for efficient AI model execution. These enhancements, released in October 2025, also include scalability improvements such as targeted TLB invalidations for multi-chip configurations and expanded resource partitioning in MPAMv2, supporting larger-scale AI deployments without compromising performance.
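The vector-length-agnostic style that SVE and SVE2 encourage can be sketched in Python. Here `whilelt` loosely mirrors the WHILELT predicate-generating instruction; the rest of the naming and the list-based "vector" are illustrative conventions.

```python
# Sketch of an SVE-style vector-length-agnostic loop: the same code produces
# the same result for any hardware vector length, because a predicate masks
# off lanes that run past the end of the array.

def whilelt(i, n, vl):
    """Predicate: lane j is active while i + j < n (models WHILELT)."""
    return [i + j < n for j in range(vl)]

def vla_add(dst, src, n, vl):
    """dst[k] += src[k] for k < n, processed vl elements per iteration."""
    i = 0
    while i < n:
        pred = whilelt(i, n, vl)
        for j, active in enumerate(pred):
            if active:                    # only predicated lanes execute
                dst[i + j] += src[i + j]
        i += vl
    return dst

# Identical results whether the "hardware" vector holds 4 or 8 elements.
print(vla_add(list(range(10)), [1] * 10, 10, vl=4))
print(vla_add(list(range(10)), [1] * 10, 10, vl=8))
# both -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Because the predicate, not the loop bound, handles the ragged tail, the binary never needs to know the vector length it will run on, which is the portability property described above.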

Architectural Features and Extensions

Instruction Set Modes and Enhancements

The ARM architecture supports multiple execution modes to manage privilege levels and handle exceptions, evolving from the 32-bit ARMv7 designs to the 64-bit exception-level model of Armv8 and later. In ARMv7-A and ARMv7-R profiles, there are seven processor modes: User (USR), which is unprivileged and used for application execution; Supervisor (SVC), a privileged mode for operating system tasks; Interrupt Request (IRQ) for general interrupts; Fast Interrupt Request (FIQ) for low-latency interrupts with dedicated registers; Abort for memory access errors; Undefined for unimplemented instructions; and System (SYS), a privileged mode for non-exception kernel code. These modes determine access to registers and resources, with privileged modes (all except USR) enabling system control operations. In Armv8-A and Armv9-A, the model shifts to four exception levels (EL0 to EL3) for finer privilege separation: EL0 is unprivileged, akin to User mode for applications; EL1 is privileged for OS kernels, similar to Supervisor; EL2 supports hypervisors; and EL3 handles secure monitoring and TrustZone. Exceptions taken to higher levels increase privilege, with EL3 being the highest for secure state management. A key efficiency feature in the ARM instruction set is conditional execution, allowing most instructions to be predicated on the Application Program Status Register (APSR) flags without branching, thereby reducing pipeline stalls and improving performance in control-flow-intensive code. There are 16 condition codes, including EQ (equal), NE (not equal), CS/HS (carry set/unsigned higher or same), CC/LO (carry clear/unsigned lower), MI (minus/negative), PL (plus/positive or zero), VS (overflow), VC (no overflow), HI (unsigned higher), LS (unsigned lower or same), GE (signed greater or equal), LT (signed less than), GT (signed greater than), LE (signed less than or equal), AL (always), and NV (never). 
In AArch32 (32-bit execution state), ARM instructions carry a four-bit condition field; in Thumb-2, the IT (If-Then) instruction enables up to four conditional instructions following a condition check. This mechanism minimizes branch instructions, which can account for significant overhead in embedded and mobile applications. To enhance code density, ARM introduced the Thumb instruction set in Armv4T, compressing common 32-bit ARM instructions into 16-bit encodings, followed by Thumb-2 in Armv6T2 and Armv7, which mixes 16-bit and 32-bit instructions for broader functionality while maintaining compactness. Thumb-2 achieves up to 40% smaller code size compared to pure ARM instructions, improving cache efficiency and reducing memory footprint in resource-constrained systems like mobiles and embedded devices. ThumbEE, an extension in Armv7-A, modifies Thumb-2 for dynamic code generation, such as just-in-time (JIT) compilation, by altering load/store behaviors and adding instructions like BLX(2) for better branch prediction in runtime-optimized environments. The architecture integrates coprocessors (CP0 to CP15) for specialized tasks, with instructions like MCR and MRC facilitating data transfer and control between the ARM core and these units. CP15 serves as the system control coprocessor, managing cache, MMU, and privilege configurations via registers accessed in privileged modes. DBX, introduced in Armv5TEJ, enables direct execution of Java bytecode in a dedicated state (Jazelle mode), bypassing software interpretation for faster performance, with variable-length instructions aligned to bytes and a fallback to software handlers for complex bytecodes.
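The interplay of flags and condition codes described above can be modelled with a small sketch. The flag formulas follow standard CMP (subtract-and-discard) semantics; the dictionary covers only a handful of the 16 condition codes, and all names beyond the mnemonics are illustrative.

```python
# Sketch of ARM conditional execution: a comparison sets the NZCV flags, and
# a condition code then decides whether a predicated instruction executes at
# all, with no branch taken.

def cmp_flags(a, b):
    """NZCV flags as set by CMP, which computes a - b and discards the result."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    r = (a - b) & 0xFFFFFFFF
    n = (r >> 31) & 1                      # sign of the 32-bit result
    z = 1 if r == 0 else 0
    c = 1 if a >= b else 0                 # C set means "no borrow"
    v = ((a ^ b) & (a ^ r)) >> 31          # signed overflow on subtraction
    return {"N": n, "Z": z, "C": c, "V": v}

COND = {
    "EQ": lambda f: f["Z"] == 1,
    "NE": lambda f: f["Z"] == 0,
    "HI": lambda f: f["C"] == 1 and f["Z"] == 0,   # unsigned higher
    "GE": lambda f: f["N"] == f["V"],              # signed greater or equal
    "LT": lambda f: f["N"] != f["V"],              # signed less than
    "AL": lambda f: True,
}

# MOVGE-style predicated move: executes only if the GE condition holds.
flags = cmp_flags(7, 3)
result = 42 if COND["GE"](flags) else 0
print(result)                          # -> 42, since 7 >= 3 (signed)
print(COND["LT"](cmp_flags(-1, 5)))    # -> True: signed -1 < 5
```

Note how GE/LT combine the N and V flags so that signed comparisons remain correct even when the subtraction overflows, which is exactly why the architecture distinguishes them from the unsigned HI/LS pair.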

SIMD, DSP, and Multimedia Extensions

The ARM architecture incorporates several extensions to enhance single instruction, multiple data (SIMD) processing, digital signal processing (DSP), and multimedia workloads, enabling efficient parallel operations on vectors of data elements. These extensions build upon the base instruction set to accelerate tasks such as audio/video encoding, image processing, and machine learning inference, particularly in resource-constrained environments like mobile and embedded systems. The Vector Floating Point (VFP) extension, introduced in Armv5 and further developed in subsequent versions including Armv7, provides dedicated hardware for single-precision and double-precision floating-point operations, supporting up to 32 64-bit registers for scalar and vector computations. It enables fused multiply-add operations and conversions between integer and floating-point formats, which are essential for algorithms requiring precise numerical handling. VFP is integrated with the Advanced SIMD unit in later implementations, allowing seamless switching between integer and floating-point modes without pipeline stalls. Advanced SIMD, known as NEON and available from ARMv7 onward, introduces 128-bit vector registers that support operations on 8-bit, 16-bit, 32-bit, and 64-bit elements, including arithmetic, logical, and permutation instructions. NEON includes fused multiply-accumulate (MAC) instructions tailored for DSP tasks, such as filtering in audio processing, and is widely used for multimedia acceleration, including video decoding where it can process multiple pixels or coefficients in parallel to achieve up to several times the performance of scalar code. For instance, NEON's load/store instructions with structure handling optimize data movement for codecs like H.264, reducing memory bandwidth demands in real-time applications. 
For the M-profile cores targeting embedded and IoT applications, the Helium technology—formally the M-Profile Vector Extension (MVE) in ARMv8.1-M—delivers SIMD and DSP capabilities with up to 128-bit vectors, supporting integer, fixed-point, and single-precision floating-point operations on 8- to 32-bit elements. Helium includes tail-predication and fault-handling mechanisms to manage variable-length vectors efficiently, making it suitable for workloads like machine learning inference on low-power devices, where it can provide up to 15 times the performance uplift over scalar implementations for certain DSP functions. Its compact instruction encoding ensures minimal code size increase, ideal for resource-limited IoT systems. The Scalable Vector Extension (SVE) in ARMv8-A and its enhancement SVE2 in ARMv9-A introduce vector lengths ranging from 128 to 2048 bits, allowing hardware-agnostic code that scales across implementations without recompilation. SVE supports gather-scatter memory accesses for non-contiguous data patterns common in sparse computations, along with first-faulting predication to handle irregular loops efficiently, which is crucial for high-performance computing and AI training. SVE2 expands this with additional integer and fixed-point instructions, bridging gaps for broader DSP and machine learning use cases beyond the floating-point emphasis of SVE. In 2025, optimizations in machine learning frameworks leverage SVE2 for enhanced AI performance on ARMv9 cores, including kernel fusions that exploit scalable vectors for up to 2.5 times faster inference on transformer-based models (e.g., BERT, Llama) compared to fixed-width SIMD. The Scalable Matrix Extension (SME), introduced in Armv9.2-A, enhances matrix processing capabilities with scalable tiles up to 256x256 elements, accelerating AI training and inference workloads by providing dedicated hardware for outer-product operations on integers and floating-point data. 
SME, along with its enhancement SME2, supports a wide range of data types including bfloat16 and int8, enabling efficient computations in high-performance servers and AI accelerators.
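The outer-product formulation that SME accelerates can be illustrated in plain Python. Decomposing a matrix multiply into rank-1 updates is standard linear algebra; the function names are illustrative rather than Arm mnemonics (the comparable SME instructions are the MOPA family).

```python
# Sketch of the outer-product accumulate at the heart of SME: a tile of
# accumulators gains the outer product of two vectors, so a matrix multiply
# becomes a series of rank-1 updates.

def outer_product_acc(za, x, y):
    """za[i][j] += x[i] * y[j], loosely modelling an SME MOPA-style update."""
    for i in range(len(x)):
        for j in range(len(y)):
            za[i][j] += x[i] * y[j]
    return za

def matmul_by_outer_products(a_cols, b_rows, n):
    """C = A @ B computed as a sum of outer products of A's columns with
    B's rows -- the decomposition SME hardware accelerates."""
    za = [[0] * n for _ in range(n)]
    for x, y in zip(a_cols, b_rows):
        outer_product_acc(za, x, y)
    return za

# 2x2 example: A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]
a_cols = [[1, 3], [2, 4]]      # columns of A
b_rows = [[5, 6], [7, 8]]      # rows of B
print(matmul_by_outer_products(a_cols, b_rows, 2))  # -> [[19, 22], [43, 50]]
```

Each rank-1 update touches every accumulator in the tile once, which is why a dedicated tile register and a single outer-product instruction per step map so well onto dense AI matrix kernels.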

Security and Virtualization Features

The ARM architecture family incorporates hardware-based security and virtualization features to enable secure execution environments, isolation of sensitive operations, and protection against common software vulnerabilities. These mechanisms are integral to supporting trusted execution in diverse applications, from embedded devices to servers, by partitioning system resources and enforcing access controls at the hardware level. Key features include TrustZone for runtime isolation and extensions like Pointer Authentication and Memory Tagging for mitigating exploits. TrustZone, introduced in Armv6 and available in subsequent architectures, partitions the system into Secure and Normal worlds, allowing secure software to access both while restricting normal world access to secure resources. This enables dual-OS support, where a rich OS runs in the normal world and a trusted OS or secure applications operate in the secure world, often augmented by dedicated crypto accelerators for operations like encryption and key management. The hardware enforces isolation through a non-secure (NS) bit in memory addresses and peripherals, preventing unauthorized access and protecting against software attacks. For microcontroller units (MCUs), Armv8-M introduces a lightweight variant of TrustZone tailored for resource-constrained embedded systems. This extension provides secure and non-secure memory partitioning without the overhead of a full secure monitor, using signal-based transitions between security states and separate handling for each world. It supports multiple secure function entry points, enabling fine-grained protection for IoT devices while maintaining low power consumption. Virtualization support begins with the Virtualization Extensions (VE) in Armv7, which introduce a hypervisor mode (Hyp mode in AArch32) for managing guest operating systems. 
In Armv8 and later, this evolves into Exception Level 2 (EL2) in AArch64, allowing hypervisors to oversee multiple virtual machines through stage-2 address translation, which applies additional memory mappings on top of guest-level stage-1 translations. This enables efficient isolation of virtualized workloads, with EL2 handling traps and context switches to prevent guest interference. Secure virtualization in Armv8.4 further extends EL2 to the secure world, supporting nested isolation for trusted payloads. The Armv8.3 extension adds Pointer Authentication Codes (PAC), which embed cryptographic signatures into pointer values to detect and prevent manipulation in return-oriented programming (ROP) and jump-oriented programming (JOP) attacks. PAC uses dedicated keys stored in system registers, with instructions like PACIA signing instruction addresses and AUTIA verifying them before use, providing low-overhead protection without altering the ABI. This feature is introduced in Armv8.3-A and extends to Armv9. The Armv8.5-A extension introduces the Memory Tagging Extension (MTE), which is included in Armv9, to address issues like buffer overflows and use-after-free errors, classes of bugs that contribute to around 70% of serious security vulnerabilities. MTE assigns 4-bit tags to 16-byte granules, checked on every load/store against a pointer's allocation tag; mismatches trigger faults, enabling proactive detection with minimal performance impact through asynchronous checking modes. The Realm Management Extension (RME) in Armv9-A enhances confidential computing by introducing Realms as isolated execution environments beyond the Secure and Normal worlds, managed by a root of trust with attestation tokens for dynamic verification. RME adds two new security states and corresponding exception levels in the Realm state, with additional granule protection checks that allow hypervisor oversight of Realms without exposing their data, thus enabling secure multi-tenant cloud workloads.
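MTE's granule-tagging scheme as described above lends itself to a compact model. The 4-bit tags and 16-byte granules match the architecture; the class, the tag-in-top-byte placement, and the use of Python's `MemoryError` are illustrative simplifications.

```python
# Sketch of Memory Tagging Extension behaviour: allocations colour 16-byte
# granules with a 4-bit tag, the same tag rides in unused high bits of the
# pointer, and a load whose pointer tag mismatches the granule tag faults.

GRANULE = 16
TAG_SHIFT = 56                     # tags modelled in the pointer's top byte

class TaggedMemory:
    def __init__(self):
        self.granule_tags = {}     # granule index -> 4-bit tag

    def allocate(self, addr, size, tag):
        """Colour every granule of an allocation; return a tagged pointer."""
        for g in range(addr // GRANULE, (addr + size - 1) // GRANULE + 1):
            self.granule_tags[g] = tag
        return addr | (tag << TAG_SHIFT)

    def load(self, tagged_ptr):
        """Check the pointer tag against the granule tag before access."""
        addr = tagged_ptr & ((1 << TAG_SHIFT) - 1)
        tag = (tagged_ptr >> TAG_SHIFT) & 0xF
        if self.granule_tags.get(addr // GRANULE) != tag:
            raise MemoryError("tag mismatch: possible overflow or use-after-free")
        return addr                # a real load would return the data

mem = TaggedMemory()
p = mem.allocate(0x1000, 32, tag=0x5)
mem.load(p)                        # in-bounds access: tags match
try:
    mem.load(p + 32)               # one granule past the allocation
except MemoryError as e:
    print(e)
```

Retagging a granule on free would likewise make a stale pointer's tag mismatch, which is how the same mechanism catches use-after-free as well as overflows.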

Applications and Ecosystems

Embedded and Real-Time Systems

The ARM architecture family has established a dominant position in embedded and real-time systems, particularly through its Cortex-M and Cortex-R processor profiles, which prioritize low power consumption, deterministic performance, and reliability in resource-constrained environments. Cortex-M cores, optimized for microcontrollers (MCUs), power a wide array of devices from simple sensors to complex control units, enabling efficient operation in battery-powered or energy-limited scenarios. Meanwhile, Cortex-R cores target applications requiring predictable real-time responses, such as those in industrial automation and automotive systems. Cortex-M processors hold a leading market share in the embedded MCU sector, capturing approximately 69% in 2024 and projected to maintain around 70% through 2025, driven by their scalability, power efficiency, and ecosystem support. Prominent examples include STMicroelectronics' STM32 series, which leverages Cortex-M cores for versatile embedded applications like consumer devices and industrial controls, and NXP's i.MX RT crossover MCUs, featuring Cortex-M7 and Cortex-M4 cores for high-performance real-time processing in motor control and human-machine interfaces. These implementations highlight the M-profile's scalability, supporting everything from basic 8-bit replacements to advanced 32-bit tasks without compromising on low-power attributes. In real-time systems, Cortex-R processors excel in environments demanding low-latency and fault-tolerant operation, commonly deployed in storage controllers for solid-state drives and in printers for precise timing in print mechanisms. Safety-critical certifications further bolster their adoption; for instance, cores like Cortex-R52 and Cortex-R5 have achieved compliance up to ASIL D, facilitating use in automotive and industrial systems where functional safety is paramount. The proliferation of Internet of Things (IoT) devices underscores ARM's impact, with over 21 billion connected endpoints globally as of 2025, many powered by Cortex-M for their energy-efficient design. 
These cores incorporate low-power modes, such as sleep and deep sleep states entered via wait-for-interrupt (WFI) instructions, allowing devices to enter ultra-low consumption phases while maintaining rapid wake-up for event-driven tasks. The Armv8-M architecture enhances security in IoT deployments through TrustZone technology, partitioning resources into secure and non-secure worlds to protect sensitive code and data from unauthorized access, thereby addressing vulnerabilities in connected ecosystems. Complementing this, ARM supports energy harvesting integrations, where Cortex-M-based systems draw power from ambient sources like vibrations or light, extending operational life in remote or battery-free applications through efficient power management circuits.

Mobile, Desktop, and Server Deployments

The ARM architecture dominates the mobile landscape, powering over 99% of smartphones worldwide as of 2024, a position it has maintained through custom implementations by major vendors. Apple's A-series and M-series processors, based on ARM's A-profile, drive iPhones, iPads, and Macs with integrated neural processing units for AI tasks, while Qualcomm's Snapdragon series, licensed from ARM, supports the majority of Android smartphones, emphasizing high-performance cores for gaming and multimedia. This near-universal adoption stems from ARM's energy-efficient design, which balances battery life and performance in power-constrained environments. A key innovation in mobile deployments is ARM's big.LITTLE technology, which integrates high-performance "big" cores for demanding tasks like video rendering with energy-efficient "LITTLE" cores for background operations, enabling dynamic workload allocation to optimize power consumption without sacrificing responsiveness. Widely implemented in Snapdragon and other SoCs, big.LITTLE has become foundational for power management in smartphones, allowing devices to handle AI inference and multimedia processing efficiently. In desktop and PC markets, ARM-based systems are experiencing growth, particularly through Windows on ARM initiatives, reaching approximately 14% market share in early 2025, with ongoing growth driven by AI-capable hardware. Microsoft's Copilot+ PCs, launched in 2024 and expanded in 2025, leverage Qualcomm's Snapdragon X Elite processors—featuring custom Oryon cores derived from Nuvia designs—to deliver native ARM performance for productivity and AI workloads, marking a shift from traditional x86 dominance in Windows ecosystems. Recent Armv9 adoption has accelerated in these AI PCs. These deployments highlight ARM's scalability to higher-power scenarios, offering improved battery life in laptops compared to Intel counterparts. 
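The big.LITTLE scheduling idea can be caricatured with a toy policy. Every number below (power figures, migration threshold, task loads) is hypothetical and chosen only to show how migrating heavy tasks to big cores while parking light ones on LITTLE cores saves energy.

```python
# Toy sketch of big.LITTLE workload allocation: a scheduler sends work to big
# cores only when demand is high, keeping light tasks on efficient LITTLE
# cores. All constants here are illustrative, not real silicon figures.

BIG_POWER_MW, LITTLE_POWER_MW = 2000, 300      # hypothetical active power
MIGRATE_THRESHOLD = 0.6                        # hypothetical load cutoff

def assign_core(load):
    """Pick a core type from a task's normalized CPU demand (0.0-1.0)."""
    return "big" if load > MIGRATE_THRESHOLD else "LITTLE"

def energy_mj(tasks, ms_per_task=10):
    """Estimated energy (mJ) for running each task for ms_per_task ms."""
    total = 0
    for load in tasks:
        power = BIG_POWER_MW if assign_core(load) == "big" else LITTLE_POWER_MW
        total += power * ms_per_task // 1000   # mW * ms / 1000 = mJ
    return total

tasks = [0.1, 0.2, 0.9, 0.15]                  # one heavy task, three light
print([assign_core(t) for t in tasks])         # -> ['LITTLE', 'LITTLE', 'big', 'LITTLE']
print(energy_mj(tasks))                        # -> 29, versus 80 if all ran big
```

Real implementations hook this decision into the OS scheduler with per-core performance states, but the energy argument is the same: only the one demanding task pays the big-core power cost.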
ARM's expansion into servers focuses on cloud and data center applications, where processors like Amazon's Graviton and Ampere's Altra provide alternatives to x86 for cost-sensitive, high-density computing. As of mid-2025, ARM-based servers have captured approximately 25% of the server market, fueled by adoption in hyperscale environments for web services and AI inference. Leading providers such as AWS utilize Graviton instances for their energy efficiency, achieving up to 60% better power utilization than comparable x86 systems, which translates to substantial cost savings in large-scale operations— for instance, a 10% efficiency gain can save millions annually for a provider operating at AWS's scale. Ampere's Altra complements this by targeting edge and cloud workloads with high core-count scalability, further underscoring ARM's role in sustainable data center growth, supported by strong revenue momentum in late 2025.

Automotive and Industrial Uses

The ARM architecture plays a pivotal role in automotive applications, particularly in safety-critical systems such as advanced driver-assistance systems (ADAS) and electronic control units (ECUs) for engine management and braking. Cortex-R and Cortex-A processors, part of the R-profile and A-profile respectively, are widely deployed in these ECUs to handle real-time processing and complex computations, supporting up to Automotive Safety Integrity Level D (ASIL-D) as defined by ISO 26262. For redundancy, dual-core lock-step configurations in processors like the Cortex-R52 enable fault detection by running identical instructions in parallel on paired cores and comparing outputs, enhancing reliability in harsh operating conditions. These systems often operate across extended temperature ranges, typically from -40°C to 125°C, to withstand automotive environments. In-vehicle infotainment (IVI) systems also leverage ARM-based solutions for media processing and connectivity, with scalable Cortex-A cores providing efficient performance for user interfaces and connected features. Notable examples include NVIDIA's DRIVE Orin platform, which integrates Armv8-based CPU cores for ADAS and autonomous driving compute, delivering up to 254 TOPS of AI performance in a safety-certified design. Similarly, Renesas' R-Car series, such as the R-Car V4H, employs multiple Arm cores for ADAS and IVI applications, achieving ASIL-D systematic capability through integrated safety mechanisms. ARM technology powers solutions in 94% of global automakers, underscoring its dominance in automotive system-on-chips (SoCs). In industrial applications, ARM architectures support rugged, safety-critical environments such as programmable logic controllers (PLCs), where real-time control and functional safety are essential. The Armv8-R architecture, designed for deterministic performance, enables functional safety in these systems by providing features for error detection and recovery, suitable for applications requiring compliance with standards like IEC 61508.
For instance, vendors have adopted ARM-based platforms with SystemReady certification for software-defined PLCs, facilitating low-latency automation and secure operations in industrial networks. Elsewhere in industrial automation, Cortex-A and Cortex-R processors manage control and monitoring tasks, often incorporating redundancy to mitigate single-point failures in dynamic industrial settings. Industrial ARM implementations commonly feature extended temperature ratings up to 125°C to endure factory floor conditions.
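The lock-step redundancy described above is implemented in hardware, cycle by cycle, on two physical cores, but its compare-and-flag logic can be sketched at the software level. The `brake_pressure` function and its coefficients below are hypothetical stand-ins for a safety-critical computation.

```python
def lockstep(fn, inputs):
    """Software analogue of dual-core lock-step: run the same computation
    twice and compare results. A mismatch signals a possible transient
    fault, so the result is rejected rather than acted upon."""
    a = fn(*inputs)
    b = fn(*inputs)
    if a != b:
        raise RuntimeError("lock-step mismatch: possible transient fault")
    return a

# Hypothetical safety-critical calculation (illustrative coefficients).
def brake_pressure(pedal_pct, speed_kmh):
    return round(pedal_pct * 0.8 + speed_kmh * 0.05, 2)

print(lockstep(brake_pressure, (50, 100)))  # 45.0
```

In a real Cortex-R52 system the second core runs a few cycles behind the first and comparator hardware checks bus outputs every cycle, so faults are caught before any erroneous value leaves the processor.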

Standards and Certifications

Operating System Support

The Linux kernel has provided mainline support for ARM architectures since 1994, with kernel version 2.6 (released in 2003) introducing significant multi-platform enhancements that improved broad compatibility. Subsequent versions added support for 32-bit Armv7 (starting around 2007) and 64-bit Armv8/Armv9 (from 2012 onward) implementations across embedded, server, and desktop environments. Major distributions have built on this support extensively; for instance, Ubuntu offers official 64-bit ARM server and desktop images optimized for hardware ranging from single-board computers to cloud instances, while other distributions provide comprehensive aarch64 editions spanning hobbyist boards to enterprise servers. Android, built on the Android Open Source Project (AOSP), has been predominantly designed for ARM architectures since its inception, with Armv7 and Armv8 dominating the ecosystem due to their efficiency in mobile devices; the platform includes specific optimizations for ARM's NEON SIMD extensions in the Native Development Kit (NDK) to enhance multimedia and AI workloads. Google's ChromeOS has supported ARM architectures since version 5 in 2010, with native Armv7 and later Armv8/Armv9 compatibility for Chromebooks, enabling efficient, lightweight deployments in education and everyday computing. Microsoft's Windows on ARM64, introduced in 2017 with Windows 10, supports native 64-bit applications on Armv8 processors, and by 2025 it incorporates the Prism emulation layer in Windows 11 24H2 and later to run x86/x64 software more efficiently, including advanced vector instructions like AVX/AVX2 for broader app compatibility. For embedded systems, real-time operating systems such as FreeRTOS offer official ports for Cortex-M and Cortex-A cores, providing a lightweight real-time kernel with a small memory footprint suitable for microcontrollers and IoT devices. Apple's macOS, starting with macOS Big Sur in 2020, runs natively only on Apple's custom ARM-based processors (Armv8-A derivatives), leveraging the architecture's power efficiency for laptops and desktops without support for non-Apple ARM hardware.
Porting operating systems to ARM involves challenges such as adapting to the ARM application binary interface (ABI), which differs from x86 in areas like procedure call standards and data types (e.g., AAPCS64 for 64-bit), requiring recompilation or rewriting of binaries and libraries. Additionally, board and SoC support often necessitates custom development or upstreaming to the mainline kernel, as ARM's diverse SoC ecosystem demands platform-specific integrations for peripherals like GPUs and interrupt controllers, potentially increasing porting time and testing effort.
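One small but recurring porting chore is that operating systems do not even agree on the architecture's name: Linux reports `aarch64` from uname, while macOS and Windows report `arm64`. A sketch of the normalization step that build and packaging scripts commonly perform (the alias sets below cover the usual spellings but are not exhaustive):

```python
import platform

# Common spellings of the same ARM architectures across operating systems.
ARM64_ALIASES = {"aarch64", "arm64"}
ARM32_ALIASES = {"arm", "armv6l", "armv7l"}

def normalize_arch(machine=None):
    """Map platform.machine() output to a canonical architecture name."""
    m = (machine or platform.machine()).lower()
    if m in ARM64_ALIASES:
        return "arm64"
    if m in ARM32_ALIASES:
        return "arm32"
    return m  # e.g. "x86_64" passes through unchanged

print(normalize_arch("aarch64"))  # arm64 (Linux spelling)
print(normalize_arch("arm64"))    # arm64 (macOS/Windows spelling)
print(normalize_arch("armv7l"))   # arm32
```

A build system can then select the correct toolchain or prebuilt binary from the canonical name rather than special-casing each OS.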

Arm SystemReady and PSA Certified

Arm SystemReady is a compliance program developed by Arm to promote interoperability across Arm-based hardware platforms by standardizing firmware and boot processes, enabling off-the-shelf operating systems like Linux and Android to install and operate without hardware-specific modifications. The program is divided into bands tailored to different use cases: SystemReady SR targets desktop and server environments, ensuring compatibility with standard server OS distributions through defined hardware and firmware interfaces, while SystemReady ES focuses on embedded systems for IoT and edge applications, supporting lightweight boot flows suitable for resource-constrained devices. This structure reduces ecosystem fragmentation, allowing developers to deploy software across diverse hardware without extensive validation efforts. Central to SystemReady compliance are key components such as the Firmware Framework for A-profile (FF-A), which specifies secure interfaces for firmware components to manage resource access and isolation between secure and non-secure worlds, often leveraging hardware like TrustZone for protection. For server-oriented SR compliance, baseboard management standards, including the Server Base Manageability Requirements (SBMR), integrate Baseboard Management Controllers (BMCs) to enable remote monitoring, updates, and hardware oversight independent of the host OS. Validated platforms exemplify these standards; for instance, Qualcomm's Snapdragon-based platforms have achieved SystemReady compliance for embedded and IoT use cases, while Ampere's Mt. Jade server platform meets SR requirements, contributing to over 150 compliant systems available as of 2025. The PSA Certified framework, originally launched by Arm and transferred to GlobalPlatform governance in September 2025, provides a standardized IoT security assurance scheme to evaluate and certify the security posture of chips, firmware, and devices against defined threat models.
It encompasses assurance levels from 1 to 4: Level 1 involves vendor self-declaration of security requirements for the Platform Security Architecture (PSA); Level 2 requires independent lab testing of the PSA Root of Trust (PSA-RoT) against basic software vulnerabilities; Level 3 extends evaluation to substantial physical and sophisticated software attacks on the RoT; and Level 4 targets high robustness for integrated Secure Elements (iSE) or discrete Secure Elements (SE), protecting high-value assets like cryptographic keys. Core elements include the PSA-RoT, a minimal trusted component providing immutable security functions such as secure boot, which verifies firmware integrity and prevents unauthorized code from compromising the system. In 2025, PSA Certified expanded to address emerging needs in AI edge devices, incorporating certifications for processors with integrated AI accelerators that maintain secure isolation for models and data processing. For example, Renesas' RZ/V2L microprocessor, featuring Arm Cortex-A55 CPU cores and a built-in AI accelerator, achieved PSA Certified Level 2, demonstrating resistance to common IoT threats while supporting edge AI workloads. As of late 2025, the program has surpassed 250 certifications across nearly 90 providers, with over 100 certified chips enabling secure deployment in connected ecosystems.
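The secure-boot role of the PSA-RoT can be illustrated in miniature. Real implementations verify an asymmetric signature chained to keys held in immutable storage; this sketch substitutes a trusted SHA-256 digest as the root-of-trust anchor, and all firmware contents shown are hypothetical.

```python
import hashlib

# Stand-in for the immutable anchor a real RoT would hold (e.g. a public
# key hash fused into silicon); here it is just the trusted image digest.
TRUSTED_DIGEST = hashlib.sha256(b"firmware v1.2 image bytes").hexdigest()

def secure_boot(firmware_image: bytes) -> bool:
    """Refuse to 'boot' firmware whose digest differs from the trusted one."""
    return hashlib.sha256(firmware_image).hexdigest() == TRUSTED_DIGEST

print(secure_boot(b"firmware v1.2 image bytes"))  # True  -> boot proceeds
print(secure_boot(b"tampered firmware image"))    # False -> boot refused
```

Because the check runs before any mutable code executes, a compromised firmware update cannot subvert the verification that rejects it, which is the property the PSA-RoT's immutability is meant to guarantee.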

Recent Developments and Innovations

In 2025, Arm introduced the Armv9 Edge AI platform, optimized for Internet of Things (IoT) devices and featuring the new Cortex-A320 CPU core and Ethos-U85 Neural Processing Unit (NPU). This solution enables on-device execution of AI models exceeding 1 billion parameters, delivering up to 10 times the performance of prior generations while maintaining ultra-low power consumption for edge applications. Arm undertook a significant rebranding of its processor platforms in June 2025 to better align with market-specific needs and emphasize full-system solutions: the mobile segment now falls under the Lumex branding, targeting smartphones and tablets with AI-optimized cores, while the Niva brand was introduced for personal computers (PCs), focusing on AI-capable computing in desktops and laptops. This shift moves beyond the traditional Cortex naming, incorporating Compute Subsystems (CSS) for integrated CPU, GPU, and NPU designs to accelerate development for partners. Supporting these advancements, ExecuTorch 1.0, a lightweight runtime co-developed by Arm and Meta for deploying models on edge devices, was released in October 2025. The tool enables efficient on-device AI across CPUs, GPUs, and NPUs, supporting large language models (LLMs) and vision tasks with broad hardware compatibility and production-ready stability. Concurrently, the A-profile received updates in Armv9.7-A, including enhancements to resource management through Memory System Resource Partitioning and Monitoring (MPAMv2) for improved resource partitioning, monitoring, and system profiling, with up to 16-bit Partition Monitoring Groups (PMGs). These changes, alongside AI-specific extensions such as Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME) instructions for 6-bit data types, reduce memory bandwidth demands in AI workloads. Ecosystem expansions gained momentum in 2025, with Microsoft showcasing deeper Arm integrations for Azure cloud and Windows on Arm PCs, emphasizing AI acceleration and sustainable computing.
This collaboration supports Arm's push into the PC market, with forecasts indicating that Arm-based laptops could reach 20% of global shipments by year-end, driven by premium Snapdragon-based devices and emerging offerings from additional chipmakers. Arm's leadership has set a long-term ambition of over 50% Windows PC market share by 2029, building on 2025's projected 13-20% foothold amid competition from x86 architectures. Financially, Armv9 architectures contributed to robust growth, with quarterly revenue surpassing $1 billion in Q4 FY2025 (ending March 2025) and annual sales exceeding $4 billion, fueled by licensing and royalties across AI, mobile, and infrastructure deployments. Royalty revenue grew 25-30% year over year in early FY2026 quarters, underscoring Armv9's impact on premium shipments.

References

  1. https://en.wikichip.org/wiki/acorn/microarchitectures/arm3
  2. https://en.wikichip.org/wiki/arm/armv2