IA-64
| Designer | HP and Intel |
|---|---|
| Bits | 64-bit |
| Introduced | 2001 |
| Design | EPIC |
| Type | Load–store |
| Encoding | Fixed |
| Branching | Condition register |
| Endianness | Selectable |
| Registers | |
| General-purpose | 128 (64 bits plus 1 trap bit; 32 are static, 96 use register windows); 64 1-bit predicate registers |
| Floating-point | 128 |
IA-64 (Intel Itanium architecture) is the instruction set architecture (ISA) of the discontinued Itanium family of 64-bit Intel microprocessors. The basic ISA specification originated at Hewlett-Packard (HP), and was subsequently implemented by Intel in collaboration with HP. The first Itanium processor, codenamed Merced, was released in 2001.
The Itanium architecture is based on explicit instruction-level parallelism, in which the compiler decides which instructions to execute in parallel. This contrasts with superscalar architectures, which depend on the processor to manage instruction dependencies at runtime. In all Itanium models, up to and including Tukwila, cores execute up to six instructions per cycle.
In 2008, Itanium was the fourth-most deployed microprocessor architecture for enterprise-class systems, behind x86-64, Power ISA, and SPARC.[1]
In 2019, Intel announced the discontinuation of the last CPUs supporting the IA-64 architecture. Microsoft Windows versions from Server 2003[2] to Server 2008 R2[3] supported IA-64; later versions did not. The Linux kernel supported it for much longer, but dropped support in version 6.7, released in 2024 (it remains supported in the Linux 6.6 LTS series). Only a few other operating systems, such as HP-UX, OpenVMS, and FreeBSD, ever supported IA-64; HP-UX and OpenVMS still support it, while FreeBSD dropped support with FreeBSD 11.
History
Development
In 1989, HP became concerned that reduced instruction set computing (RISC) architectures were approaching a processing limit of one instruction per cycle. Both Intel and HP researchers had been exploring computer architecture options for future designs and separately began investigating a new concept known as very long instruction word (VLIW),[4] which came out of research at Yale University in the early 1980s.[5]
VLIW is a computer architecture concept (like RISC and CISC) in which a single very long instruction word encodes multiple operations, allowing the processor to execute several instructions in each clock cycle. Typical VLIW implementations rely heavily on sophisticated compilers to determine at compile time which instructions can be executed at the same time, to schedule those instructions properly for execution, and to help predict the direction of branch operations. The value of this approach is to do more useful work in fewer clock cycles and to simplify the processor's instruction-scheduling and branch-prediction hardware, avoiding the added processor complexity, cost, and energy consumption that dynamic hardware scheduling trades for faster execution.
Production
During this time, HP had begun to believe that it was no longer cost-effective for individual enterprise systems companies such as itself to develop proprietary microprocessors. Intel had also been researching several architectural options for going beyond the x86 ISA to address high-end enterprise server and high-performance computing (HPC) requirements.
Intel and HP partnered in 1994 to develop the IA-64 ISA, using a variation of VLIW design concepts which Intel named explicitly parallel instruction computing (EPIC). Intel's goal was to leverage the expertise HP had developed in their early VLIW work along with their own to develop a volume product line targeted at the aforementioned high-end systems that could be sold to all original equipment manufacturers (OEMs), while HP wished to be able to purchase off-the-shelf processors built using Intel's volume manufacturing and contemporary process technology that were better than their PA-RISC processors.
Intel took the lead on the design and commercialization process, while HP contributed to the ISA definition, the Merced/Itanium microarchitecture, and Itanium 2. The original goal year for delivering the first Itanium family product, Merced, was 1998.[4]
Marketing
Intel's product marketing and industry engagement efforts were substantial and achieved design wins with the majority of enterprise server OEMs, including those based on RISC processors at the time. Industry analysts predicted that IA-64 would dominate in servers, workstations, and high-end desktops, and eventually supplant both RISC and CISC architectures for all general-purpose applications.[6][7] Compaq and Silicon Graphics decided to abandon further development of the Alpha and MIPS architectures respectively in favor of migrating to IA-64.[8]
By 1997, it was apparent that the IA-64 architecture and the compiler were much more difficult to implement than originally thought, and the delivery of Itanium began slipping.[9] Since Itanium was the first ever EPIC processor, the development effort encountered more unanticipated problems than the team was accustomed to. In addition, the EPIC concept depended on compiler capabilities that had never been implemented before, so more research was needed.[10]
Several groups developed operating systems for the architecture, including Microsoft Windows, Unix and Unix-like systems such as Linux, HP-UX, FreeBSD, Solaris,[11][12][13] Tru64 UNIX,[8] and Monterey/64[14] (the last three were canceled before reaching the market). In 1999, Intel led the formation of an open-source industry consortium to port Linux to IA-64, named "Trillium" (and later renamed "Trillian" due to a trademark issue), whose members included Caldera Systems, CERN, Cygnus Solutions, Hewlett-Packard, IBM, Red Hat, SGI, SuSE, TurboLinux, and VA Linux Systems. As a result, a working IA-64 Linux was delivered ahead of schedule and was the first OS to run on the new Itanium processors.
Intel announced the official name of the processor, Itanium, on October 4, 1999.[15] Within hours, the name Itanic had been coined on a Usenet newsgroup as a pun on the name Titanic, the "unsinkable" ocean liner that sank on its maiden voyage in 1912.[16]
The very next day, on October 5, 1999, AMD announced plans to extend Intel's x86 instruction set with a fully backward-compatible 64-bit mode, revealing the 64-bit x86 architecture it had already been working on, to be incorporated into AMD's upcoming eighth-generation microprocessor, code-named SledgeHammer.[17] AMD also said that full specifications and further details of the architecture would be made available in August 2000.[18]
As AMD was never invited to contribute to the IA-64 architecture and any kind of licensing seemed unlikely, AMD's AMD64 extension was positioned from the beginning as an evolutionary way to add 64-bit computing capabilities to the existing x86 architecture while still supporting legacy 32-bit x86 code, as opposed to Intel's approach of creating an entirely new, completely x86-incompatible 64-bit architecture with IA-64.
End of life
In January 2019, Intel announced that Kittson would be discontinued, with a last order date of January 2020 and a last ship date of July 2021.[19][20] In November 2023, IA-64 support was removed from the Linux kernel and has since been maintained out-of-tree.[21][22][23]
Architecture
Intel has extensively documented the Itanium instruction set[24] and the technical press has provided overviews.[6][9]
The architecture has been renamed several times during its history. HP originally called it PA-WideWord. Intel later called it IA-64, then Itanium Processor Architecture (IPA),[25] before settling on Intel Itanium Architecture, but it is still widely referred to as IA-64.
It is a 64-bit register-rich explicitly parallel architecture. The base data word is 64 bits, byte-addressable. The logical address space is 2⁶⁴ bytes. The architecture implements predication, speculation, and branch prediction. It uses variable-sized register windowing for parameter passing. The same mechanism is also used to permit parallel execution of loops. Speculation, prediction, predication, and renaming are under control of the compiler: each instruction word includes extra bits for this. This approach is the distinguishing characteristic of the architecture.
The architecture implements a large number of registers:[26][27][28]
- 128 general integer registers, which are 64-bit plus one trap bit ("NaT", which stands for "not a thing") used for speculative execution. 32 of these are static, the other 96 are stacked using variably-sized register windows, or rotating for pipelined loops (a sketch of the rotation idea follows this list). `gr0` always reads 0.
- 128 floating-point registers. The floating-point registers are 82 bits long to preserve precision for intermediate results. Instead of a dedicated "NaT" trap bit like the integer registers, floating-point registers have a trap value called "NaTVal" ("Not a Thing Value"), similar to (but distinct from) NaN. These also have 32 static registers and 96 windowed or rotating registers. `fr0` always reads +0.0, and `fr1` always reads +1.0.
- 64 one-bit predicate registers. These have 16 static registers and 48 windowed or rotating registers. `pr0` always reads 1 (true).
- 8 branch registers, for the addresses of indirect jumps. `br0` is set to the return address when a function is called with `br.call`.
- 128 special purpose (or "application") registers, which are mostly of interest to the kernel and not ordinary applications. For example, one register called `bsp` points to the second stack, which is where the hardware will automatically spill registers when the register window wraps around.
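The rotating registers mentioned above are what make compiler-driven software pipelining of loops practical. The following Python sketch illustrates only the renaming idea, under the assumption (taken from the list above) of 96 rotating general registers and a rotating register base that the loop-closing branch shifts by one each iteration; it is not a cycle-accurate model of the hardware:

```python
# Simplified model of rotating registers: logical names r32..r127 are mapped
# onto the 96 physical rotating registers through a rotating register base
# (rrb) that is shifted once per loop iteration. The effect is that a value
# written as r32 in one iteration is readable as r33 in the next, letting a
# pipelined loop keep several iterations in flight without unrolling.

SIZE = 96                     # rotating portion of the general registers
phys = [0] * SIZE             # physical rotating registers
rrb = 0                       # rotating register base

def reg(logical):
    """Map a logical register name (32..127) to a physical slot."""
    return (logical - 32 + rrb) % SIZE

def rotate():
    """What the loop-closing branch does: shift the name space by one."""
    global rrb
    rrb = (rrb - 1) % SIZE

# Stage 1 of a pipelined loop writes r32; stage 2 reads the same value as r33
# one iteration later.
for i in range(4):
    phys[reg(32)] = i * 10                                # produced this iteration
    if i > 0:
        print("iteration", i, "consumes", phys[reg(33)])  # produced last iteration
    rotate()
```

As the list above notes, the real architecture applies the same rotation to subsets of the floating-point and predicate registers as well.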
Each 128-bit instruction word is called a bundle, and contains three slots each holding a 41-bit instruction, plus a 5-bit template indicating which type of instruction is in each slot. Those types are M-unit (memory instructions), I-unit (integer ALU, non-ALU integer, or long immediate extended instructions), F-unit (floating-point instructions), or B-unit (branch or long branch extended instructions). The template also encodes stops which indicate that a data dependency exists between data before and after the stop. All instructions between a pair of stops constitute an instruction group, regardless of their bundling, and must be free of many types of data dependencies; this knowledge allows the processor to execute instructions in parallel without having to perform its own complicated data analysis, since that analysis was already done when the instructions were written.
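To make the bundle layout concrete, here is a minimal Python sketch that splits a 128-bit bundle value into its 5-bit template and three 41-bit slots. It assumes the template occupies the five low-order bits with the slots packed above it, and the values round-tripped through it are arbitrary placeholders rather than real encoded instructions:

```python
# A minimal sketch of splitting a 128-bit bundle into its fields, assuming the
# 5-bit template sits in the low-order bits followed by three 41-bit slots.
MASK41 = (1 << 41) - 1

def decode_bundle(bundle: int):
    """Return (template, slot0, slot1, slot2) from a 128-bit bundle value."""
    assert 0 <= bundle < (1 << 128)
    template = bundle & 0x1F              # bits 0-4
    slot0 = (bundle >> 5) & MASK41        # bits 5-45
    slot1 = (bundle >> 46) & MASK41       # bits 46-86
    slot2 = (bundle >> 87) & MASK41       # bits 87-127
    return template, slot0, slot1, slot2

# Round-trip an arbitrary (not a real instruction) payload through the layout.
slots = [0x123456789A, 0x0BADC0FFEE, 0x1FFFFFFFFFF]
bundle = 0x00                             # template value 0, purely illustrative
for i, s in enumerate(slots):
    bundle |= s << (5 + 41 * i)
t, s0, s1, s2 = decode_bundle(bundle)
print(t, hex(s0), hex(s1), hex(s2))       # -> 0 0x123456789a 0xbadc0ffee 0x1ffffffffff
```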
Within each slot, all but a few instructions are predicated, specifying a predicate register, the value of which (true or false) will determine whether the instruction is executed. Predicated instructions which should always execute are predicated on pr0, which always reads as true.
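A toy Python model of this predication scheme may help: a compare writes a pair of complementary predicates, and each later operation names the predicate that gates it. The register and predicate numbers below are arbitrary illustrative choices and the semantics are heavily simplified:

```python
# Toy model of predicated execution, not the real ISA semantics.
pr = [False] * 64
pr[0] = True          # pr0 is hardwired to true
gr = [0] * 128

def cmp_eq(p_true, p_false, a, b):
    """Model of a compare that sets two complementary predicates."""
    pr[p_true] = (a == b)
    pr[p_false] = not pr[p_true]

def predicated(pred, dest, value):
    """Model of '(p) mov dest = value': a no-op when the predicate is false."""
    if pr[pred]:
        gr[dest] = value

# Roughly the if-conversion of: r8 = 1 if r4 == r5 else 2
gr[4], gr[5] = 10, 10
cmp_eq(6, 7, gr[4], gr[5])   # sets p6 = (r4 == r5), p7 = not p6
predicated(6, 8, 1)          # (p6) r8 = 1
predicated(7, 8, 2)          # (p7) r8 = 2 -- skipped here because p7 is false
print(gr[8])                 # -> 1
```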
The IA-64 assembly language and instruction format were deliberately designed to be written mainly by compilers, not by humans. Instructions must be grouped into bundles of three, ensuring that the three instructions match an allowed template, and stops must be inserted between instructions with certain types of data dependencies; stops can also only be placed in limited positions according to the allowed templates.
Instruction execution
The fetch mechanism can read up to two bundles per clock from the L1 cache into the pipeline. When the compiler can take maximum advantage of this, the processor can execute six instructions per clock cycle. The processor has thirty functional execution units in eleven groups. Each unit can execute a particular subset of the instruction set, and each unit executes at a rate of one instruction per cycle unless execution stalls waiting for data. While not all units in a group execute identical subsets of the instruction set, common instructions can be executed in multiple units.
The execution unit groups include:
- Six general-purpose ALUs, two integer units, one shift unit
- Four data cache units
- Six multimedia units, two parallel shift units, one parallel multiply, one population count
- Two 82-bit floating-point multiply–accumulate units, two SIMD floating-point multiply–accumulate units (two 32-bit operations each)[29]
- Three branch units
Ideally, the compiler can often group instructions into sets of six that can execute at the same time. Since the floating-point units implement a multiply–accumulate operation, a single floating-point instruction can perform the work of two instructions when the application requires a multiply followed by an add: this is very common in scientific processing. When it occurs, the processor can execute four FLOPs per cycle. For example, the 800 MHz Itanium had a theoretical rating of 3.2 GFLOPS and the fastest Itanium 2, at 1.67 GHz, was rated at 6.67 GFLOPS.
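These peak figures follow directly from counting each multiply–accumulate as two floating-point operations across the two FP units; a quick check (assuming the "1.67 GHz" part's exact clock is 1.667 GHz):

```python
# Reproducing the peak-FLOPS figures quoted above: two floating-point
# multiply-accumulate units, each counted as two FLOPs per cycle.
fma_units = 2
flops_per_fma = 2                      # one fused multiply-add = 2 operations

for clock_hz in (800e6, 1.667e9):      # Itanium 800 MHz, Itanium 2 "1.67 GHz"
    peak = fma_units * flops_per_fma * clock_hz
    print(f"{clock_hz / 1e9:.2f} GHz -> {peak / 1e9:.2f} GFLOPS")
# prints:
#   0.80 GHz -> 3.20 GFLOPS
#   1.67 GHz -> 6.67 GFLOPS
```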
In practice, the processor may often be underutilized, with not all slots filled with useful instructions owing to, for example, data dependencies or limitations in the available bundle templates. The densest possible code requires 42.6 bits per instruction (a 128-bit bundle divided among three instructions), compared to 32 bits per instruction on traditional RISC processors of the time, and no-ops due to wasted slots further decrease the density of code. Additional instructions for speculative loads and hints for branches and cache are impractical to generate optimally, because a compiler cannot predict the contents of the different cache levels on a system running multiple processes and taking interrupts.
Memory architecture
From 2002 to 2006, Itanium 2 processors shared a common cache hierarchy. They had 16 KB of Level 1 instruction cache and 16 KB of Level 1 data cache. The L2 cache was unified (both instruction and data) and was 256 KB. The Level 3 cache was also unified and varied in size from 1.5 MB to 24 MB. The 256 KB L2 cache contained sufficient logic to handle semaphore operations without disturbing the main arithmetic logic unit (ALU).
Main memory is accessed through a bus to an off-chip chipset. The Itanium 2 bus was initially called the McKinley bus, but is now usually referred to as the Itanium bus. The speed of the bus increased steadily with new processor releases. The bus transfers 2×128 bits per clock cycle, so the 200 MHz McKinley bus transferred 6.4 GB/s and the 533 MHz Montecito bus 17.056 GB/s.[30]
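Both bandwidth figures follow from the 2×128-bit-per-clock transfer rate stated above; a quick check:

```python
# Reproducing the bus figures above: 2 x 128 bits (i.e. 32 bytes) per bus clock.
bytes_per_clock = 2 * 128 // 8         # 32 bytes transferred per clock cycle

for name, clock_hz in (("McKinley bus, 200 MHz", 200e6),
                       ("Montecito bus, 533 MHz", 533e6)):
    print(f"{name}: {bytes_per_clock * clock_hz / 1e9} GB/s")
# prints:
#   McKinley bus, 200 MHz: 6.4 GB/s
#   Montecito bus, 533 MHz: 17.056 GB/s
```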
Architectural changes
Itanium processors released prior to 2006 had hardware support for the IA-32 architecture to permit support for legacy server applications, but performance for IA-32 code was much worse than for native code and also worse than the performance of contemporaneous x86 processors. In 2005, Intel developed the IA-32 Execution Layer (IA-32 EL), a software emulator that provides better performance. With Montecito, Intel therefore eliminated hardware support for IA-32 code.
In 2006, with the release of Montecito, Intel made a number of enhancements to the basic processor architecture including:[31]
- Hardware multithreading: Each processor core maintains context for two threads of execution. When one thread stalls during memory access, the other thread can execute. Intel calls this "coarse multithreading" to distinguish it from the "hyper-threading technology" Intel integrated into some x86 and x86-64 microprocessors.
- Hardware support for virtualization: Intel added Intel Virtualization Technology (Intel VT-i), which provides hardware assists for core virtualization functions. Virtualization allows a software "hypervisor" to run multiple operating system instances on the processor concurrently.
- Cache enhancements: Montecito added a split L2 cache, which included a dedicated 1 MB L2 cache for instructions. The original 256 KB L2 cache was converted to a dedicated data cache. Montecito also included up to 12 MB of on-die L3 cache.
References
- ^ Morgan, Timothy (2008-05-27). "The Server Biz Enjoys the X64 Upgrade Cycle in Q1". IT Jungle. Archived from the original on 2016-03-03. Retrieved 2008-10-29.
- ^ "Windows Server 2003 - BetaWiki".
- ^ "Windows Server 2008 R2 - BetaWiki". https://betawiki.net/wiki/Windows_Server_2008_R2
- ^ a b "Inventing Itanium: How HP Labs Helped Create the Next-Generation Chip Architecture". HP Labs. June 2001. Archived from the original on 2012-03-04. Retrieved 2007-03-23.
- ^ Fisher, Joseph A. (1983). "Very Long Instruction Word architectures and the ELI-512". Proceedings of the 10th annual international symposium on Computer architecture. International Symposium on Computer Architecture. New York, NY, USA: Association for Computing Machinery (ACM). pp. 140–150. doi:10.1145/800046.801649. ISBN 0-89791-101-6.
- ^ a b De Gelas, Johan (2005-11-09). "Itanium–Is there light at the end of the tunnel?". AnandTech. Archived from the original on 2012-05-03. Retrieved 2007-03-23.
- ^ Takahashi, Dean (2009-05-08). "Exit interview: Retiring Intel chairman Craig Barrett on the industry's unfinished business". VentureBeat. Archived from the original on 2018-04-21. Retrieved 2009-05-17.
- ^ a b "Itanium: A cautionary tale". Tech News on ZDNet. 2005-12-07. Archived from the original on 2008-02-09. Retrieved 2007-11-01.
- ^ a b Shankland, Stephen (1999-07-08). "Intel's Merced chip may slip further". CNET News. Archived from the original on 2012-10-24. Retrieved 2008-10-16.
- ^ "Microprocessors — VLIW, The Past" (PDF). NY University. 2002-04-18. Archived (PDF) from the original on 2018-06-27. Retrieved 2018-06-26.
- ^ Vijayan, Jaikumar (1999-09-01). "Solaris for IA-64 coming this fall". Computerworld. Archived from the original on 2000-01-15.
- ^ Wolfe, Alexander (1999-09-02). "Core-logic efforts under way for Merced". EE Times. Archived from the original on 2016-03-06. Retrieved February 27, 2016.
- ^ "Sun Introduces Solaris Developer Kit for Intel to Speed Development of Applications On Solaris; Award-winning Sun Tools Help ISVs Easily Develop for Solaris on Intel Today". Business Wire. 1998-03-10. Archived from the original on 2004-09-20. Retrieved 2008-10-16.
- ^ "Next-generation chip passes key milestone". CNET News.com. 1999-09-17. Archived from the original on 2011-08-09. Retrieved 2007-11-01.
- ^ Kanellos, Michael (1999-10-04). "Intel names Merced chip Itanium". CNET News.com. Archived from the original on 2015-12-30. Retrieved 2007-04-30.
- ^ Finstad, Kraig (1999-10-04). "Re:Itanium". Newsgroup: comp.sys.mac.advocacy. Retrieved 2013-12-19.
- ^ "AMD Discloses New Technologies At Microporcessor Forum" (Press release). AMD. October 5, 1999. Archived from the original on March 8, 2012. Retrieved August 15, 2022.
- ^ "AMD Releases x86-64 Architectural Specification; Enables Market Driven Migration to 64-Bit Computing" (Press release). AMD. August 10, 2000. Archived from the original on March 8, 2012. Retrieved August 15, 2022.
- ^ Anton Shilov (January 31, 2019). "Intel to Discontinue Itanium 9700 'Kittson' Processor, the Last of the Itaniums". AnandTech. Archived from the original on April 16, 2019. Retrieved April 16, 2019.
- ^ "Product Change Notification" (PDF). January 30, 2019. Archived (PDF) from the original on February 1, 2019. Retrieved May 9, 2019.
- ^ Larabel, Michael (2 November 2023). "Intel Itanium IA-64 Support Removed With The Linux 6.7 Kernel". www.phoronix.com. Phoronix. Retrieved 4 November 2023.
- ^ "linux-ia64". GitHub. Retrieved October 1, 2024.
Maintenance and development of the Linux operating system for Intel Itanium architecture (IA-64)
- ^ "EPIC Linux". Retrieved October 1, 2024.
- ^ "Intel Itanium Architecture Software Developer's Manual". Archived from the original on 2019-04-08. Retrieved 2019-04-08.
- ^ "HPWorks Newsletter". September 2001. Archived from the original on 2008-11-20. Retrieved 2008-01-24.
- ^ Chen, Raymond (2015-07-27). "The Itanium processor, part 1: Warming up". Archived from the original on 2018-11-01. Retrieved 2018-10-31.
- ^ Chen, Raymond (2015-07-28). "The Itanium processor, part 2: Instruction encoding, templates, and stops". Archived from the original on 2018-11-01. Retrieved 2018-10-31.
- ^ Chen, Raymond (2015-07-29). "The Itanium processor, part 3: The Windows calling convention, how parameters are passed". Archived from the original on 2018-11-01. Retrieved 2018-10-31.
- ^ Sharangpani, Harsh; Arora, Ken (2000). "Itanium Processor Microarchitecture". IEEE Micro. pp. 38–39.
- ^ Cataldo, Anthony (2001-08-30). "Intel outfits Itanium processor for faster runs". EE Times. Archived from the original on 2020-08-01. Retrieved 2020-01-19.
- ^ "Intel product announcement". Intel web site. Archived from the original on November 7, 2007. Retrieved 2007-05-16.
External links
- Intel Itanium Processors at the Wayback Machine (archived 2015-01-21)
- Hewlett Packard Enterprise Integrity Servers Home Page Archived 2017-05-15 at the Wayback Machine
- Intel Itanium Specifications
- Some undocumented Itanium 2 microarchitectural information at the Wayback Machine (archived 2007-02-23)
- IA-64 tutorial, including code examples
- Itanium Docs at HP
IA-64
Introduction
Overview
IA-64, also known as the Intel Itanium architecture, is a 64-bit instruction set architecture (ISA) developed jointly by Intel and Hewlett-Packard as an implementation of Explicitly Parallel Instruction Computing (EPIC).[8] This collaboration, initiated in 1994, aimed to create a new processor architecture optimized for exploiting instruction-level parallelism through compiler assistance rather than relying solely on hardware mechanisms.[9] At its core, IA-64 features 128 general-purpose registers, each 64 bits wide, enabling extensive data handling for complex computations.[1] Instructions are organized in a bundle-based format, where each 128-bit bundle contains three 41-bit instructions along with a 5-bit template that specifies execution rules, such as which instructions can proceed in parallel or depend on branches.[1] This structure facilitates efficient decoding and supports the architecture's emphasis on parallelism.

IA-64 was intended for high-performance computing, enterprise servers, and scientific workloads, positioning it as a successor to Hewlett-Packard's PA-RISC architecture while complementing Intel's x86 lineup for broader market coverage.[2] In contrast to traditional Reduced Instruction Set Computing (RISC) and Complex Instruction Set Computing (CISC) designs, which depend on dynamic hardware scheduling to identify and execute parallel instructions at runtime, EPIC shifts much of this responsibility to the compiler, allowing it to explicitly annotate code for parallelism and reduce hardware complexity.[1]

Design principles
The IA-64 architecture is founded on the Explicitly Parallel Instruction Computing (EPIC) paradigm, which shifts the responsibility for extracting instruction-level parallelism (ILP) primarily to the compiler through static scheduling, rather than relying on complex hardware mechanisms for dynamic out-of-order execution as seen in contemporary x86 designs.[10][11] This approach simplifies hardware design by enabling the compiler to explicitly group and order instructions for parallel execution, thereby reducing the need for runtime hardware speculation and renaming, while promoting predictability and scalability in performance.[10] By leveraging advanced compiler optimizations, EPIC aims to expose higher levels of ILP from source code, particularly in loops and control-intensive regions, to achieve superior throughput on wide-issue processors.[11]

Central to IA-64's innovations are mechanisms like predication, speculation, and branch hints, which empower the compiler to mitigate common performance bottlenecks without excessive branch mispredictions or latency stalls. Predication employs 64 one-bit predicate registers to conditionally execute instructions, converting control dependencies into data dependencies and eliminating many branches through if-conversion techniques.[10] If a predicate is true, the instruction proceeds normally; if false, it becomes a no-op without altering architectural state, thereby allowing the compiler to schedule instructions across potential branch paths for greater ILP.[10] Speculation further enhances this by supporting control speculation, where loads execute before branches using deferred exception handling via NaT bits, and data speculation, which resolves ambiguous memory dependencies through advanced load tables and check instructions to hide access latencies.[10] Branch hints, such as those indicating likely taken or nontaken paths, provide compiler directives to the hardware for improved prediction and prefetching, optimizing branch behavior without mandating complex dynamic predictors.[10]
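The control-speculation mechanism can be pictured with a small Python model: a speculative load defers a would-be fault by producing a NaT token, and a later check detects the token and runs recovery code. The memory contents and addresses below are stand-ins, and the real ld.s/chk.s semantics are considerably richer:

```python
# Toy model of IA-64 control speculation as described above.
NAT = object()                     # sentinel playing the role of the NaT bit
memory = {0x1000: 42}              # toy address space

def ld_s(addr):
    """Speculative load: never raises, returns NaT if the access would fault."""
    return memory.get(addr, NAT)

def chk_s(value, recover):
    """Check instruction: run recovery code if the value carries NaT."""
    return recover() if value is NAT else value

# A load hoisted above the branch that guards it:
r8 = ld_s(0x2000)                          # would have faulted: r8 holds NaT
r8 = chk_s(r8, recover=lambda: 0)          # recovery supplies a safe value
print(r8)                                  # -> 0

r9 = chk_s(ld_s(0x1000), recover=lambda: 0)
print(r9)                                  # -> 42
```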
Register rotation represents another key design element, facilitating efficient software pipelining in loops by dynamically renaming registers across iterations without code duplication or explicit unrolling. In IA-64, subsets of general-purpose registers (GR32–GR127), floating-point registers (FR32–FR127), and predicate registers (PR16–PR63) rotate modulo-style under compiler control, enabling overlapped loop execution where prologues and epilogues are minimized through mechanisms like the current frame marker.[10] This rotation supports modulo scheduling, allowing the compiler to pipeline loop bodies seamlessly and achieve high resource utilization, particularly in compute-intensive kernels, by treating iterations as a steady-state stream of operations.[11]

Quality of implementation (QOI) guidelines underscore the architecture's emphasis on hardware-software co-design, requiring compilers to aggressively expose parallelism via predication, speculation, and rotation while balancing code size and resource constraints to fully exploit IA-64's potential.[10] These guidelines highlight implementation-dependent aspects, such as the size of speculation support structures like the advanced load address table, encouraging compilers to minimize speculation overhead and adhere to dependency rules for predictable behavior across processors.[10] By prioritizing compiler sophistication, QOI ensures that the architecture's features deliver scalable performance, with the compiler playing the pivotal role in resolving runtime ambiguities through informed static decisions.[11]

History
Origins and collaboration
In June 1994, Intel Corporation and Hewlett-Packard Company (HP) announced a strategic alliance to jointly develop a new 64-bit instruction set architecture (ISA), later named IA-64, marking a significant departure from existing processor designs. This partnership leveraged Intel's expertise in high-volume semiconductor manufacturing and HP's deep knowledge of precision architecture derived from its PA-RISC (Precision Architecture Reduced Instruction Set Computing) lineage, with HP's internal PA-Wide Word (PA-WW) project serving as an initial conceptual foundation for the collaboration.[12][13]

The primary motivations for this joint effort stemmed from the recognized limitations of the prevailing 32-bit x86 architecture, particularly its constraints in addressing space, instruction-level parallelism, and scalability for demanding enterprise workloads and high-performance computing (HPC) applications. Intel and HP sought to create a clean-slate 64-bit design unencumbered by the backward-compatible complexities of the CISC-based x86, enabling innovations in explicit parallelism and speculation to deliver superior performance in servers, workstations, and technical computing environments while protecting long-term software investments through strategic compatibility features.[8][14][15]

Early milestones in the project included the codenaming of the inaugural IA-64 implementation as Merced, with collaborative work commencing immediately after the 1994 announcement and focusing on architectural specifications that integrated advanced parallelism concepts. By late 1996, initial specifications had been outlined, emphasizing a novel execution model; a key aspect was the planned provision for x86 backward compatibility through on-chip emulation or dynamic translation mechanisms to ensure seamless operation of legacy software without requiring full recompilation.[13][15][16]

The collaboration presented notable challenges, as Intel prioritized scalable, high-volume production to penetrate broad markets, while HP advocated for the refined, high-precision engineering principles honed in its PA-RISC development, leading to tensions in design priorities and project timelines that occasionally strained the partnership's dynamics.[17][13]

Development milestones
The prototype development for the IA-64 architecture centered on the Merced core, with tape-out occurring on July 4, 1999, followed by production of the first complete test chips in August 1999. First engineering samples were delivered to customers later that year, but early silicon revealed significant performance shortfalls, largely attributable to an inefficient memory subsystem limited to two pipelines and deep pipeline stalls that reduced effective instruction throughput despite the intended 6-wide issue design.[18][3]

The Merced core powered the inaugural production IA-64 chip, the Itanium processor, launched in May 2001 at clock speeds of 733–800 MHz on a 180 nm process. This debut implementation struggled to meet expectations due to the unresolved pipeline and memory bottlenecks. The subsequent McKinley core, released in 2002 as the foundation for Itanium 2, enhanced the 6-wide issue capability with up to 1 GHz clock speeds, four memory pipelines, and roughly double the performance of Merced through optimized branch prediction and reduced latency.[19][20][3]

Architectural advancements progressed with the Madison core in June 2003 for Itanium 2, shifting to a 130 nm process with clock speeds up to 1.5 GHz and expanded L3 caches of 6 MB, yielding 30–50% better performance over McKinley via improved frequency scaling and cache hierarchy efficiency. The Montecito core followed in 2006, introducing a dual-core configuration on 90 nm, Intel Hyper-Threading Technology for explicit multithreading to boost parallelism, and per-core L3 caches up to 12 MB (24 MB total), further elevating throughput in multithreaded workloads.[21][22][23]

The IA-64 instruction set received key extensions in the IA-64-2 revision announced in 2005, incorporating Intel Virtualization Technology (VT-i) for hardware-assisted virtualization and enhancements to floating-point precision and operations to better support scientific computing demands.[24]

Production and releases
The production of IA-64 processors, branded as Itanium, was handled exclusively by Intel in its own semiconductor fabrication facilities, beginning with volume manufacturing in 2001 after significant delays with the initial Merced design due to design verification challenges and low initial yields.[20] Yields improved in subsequent generations as process technologies advanced from 180 nm to smaller nodes, enabling more reliable output for enterprise applications.[25]

The release timeline commenced with the first Itanium processor (Merced core) in May 2001, targeted at high-end servers.[26] This was followed by the Itanium 2 family, starting with the McKinley core in 2002, Madison in 2003, and dual-core Montecito in July 2006. Later models included the quad-core Tukwila in February 2010, which introduced enhanced reliability, availability, and serviceability (RAS) features for mission-critical computing, and Poulson in November 2012.[27]

Production volumes remained modest compared to Intel's x86 lineup, with annual shipments of Itanium-based systems reaching a peak of approximately 26,000 units in 2004, primarily for server markets.[28] By the late 2000s, demand had declined, leading to a shift post-2010 toward custom manufacturing orders, mainly from Hewlett-Packard (later HPE), which funded continued production to support its Integrity server line.[29]

The final new Itanium design, the Kittson series (9700), launched in May 2017 without major architectural changes from Poulson but on a 32 nm process.[30] Intel accepted orders until January 2020, with legacy shipments concluding on July 29, 2021, marking the end of IA-64 processor production.[4]

Architecture
Instruction set and bundling
The IA-64 instruction set architecture (ISA) features fixed-length 41-bit instructions, each incorporating a 6-bit predicate field that allows conditional execution based on predicate registers, enabling the compiler to eliminate branches and enhance parallelism. This format supports explicit parallel instruction computing (EPIC), where instructions are designed for hardware-level parallelism without relying on dynamic scheduling. Unlike traditional ISAs with condition codes, IA-64 instructions do not generate or use flags for control flow; instead, predicates provide fine-grained control, reducing branch mispredictions.[31]

Instructions are organized into 128-bit bundles, each containing three 41-bit instructions and a 5-bit template field that precedes them. The template specifies the types of instructions in each of the three slots and indicates where execution stops occur, guiding the compiler in grouping independent instructions for parallel issue while respecting dependencies. This bundling mechanism ensures that the hardware can process multiple instructions atomically, with the bundle serving as the basic unit of fetch and dispatch. Bundles are aligned on 16-byte boundaries, and the architecture mandates that instructions cannot span bundle boundaries.[31]

The 5-bit template defines one of 13 possible formats, categorized by slot types: M for memory operations (loads and stores), I for integer ALU operations, F for floating-point operations, B for branches, and a wildcard for extended opcodes. Common templates include:

| Template | Slot 1 | Slot 2 | Slot 3 | Description |
|---|---|---|---|---|
| MII | M | I | I | Memory followed by two integer operations; common for load-use patterns. |
| MMI | M | M | I | Two memory operations and one integer; allows parallel loads. |
| MFI | M | F | I | Memory, floating-point, and integer; supports mixed data types. |
| MIB | M | I | B | Memory, integer, and branch; facilitates predicated control flow. |
| II | I | I | - | Two integer operations (third slot unused or extended). |
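Packing instructions into these templates is the compiler's bundling problem. The Python sketch below uses only the subset of templates listed in the table above; it is not a complete bundler, and the full architecture defines more templates than shown here:

```python
# Toy illustration of matching three instruction types against a (partial)
# template set taken from the table above.
TEMPLATES = {
    "MII": ("M", "I", "I"),
    "MMI": ("M", "M", "I"),
    "MFI": ("M", "F", "I"),
    "MIB": ("M", "I", "B"),
}

def pick_template(slot_types):
    """Return the name of a template matching three slot types, else None."""
    for name, pattern in TEMPLATES.items():
        if tuple(slot_types) == pattern:
            return name
    return None

# Three independent instructions the compiler wants to issue together:
print(pick_template(["M", "F", "I"]))   # -> MFI
print(pick_template(["I", "I", "M"]))   # -> None: no such template, so the
                                        #    compiler must reorder or pad with no-ops
```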
Register architecture
The IA-64 architecture features a large register file designed to support explicit parallelism and software pipelining, with 128 general-purpose registers, 128 floating-point registers, 64 predicate registers, and 8 branch registers, enabling efficient handling of speculative execution and loop optimizations.[32] This organization, including rotating subsets in several register types, scales to accommodate high instruction-level parallelism by reducing the need for frequent memory accesses during procedure calls and iterations.[33]

The general-purpose registers (GPRs), denoted as GR0 through GR127 or r0 through r127, consist of 128 64-bit integer registers, each augmented with a Not-a-Thing (NaT) bit for managing speculative exceptions.[32] GR0 (r0) is hardwired to zero on reads and faults on writes, serving as a constant for computations.[34] The registers are divided into a static subset (GR0–GR31 or r0–r31) visible across procedure calls and a rotating subset (GR32–GR127 or r32–r127) managed by the Register Stack Engine (RSE) for stacking during function invocations, with the rotation size configurable in multiples of 8 up to 96 registers per frame via the alloc instruction to support loop unrolling.[32][33] (A simplified sketch of this call-time stacking follows the summary table below.)

Floating-point registers, labeled FR0 through FR127 or f0 through f127, provide 128 82-bit registers (1 sign bit, 17 exponent bits, and 64 significand bits) that conform to IEEE 754 formats for single-, double-, and double-extended precision operations.[32] FR0 reads as +0.0 and FR1 as +1.0, both read-only, while the remaining registers include a static subset (FR0–FR31 or f0–f31) and a fully rotating subset (FR32–FR127 or f32–f127) to facilitate software pipelining in floating-point intensive loops.[34] Each register includes a NaTVal for speculation, and pairs of registers can be used for 128-bit operations such as quad-precision arithmetic.[33] The floating-point status is controlled by the FPSR application register.

Predicate registers, PR0 through PR63 or p0 through p63, comprise 64 one-bit registers organized into eight 8-bit groups (pr0 through pr7) for efficient manipulation in conditional code.[32] PR0 (p0) is always 1 and read-only, used as a default true predicate, while PR16–PR63 (p16–p63) form a rotating subset controlled by the CFM register's rrb.pr field to enable predicated execution across loop iterations.[33] These registers, typically set by compare instructions, allow fine-grained control over instruction execution to minimize branches and enhance parallelism.[34]

Branch registers, BR0 through BR7 or b0 through b7, are eight 64-bit static registers dedicated to holding target addresses for indirect branches and calls.[32] BR0 serves as the return pointer for branch calls, with the others available for general use in control flow operations.[34]

Application and control registers include up to 128 special-purpose registers (AR0–AR127), such as the eight kernel registers (KR0–KR7) for privileged operations, along with others like RSC for RSE control, PFS for function state, LC and EC for loop counters, and FPSR for floating-point modes.[32] Most are 64-bit and static, with access restricted by privilege levels; for example, KR0–KR7 are writable only at the highest privilege.[33] These registers support system state management and are essential for coordinating the rotating register mechanisms.[34]

| Register Type | Number | Width | Key Organization | Special Features |
|---|---|---|---|---|
| General-Purpose (GPRs) | 128 | 64 bits + NaT | 32 static, 96 rotating | R0 = 0; RSE-managed stacking |
| Floating-Point | 128 | 82 bits | 32 static, 96 rotating | IEEE 754 support; F0=0.0, F1=1.0; NaTVal |
| Predicate | 64 | 1 bit | 16 static, 48 rotating | P0=1; 8 groups of 8 bits |
| Branch | 8 | 64 bits | Static | For indirect branches; B0=return link |
| Application/Control | ~128 | Varies (mostly 64 bits) | Static | Privilege controls; e.g., LC/EC for loops |
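The register stacking performed by the RSE can be pictured with a much-simplified Python model in which each call renames the stacked registers so the caller's output registers become the callee's r32 and up. Spilling to the backing store is omitted entirely, and the frame sizes below are arbitrary:

```python
# Much-simplified model of the IA-64 register stack described above.
class RegisterStack:
    def __init__(self):
        self.phys = [0] * 96      # physical stacked registers (gr32..gr127)
        self.base = 0             # where the current frame starts
        self.frames = []          # saved caller bases

    def reg(self, n):
        """Index of logical register gr{n} (n >= 32) in the current frame."""
        return self.base + (n - 32)

    def call(self, caller_locals):
        # Callee's gr32 lands on the caller's first output register.
        self.frames.append(self.base)
        self.base += caller_locals

    def ret(self):
        self.base = self.frames.pop()

rs = RegisterStack()
rs.phys[rs.reg(34)] = 123      # caller puts an argument in its first output reg
rs.call(caller_locals=2)       # caller's frame: gr32..gr33 local, gr34+ output
print(rs.phys[rs.reg(32)])     # callee sees the argument as gr32 -> 123
rs.ret()
print(rs.phys[rs.reg(34)])     # caller still sees it in gr34 -> 123
```

The backing store mentioned earlier (pointed to by `bsp`) is where the hardware would spill these physical registers when the stack overflows, a behavior this sketch leaves out.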
