Stack machine
from Wikipedia

In computer science, computer engineering and programming language implementations, a stack machine is a computer processor or a process virtual machine in which the primary interaction is moving short-lived temporary values to and from a push-down stack. In the case of a hardware processor, a hardware stack is used. The use of a stack significantly reduces the required number of processor registers. Stack machines extend push-down automata with additional load/store operations or multiple stacks and hence are Turing-complete.

Design

Most or all stack machine instructions assume that operands are taken from the stack and results placed back on it. The stack easily holds more than two inputs or more than one result, so a rich set of operations can be computed. In stack machine code (sometimes called p-code), instructions frequently have only an opcode commanding an operation, with no additional fields identifying a constant, register or memory cell; this is known as a zero-address format.[1] A computer that operates in such a way that the majority of its instructions do not include explicit addresses is said to use zero-address instructions.[2] This greatly simplifies instruction decoding. Branches, load immediates, and load/store instructions require an argument field, but stack machines often arrange that the frequent cases of these still fit together with the opcode into a compact group of bits. The selection of operands from prior results is done implicitly by ordering the instructions. Some stack machine instruction sets are intended for interpretive execution of a virtual machine, rather than driving hardware directly.

Integer constant operands are pushed by Push or Load Immediate instructions. Memory is often accessed by separate Load or Store instructions containing a memory address or calculating the address from values in the stack. All practical stack machines have variants of the load–store opcodes for accessing local variables and formal parameters without explicit address calculations. This can be by offsets from the current top-of-stack address, or by offsets from a stable frame-base register.

The instruction set carries out most ALU actions with postfix (reverse Polish notation) operations that work only on the expression stack, not on data registers or main memory cells. This can be very convenient for executing high-level languages because most arithmetic expressions can be easily translated into postfix notation.

Binary syntax tree for the expression A*(B−C) + (D+E)

For example, consider the expression A*(B−C)+(D+E), written in reverse Polish notation as A B C − * D E + +. Compiling and running this on a simple imaginary stack machine would take the form:

                 # stack contents (leftmost = top = most recent):
 push A          #           A
 push B          #     B     A
 push C          # C   B     A
 subtract        #     B−C   A
 multiply        #           A*(B−C)
 push D          #     D     A*(B−C)
 push E          # E   D     A*(B−C)
 add             #     D+E   A*(B−C)
 add             #           A*(B−C)+(D+E)

The arithmetic operations "subtract", "multiply", and "add" act on the two topmost operands of the stack. The computer takes both operands from the topmost (most recent) values of the stack. The computer replaces those two values with the calculated difference, sum, or product. In other words the instruction's operands are "popped" off the stack, and its result(s) are then "pushed" back onto the stack, ready for the next instruction.
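
To make this pop-pop-push cycle concrete, here is a minimal Python sketch of such an evaluator. The instruction names mirror the listing above; the run helper and the sample values for A through E are illustrative assumptions, not part of any real machine.

def run(program, env):
    stack = []
    for op, *arg in program:
        if op == "push":
            stack.append(env[arg[0]])
        else:                      # binary ALU op: pop two, push one result
            right = stack.pop()    # topmost (most recent) operand
            left = stack.pop()
            if op == "add":
                stack.append(left + right)
            elif op == "subtract":
                stack.append(left - right)
            elif op == "multiply":
                stack.append(left * right)
    return stack.pop()             # the final result is left on top

# A*(B-C)+(D+E) compiled exactly as in the listing above:
program = [("push", "A"), ("push", "B"), ("push", "C"),
           ("subtract",), ("multiply",),
           ("push", "D"), ("push", "E"), ("add",), ("add",)]
print(run(program, {"A": 2, "B": 7, "C": 3, "D": 1, "E": 4}))  # 2*(7-3)+(1+4) = 13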

Stack machines may have their expression stack and their call-return stack separated or as one integrated structure. If they are separated, the instructions of the stack machine can be pipelined with fewer interactions and less design complexity, so the machine will usually run faster.

Optimisation of compiled stack code is quite possible. Back-end optimisation of compiler output has been demonstrated to significantly improve code,[3][4] and potentially performance, whilst global optimisation within the compiler itself achieves further gains.[5]

Stack storage

Some stack machines have a stack of limited size, implemented as a register file. The ALU will access this with an index. A large register file uses a lot of transistors and hence this method is only suitable for small systems. A few machines have both an expression stack in memory and a separate register stack. In this case, software or an interrupt may move data between them. Some machines have a stack of unlimited size, implemented as an array in RAM, which is cached by some number of "top of stack" address registers to reduce memory access. Except for explicit "load from memory" instructions, the order of operand usage is identical with the order of the operands in the data stack, so excellent prefetching can be accomplished easily.

Consider X+1. It compiles to Load X; Load 1; Add. With a stack stored completely in RAM, this does implicit writes and reads of the in-memory stack:

  • Load X, push to memory
  • Load 1, push to memory
  • Pop 2 values from memory, add, and push result to memory

for a total of 5 data cache references.

The next step up from this is a stack machine or interpreter with a single top-of-stack register. The above code then does:

  • Load X into empty TOS register (if hardware machine) or Push TOS register to memory, Load X into TOS register (if interpreter)
  • Push TOS register to memory, Load 1 into TOS register
  • Pop left operand from memory, add to TOS register and leave it there

for a total of 3 data cache references, worst-case. Generally, interpreters don't track emptiness, because they don't have to—anything below the stack pointer is a non-empty value, and the TOS cache register is always kept hot. Typical Java interpreters do not buffer the top-of-stack this way, however, because the program and stack have a mix of short and wide data values.

If the hardwired stack machine has 2 or more top-stack registers, or a register file, then all memory access is avoided in this example and there is only 1 data cache cycle.
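
The difference in memory traffic can be sketched in Python. The MemoryStack class below is a toy model in which every push or pop counts as one data reference; the two functions replay the X+1 example with and without a top-of-stack register. Names and counts are illustrative, and the TOS version starts with an empty register, so it shows the 2-reference best case rather than the 3-reference worst case quoted above.

class MemoryStack:
    """Stack kept entirely in 'RAM': every push/pop is a memory reference."""
    def __init__(self):
        self.ram, self.refs = [], 0
    def push(self, v):
        self.ram.append(v); self.refs += 1
    def pop(self):
        self.refs += 1
        return self.ram.pop()

def add_no_tos(x):
    s = MemoryStack()
    s.push(x)                 # Load X: 1 write
    s.push(1)                 # Load 1: 1 write
    a, b = s.pop(), s.pop()   # 2 reads
    s.push(a + b)             # 1 write -> 5 references total
    return s.refs

def add_with_tos(x):
    s = MemoryStack()
    tos = x                   # Load X into empty TOS register: no traffic
    s.push(tos)               # Load 1: spill old TOS to memory (1 write)
    tos = 1
    tos = s.pop() + tos       # Add: pop left operand (1 read), sum stays in TOS
    return s.refs             # -> 2 references here (3 worst-case, per the text)

print(add_no_tos(41), add_with_tos(41))  # 5 2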

History and implementations

A description of such a method, requiring only two values at a time to be held in registers, with a limited set of pre-defined operands that could be extended by defining further operands, functions and subroutines, was first provided at a conference by Robert S. Barton in 1961.[6][7]

Commercial stack machines

Examples of stack instruction sets directly executed in hardware include

Virtual stack machines

Examples of virtual stack machines interpreted in software:

Hybrid machines

Pure stack machines are quite inefficient for procedures which access multiple fields from the same object. The stack machine code must reload the object pointer for each pointer+offset calculation. A common fix for this is to add some register-machine features to the stack machine: a visible register file dedicated to holding addresses, and register-style instructions for doing loads and simple address calculations. It is uncommon to have the registers be fully general purpose, because then there is no strong reason to have an expression stack and postfix instructions.

Another common hybrid is to start with a register machine architecture, and add another memory address mode which emulates the push or pop operations of stack machines: "memaddress = reg; reg += instr.displ". This was first used in DEC's PDP-11 minicomputer.[24] This feature was carried forward in VAX computers and in Motorola 6809 and 68000 series microprocessors. This allowed the use of simpler stack methods in early compilers. It also efficiently supported virtual machines using stack interpreters or threaded code. However, this feature did not help the register machine's own code to become as compact as pure stack machine code. Also, the execution speed was less than when compiling well to the register architecture. It is faster to change the top-of-stack pointer only occasionally (once per call or return) rather than constantly stepping it up and down throughout each program statement, and it is even faster to avoid memory references entirely.
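
The autoincrement/autodecrement address modes can be sketched as follows, with PDP-11-style word size and syntax in the comments; the mem/reg dictionaries are toy stand-ins for RAM and the register file.

mem = {}                 # sparse "RAM": address -> word
reg = {"sp": 0x1000}     # stack pointer register; stack grows downward

def push(value):         # MOV value, -(SP)
    reg["sp"] -= 2       # predecrement by the 2-byte word size
    mem[reg["sp"]] = value

def pop():               # MOV (SP)+, dst
    value = mem[reg["sp"]]
    reg["sp"] += 2       # postincrement by the word size
    return value

push(7); push(9)
print(pop(), pop())      # 9 7
print(hex(reg["sp"]))    # 0x1000: back to the starting SP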

More recently, so-called second-generation stack machines have adopted a dedicated collection of registers to serve as address registers, off-loading the task of memory addressing from the data stack. For example, MuP21 relies on a register called "A", while the more recent GreenArrays processors rely on two registers: A and B.[25]

The Intel x86 family of microprocessors has a register-style (accumulator) instruction set for most operations, but uses stack instructions for its x87 floating-point arithmetic, dating back to the iAPX87 (8087) coprocessor for the 8086 and 8088. That is, there are no programmer-accessible floating-point registers, only an 80-bit wide, 8-level deep stack. The x87 relies heavily on the x86 CPU for assistance in performing its operations.

Computers using call stacks and stack frames

Most current computers (of any instruction set style) and most compilers use a large call-return stack in memory to organize the short-lived local variables and return links for all currently active procedures or functions. Each nested call creates a new stack frame in memory, which persists until that call completes. This call-return stack may be entirely managed by the hardware via specialized address registers and special address modes in the instructions. Or it may be merely a set of conventions followed by the compilers, using generic registers and register+offset address modes. Or it may be something in between.

Since this technique is now nearly universal, even on register machines, it is not helpful to refer to all these machines as stack machines. That term is commonly reserved for machines which also use an expression stack and stack-only arithmetic instructions to evaluate the pieces of a single statement.

Computers commonly provide direct, efficient access to the program's global variables and to the local variables of only the current innermost procedure or function, the topmost stack frame. "Up level" addressing of the contents of callers' stack frames is usually not needed and not supported as directly by the hardware. If needed, compilers support this by passing in frame pointers as additional, hidden parameters.

Some Burroughs stack machines do support up-level refs directly in the hardware, with specialized address modes and a special "display" register file holding the frame addresses of all outer scopes. Today, only the MCST Elbrus line does this in hardware. When Niklaus Wirth developed the first Pascal compiler for the CDC 6000 series, he found that it was faster overall to pass the frame pointers as a chain, rather than constantly updating complete arrays of frame pointers. This software method also adds no overhead for common languages like C which lack up-level refs.
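
A small sketch of the frame-pointer chain Wirth chose: each activation record stores one static link to the frame of its lexically enclosing procedure, and an up-level reference follows that chain. The Frame class and the variable names are hypothetical.

class Frame:
    def __init__(self, static_link, locals_):
        self.static_link = static_link   # frame of the enclosing scope
        self.locals = locals_

def uplevel_load(frame, hops, name):
    """Read a variable 'hops' lexical levels above the current frame."""
    for _ in range(hops):
        frame = frame.static_link        # follow the chain, one link per level
    return frame.locals[name]

outer = Frame(None, {"n": 10})
middle = Frame(outer, {"i": 3})
inner = Frame(middle, {"t": 0})
print(uplevel_load(inner, 2, "n"))       # follows 2 links: prints 10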

The same Burroughs machines also supported nesting of tasks or threads. The task and its creator share the stack frames that existed at the time of task creation, but not the creator's subsequent frames nor the task's own frames. This was supported by a cactus stack, whose layout diagram resembled the trunk and arms of a Saguaro cactus. Each task had its own memory segment holding its stack and the frames that it owns. The base of this stack is linked to the middle of its creator's stack. In machines with a conventional flat address space, the creator stack and task stacks would be separate heap objects in one heap.

In some programming languages, the outer-scope data environments are not always nested in time. These languages organize their procedure "activation records" as separate heap objects rather than as stack frames appended to a linear stack.

In simple languages like Forth that lack local variables and naming of parameters, stack frames would contain nothing more than return branch addresses and frame management overhead. So their return stack holds bare return addresses rather than frames. The return stack is separate from the data value stack, to improve the flow of call setup and returns.
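
The split can be sketched with two Python lists: the return stack holds bare return addresses while operands flow across the call on the data stack, so a call builds no frame at all. The tiny instruction set and program layout here are assumptions for illustration.

data, rstack = [], []

def execute(program, pc=0):
    while pc < len(program):
        op, *arg = program[pc]
        if op == "call":
            rstack.append(pc + 1)   # bare return address, nothing else
            pc = arg[0]
            continue
        if op == "ret":
            if not rstack:          # return from the top level: stop
                return
            pc = rstack.pop()
            continue
        if op == "push":
            data.append(arg[0])
        elif op == "add":
            data.append(data.pop() + data.pop())
        pc += 1

# main: push 3, push 4, call SUM (index 4), ret ; SUM: add, ret
execute([("push", 3), ("push", 4), ("call", 4), ("ret",), ("add",), ("ret",)])
print(data)  # [7] -- the operands crossed the call on the data stack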

Comparison with register machines

Stack machines are often compared to register machines, which hold values in an array of registers. Register machines may store stack-like structures in this array, but a register machine has instructions which circumvent the stack interface. Register machines routinely outperform stack machines,[26] and stack machines have remained a niche player in hardware systems. But stack machines are often used in implementing virtual machines because of their simplicity and ease of implementation.[27]

Instructions

Stack machines have higher code density. In contrast to common stack machine instructions, which easily fit in 6 bits or fewer, register machines require two or three register-number fields per ALU instruction to select operands; the densest register machines average about 16 bits per instruction plus the operands. Register machines also use a wider offset field for load-store opcodes. A stack machine's compact code naturally fits more instructions in cache, and therefore can achieve better cache efficiency, reducing memory costs or permitting faster memory systems for a given cost. In addition, most stack-machine instructions are very simple, consisting of only an opcode field or one operand field. Thus, stack machines require very few electronic resources to decode each instruction.

A program has to execute more instructions when compiled to a stack machine than when compiled to a register machine or memory-to-memory machine. Every variable load or constant requires its own separate Load instruction, instead of being bundled within the instruction which uses that value. The separated instructions may be simple and faster running, but the total instruction count is still higher.

Most register interpreters specify their registers by number. But a host machine's registers can't be accessed as an indexed array, so a memory array is allotted for the virtual registers. Therefore, the instructions of a register interpreter must use memory for passing generated data to the next instruction. This forces register interpreters to be much slower on microprocessors made with a fine process rule (i.e. faster transistors without improved circuit speeds, such as the Haswell x86), where memory access requires several clocks but register access only one. In the case of a stack machine with a data forwarding circuit instead of a register file, stack interpreters can allot the host machine's registers for the top several operands of the stack instead of the host machine's memory.

In a stack machine, the operands used in the instructions are always at a known offset (given by the stack pointer) from a fixed location (the bottom of the stack, which in a hardware design might always be at memory location zero), saving precious in-cache or in-CPU storage that would otherwise hold memory addresses or index numbers. This may preserve such registers and cache for use in non-flow computation.[citation needed]

Temporary / local values

Some in the industry believe that stack machines execute more data cache cycles for temporary values and local variables than do register machines.[28]

On stack machines, temporary values often get spilled into memory, whereas on machines with many registers these temps usually remain in registers. (However, these values often need to be spilled into "activation frames" at the end of a procedure's definition, basic block, or at the very least, into a memory buffer during interrupt processing). Values spilled to memory add more cache cycles. This spilling effect depends on the number of hidden registers used to buffer top-of-stack values, upon the frequency of nested procedure calls, and upon host computer interrupt processing rates.

On register machines using optimizing compilers, it is very common for the most-used local variables to remain in registers rather than in stack frame memory cells. This eliminates most data cache cycles for reading and writing those values. The development of "stack scheduling" for performing live-variable analysis, and thus retaining key variables on the stack for extended periods, helps address this concern.[3][4][5]

On the other hand, register machines must spill many of their registers to memory across nested procedure calls. The decision of which registers to spill, and when, is made statically at compile time rather than according to the dynamic depth of the calls. This can lead to more data cache traffic than in an advanced stack machine implementation.

Common subexpressions

In register machines, a common subexpression (a subexpression which is used multiple times with the same result value) can be evaluated just once and its result saved in a fast register. The subsequent reuses have no time or code cost, just a register reference. This optimization speeds simple expressions (for example, loading variable X or pointer P) as well as less-common complex expressions.

With stack machines, in contrast, results can be stored in one of two ways. Firstly, results can be stored using a temporary variable in memory. Storing and subsequent retrievals cost additional instructions and additional data cache cycles. Doing this is only a win if the subexpression computation costs more in time than fetching from memory, which on most stack CPUs is almost always the case. It is never worthwhile for simple variables and pointer fetches, because those already cost one data cache cycle per access. It is only marginally worthwhile for expressions such as X+1. These simpler expressions make up the majority of redundant, optimizable expressions in programs written in languages other than concatenative languages. An optimizing compiler can only win on redundancies that the programmer could have avoided in the source code.[citation needed]

The second way leaves a computed value on the data stack, duplicating it as needed. This uses operations to copy stack entries. The stack must be shallow enough for the CPU's available copy instructions to reach. Hand-written stack code often uses this approach, and achieves speeds like those of general-purpose register machines.[29][9] Unfortunately, algorithms for optimal "stack scheduling" are not in wide use by programming languages.
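
A sketch of this second approach: the common subexpression P+1 is computed once, then copied with a DUP-style operation instead of being stored to and reloaded from a temporary. The helper names are illustrative.

stack = []

def push(v): stack.append(v)
def dup():  stack.append(stack[-1])        # copy the top entry
def add():  stack.append(stack.pop() + stack.pop())
def mul():  stack.append(stack.pop() * stack.pop())

# (P+1) * (P+1) with P = 6, evaluating P+1 only once:
push(6); push(1); add()   # P+1 computed once         stack: [7]
dup()                     # duplicated, not reloaded  stack: [7, 7]
mul()                     # (P+1)*(P+1)               stack: [49]
print(stack.pop())        # 49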

Pipelining

In modern machines, the time to fetch a variable from the data cache is often several times longer than the time needed for basic ALU operations. A program runs faster without stalls if its memory loads can be started several cycles before the instruction that needs that variable. Complex machines can do this with a deep pipeline and "out-of-order execution" that examines and runs many instructions at once. Register machines can even do this with much simpler "in-order" hardware, a shallow pipeline, and slightly smarter compilers. The load step becomes a separate instruction, and that instruction is statically scheduled much earlier in the code sequence. The compiler puts independent steps in between.

Scheduling memory accesses requires explicit, spare registers. It is not possible on stack machines without exposing some aspect of the micro-architecture to the programmer. For the expression A B −, B must be evaluated and pushed immediately prior to the Minus step. Without stack permutation or hardware multithreading, relatively little useful code can be put in between while waiting for the Load B to finish. Stack machines can work around the memory delay either by having a deep out-of-order execution pipeline covering many instructions at once, or, more likely, by permuting the stack so that they can work on other workloads while the load completes, or by interlacing the execution of different program threads, as in the Unisys A9 system.[30] Today's increasingly parallel computational loads suggest, however, that this might not be the disadvantage it has been made out to be in the past.

Stack machines can omit the operand-fetching stage of a register machine.[29] For example, in the Java Optimized Processor (JOP) microprocessor the top 2 operands of the stack directly enter a data forwarding circuit that is faster than the register file.[31]

Out-of-order execution

The Tomasulo algorithm finds instruction-level parallelism by issuing instructions as their data becomes available. Conceptually, the addresses of positions in a stack are no different than the register indexes of a register file. This view permits the out-of-order execution of the Tomasulo algorithm to be used with stack machines.

Out-of-order execution in stack machines seems to reduce or avoid many theoretical and practical difficulties.[32] The cited research shows that such a stack machine can exploit instruction-level parallelism, and the resulting hardware must cache data for the instructions. Such machines effectively bypass most memory accesses to the stack. The result achieves throughput (instructions per clock) comparable to load–store architecture machines, with much higher code densities (because operand addresses are implicit).

One issue brought up in the research was that it takes about 1.88 stack-machine instructions to do the work of one instruction on a load-store architecture machine.[citation needed] Competitive out-of-order stack machines therefore require about twice as many electronic resources to track instructions ("issue stations"). This might be compensated by savings in instruction cache and memory and instruction decoding circuits.

Hides a faster register machine inside

Some simple stack machines have a chip design which is fully customized all the way down to the level of individual registers. The top of stack address register and the N top of stack data buffers are built from separate individual register circuits, with separate adders and ad hoc connections.

However, most stack machines are built from larger circuit components where the N data buffers are stored together within a register file and share read/write buses. The decoded stack instructions are mapped into one or more sequential actions on that hidden register file. Loads and ALU ops act on a few topmost registers, and implicit spills and fills act on the bottommost registers. The decoder allows the instruction stream to be compact. But if the code stream instead had explicit register-select fields which directly manipulated the underlying register file, the compiler could make better use of all registers and the program would run faster.

Microprogrammed stack machines are an example of this. The inner microcode engine is some kind of RISC-like register machine or a VLIW-like machine using multiple register files. When controlled directly by task-specific microcode, that engine gets much more work completed per cycle than when controlled indirectly by equivalent stack code for that same task.

The object code translators that translated code for the HP 3000 and Tandem NonStop stack machines to code for the register-based RISC replacements for those machines are another example.[33][34] They translated stack code sequences into equivalent sequences of RISC code. Minor "local" optimizations removed much of the overhead of a stack architecture. Spare registers were used to factor out repeated address calculations. The translated code still retained plenty of emulation overhead from the mismatch between original and target machines. Despite that burden, the cycle efficiency of the translated code matched the cycle efficiency of the original stack code. And when the source code was recompiled directly to the register machine via optimizing compilers, the efficiency doubled. This shows that the stack architecture and its non-optimizing compilers were wasting over half of the power of the underlying hardware.

Register files are good tools for computing because they have high bandwidth and very low latency, compared to memory references via data caches. In a simple machine, the register file allows reading two independent registers and writing of a third, all in one ALU cycle with one-cycle or less latency. Whereas the corresponding data cache can start only one read or one write (not both) per cycle, and the read typically has a latency of two ALU cycles. That's one third of the throughput at twice the pipeline delay. In a complex machine like Athlon that completes two or more instructions per cycle, the register file allows reading of four or more independent registers and writing of two others, all in one ALU cycle with one-cycle latency. Whereas the corresponding dual-ported data cache can start only two reads or writes per cycle, with multiple cycles of latency. Again, that's one third of the throughput of registers. It is very expensive to build a cache with additional ports.

Since a stack is a component of most software programs, even when the software used is not strictly a stack machine, a hardware stack machine might more closely mimic the inner workings of its programs. Processor registers have a high thermal cost, and a stack machine might claim higher energy efficiency.[35]

Interrupts

Responding to an interrupt involves saving the registers to a stack, and then branching to the interrupt handler code. Often stack machines respond more quickly to interrupts, because most parameters are already on a stack and there is no need to push them there. Some register machines deal with this by having multiple register files that can be instantly swapped,[36] but this increases costs and slows down the register file.

Interpreters

Interpreters for virtual stack machines are easier to build than interpreters for register machines; the logic for handling memory address modes is in just one place rather than repeated in many instructions. Stack machines also tend to have fewer variations of an opcode; one generalized opcode will handle both frequent cases and obscure corner cases of memory references or function call setup. (But code density is often improved by adding short and long forms for the same operation.)

Interpreters for virtual stack machines are often slower than interpreters for other styles of virtual machine.[37] This slowdown is worst when running on host machines with deep execution pipelines, such as current x86 chips.

In some interpreters, the interpreter must execute an N-way switch jump to decode the next opcode and branch to its steps for that particular opcode. Another method for selecting opcodes is threaded code. The host machine's prefetch mechanisms are unable to predict and fetch the target of that indexed or indirect jump. So the host machine's execution pipeline must restart each time the hosted interpreter decodes another virtual instruction. This happens more often for virtual stack machines than for other styles of virtual machine.[38]
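
The dispatch problem can be sketched in Python, where a dict lookup stands in for the N-way switch: each iteration performs an indirect "jump" to a handler chosen by the opcode, which is exactly the branch a host pipeline has historically failed to predict. The opcode numbers and handler names are invented for illustration.

def op_push(vm, arg): vm["stack"].append(arg)
def op_add(vm, _):
    s = vm["stack"]; s.append(s.pop() + s.pop())
def op_print(vm, _): print(vm["stack"][-1])

HANDLERS = {0: op_push, 1: op_add, 2: op_print}   # hypothetical opcodes

def interpret(code):
    vm = {"stack": []}
    for opcode, arg in code:          # fetch the next virtual instruction
        HANDLERS[opcode](vm, arg)     # decode + indirect "jump" to its handler

interpret([(0, 2), (0, 3), (1, None), (2, None)])  # prints 5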

One example is the Java programming language. Its canonical virtual machine is specified as an 8-bit stack machine. However, the Dalvik virtual machine for Java used on Android smartphones is a 16-bit virtual-register machine, a choice made for efficiency reasons. Arithmetic instructions directly fetch or store local variables via 4-bit (or larger) instruction fields.[39] Similarly, version 5.0 of Lua replaced its virtual stack machine with a faster virtual register machine.[40][41]

Since the Java virtual machine became popular, microprocessors have employed advanced branch predictors for indirect jumps.[42] This advance avoids most of the pipeline restarts from N-way jumps and eliminates much of the instruction-count cost that affects stack interpreters.

from Grokipedia
A stack machine is a computer architecture that employs a last-in, first-out (LIFO) stack as its central mechanism for managing operands and intermediate results, with instructions operating primarily on the top stack elements via push, pop, and arithmetic operations that implicitly reference these positions rather than explicit addresses. This design, often using reverse Polish notation for postfix expression evaluation, simplifies instruction encoding to zero-operand formats, reducing hardware complexity while facilitating efficient execution of nested operations common in programming languages. Key components typically include a data stack for operands, a return stack for subroutine management, an arithmetic logic unit (ALU) to process top-of-stack values, a program counter, and control logic to sequence instructions.

The concept of stack machines emerged in the late 1950s amid debates over register versus stack-based processing, with early implementations influenced by work on compiler design and expression evaluation. Friedrich L. Bauer and Klaus Samelson are credited with inventing the stack principle in 1957 as part of their contributions to programming language implementation, patenting it as a mechanism for handling nested procedure calls and arithmetic in ALGOL compilers. The influential Burroughs B5000 in 1961 was one of the first commercial stack machines, followed by the English Electric KDF9 (first delivered in 1964), which featured a memory-resident stack, segmented virtual memory, and hardware support for high-level languages like ALGOL and COBOL without assembly code.

In modern computing, stack machines persist primarily as virtual architectures, exemplified by the Java Virtual Machine (JVM), where a per-thread operand stack stores local variables, partial computations, and method arguments during bytecode execution. This approach enables platform-independent code generation and simplifies verification for security, as seen in the JVM's lack of general-purpose registers for intermediate values. Stack machines offer advantages such as compact instruction sets, simpler code generation (no register allocation is required), and consistent performance across diverse hardware, though they may incur overhead from frequent stack manipulations compared to register-rich designs.

Fundamentals

Definition and Principles

A stack machine is a type of computer architecture or abstract machine that employs a last-in, first-out (LIFO) stack as the primary structure for storing and manipulating operands during arithmetic and logical operations. In this model, instructions either push constants or operands onto the stack or pop values from it to execute computations, with results subsequently pushed back onto the stack for further use. This approach contrasts with register-based architectures by centralizing temporary data handling in the stack rather than distributed registers.

The fundamental principles of stack machines revolve around postfix notation, also known as reverse Polish notation (RPN), where operators follow their operands, enabling straightforward evaluation without parentheses or operator precedence rules. Expressions are thus processed sequentially, with the stack implicitly managing operand order and temporaries; there is no need for explicit register addressing, as the stack itself serves as the core storage for immediate computations. This LIFO discipline ensures that the most recent operand is always accessible at the top of the stack, facilitating efficient operation execution.

Key advantages of stack machines include their inherent simplicity in instruction decoding, as many operations require no operand fields and can be executed in a single cycle by directly accessing the top-of-stack elements. This design also reduces hardware complexity for arithmetic logic units (ALUs), since computations operate solely on stack tops without intricate addressing hardware, allowing more resources to be allocated to memory or specialized functions. For instance, adding two values A and B involves pushing A, pushing B, and issuing an "add" instruction that pops both, computes the sum, and pushes the result—far simpler than a register machine's sequence of loading A into R1, B into R2, and adding R1 to R2 into R3.

In theoretical terms, stack machines align closely with pushdown automata, computational models that augment finite automata with a stack to recognize context-free languages, demonstrating the stack's power in handling nested structures beyond regular languages.

Core Operations

In a stack machine, computations are performed using a last-in, first-out (LIFO) stack as the primary data structure for holding operands and intermediate results, with instructions that implicitly reference the top elements without explicit addressing. The core operations revolve around manipulating this stack to execute arithmetic, logical, and control tasks.

Primary among these are the push and pop instructions, which manage data entry and retrieval. The push operation loads a constant value or an address onto the top of the stack, effectively adding a new element for subsequent use. Conversely, the pop operation removes the top element from the stack, typically to store it in memory, use it in a computation, or discard it, thereby freeing the stack position. These operations form the foundation for all data flow, as stack machines rely entirely on stack depth to manage temporaries rather than dedicated registers for operands.

Binary operations, such as addition, subtraction, and multiplication, operate on the two topmost stack elements. These instructions pop two values from the stack—first the top (second operand) and then the next (first operand)—apply the specified operation, and push the resulting value back onto the stack. For instance, an add instruction would pop values b and a, compute a + b, and push the sum. Unary operations, like negation or duplication, affect only the top stack element: they pop one value, perform the operation (e.g., negating it to −a), and push the result. Duplication, a common unary operation, pops the top value and pushes it back twice, effectively copying it for reuse without additional storage. These operations ensure efficient evaluation of expressions in postfix (reverse Polish) notation, where operators follow their operands.

Control flow instructions extend stack manipulation to program execution paths. Unconditional jumps alter the program counter directly to a specified address, while conditional branches test the top stack element—often comparing it to zero or checking its sign—and pop it before branching if the condition holds. For example, a branch-if-zero instruction pops the top value and jumps if it equals zero, enabling decisions based on prior computations. Subroutine calls push the return address onto a separate return stack and jump to the target, with returns popping the address to resume execution. These mechanisms integrate stack-based data with procedural control.

Stack underflow occurs when a pop or operation attempts to access elements beyond the stack's current depth, while overflow happens upon pushing beyond the allocated capacity; both are inherent risks in the design and typically trigger traps, halts, or errors to prevent invalid states. Handling may involve ignoring the condition in simple implementations or dynamically extending the stack via memory allocation in more robust systems.

A representative example is evaluating the infix expression 2 + 3 × 4, which in postfix notation becomes 2 3 4 * +. The sequence of operations is as follows:

 push 2   // Stack: [2]
 push 3   // Stack: [2, 3]
 push 4   // Stack: [2, 3, 4]
 mul      // Pop 4 and 3, push 12; Stack: [2, 12]
 add      // Pop 12 and 2, push 14; Stack: [14]

This demonstrates how stack operations directly mirror postfix evaluation, with the final result left on the stack top.

Design Features

Stack Organization

In stack machines, the stack serves as the primary memory region for temporary data storage and operand manipulation, typically implemented as a contiguous array within the main RAM to facilitate efficient last-in-first-out (LIFO) access. A dedicated stack pointer (SP) register tracks the address of the top of the stack, pointing to the most recently added element or the next available location depending on the design. Stack growth direction varies by architecture; for instance, the seminal Burroughs B5000 employs an upward-growing stack for local variables, where the SP increments to allocate new space, while many other systems, including modern processors with stack support, grow the stack downward to simplify collision avoidance with the heap.

Stack depth management balances fixed and dynamic sizing to accommodate computation needs without excessive hardware complexity. In fixed-depth designs, the stack is confined to a predetermined buffer size, often holding the top few elements (e.g., two in the B5000's A and B registers) to accelerate operations while spilling to memory as needed. Dynamic sizing allows the stack to expand into available memory, managed by the SP, though some architectures incorporate a frame pointer (FP) register to delineate subroutine boundaries and preserve activation records during recursive or nested calls. This FP-SP pairing enables reliable restoration of prior stack states upon subroutine return, mitigating risks in deeper call hierarchies.

Access to stack elements occurs indirectly through the SP, enforcing sequential push and pop operations without support for random indexing, which promotes disciplined data flow and simplifies instruction decoding. Hardware typically automates SP adjustments, incrementing or decrementing it during pushes and pops to maintain integrity, often in a single clock cycle for performance. In the B5000, for example, the top two stack elements reside in dedicated A and B registers for direct arithmetic, with memory access routed via the SP only when the stack depth exceeds this buffer.

Key limitations include a fixed maximum depth imposed by available memory or hardware buffers, which guards against infinite recursion by triggering overflow conditions if exceeded. Overflow detection relies on hardware mechanisms such as guard pages or SP boundary checks, generating exceptions when the SP approaches predefined limits to prevent corruption of adjacent memory regions. For a representative 32-bit stack machine, entries are stored as 32-bit words in memory, with the SP itself implemented as a 32-bit address register to span the address space efficiently.
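
A minimal sketch of this bookkeeping, assuming a fixed-capacity, downward-growing stack of 32-bit words: the SP is adjusted on every push and pop, and crossing either boundary raises a trap. The capacity, masking, and trap behaviour are illustrative assumptions.

class StackError(Exception):
    pass

class HardwareStack:
    def __init__(self, capacity=256):
        self.mem = [0] * capacity
        self.sp = capacity            # empty: SP sits one past the last slot

    def push(self, word):
        if self.sp == 0:
            raise StackError("overflow trap")   # SP hit the guard limit
        self.sp -= 1                  # grows downward
        self.mem[self.sp] = word & 0xFFFFFFFF   # 32-bit entries

    def pop(self):
        if self.sp == len(self.mem):
            raise StackError("underflow trap")  # nothing below SP
        word = self.mem[self.sp]
        self.sp += 1
        return word

s = HardwareStack(4)
s.push(10); s.push(20)
print(s.pop(), s.pop())   # 20 10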

Instruction Design

In stack machines, instructions are typically designed with a short, fixed-length format to promote simplicity and efficient decoding. The core of each instruction is an opcode, often 8 to 16 bits long, which specifies the operation without explicit operand fields, as operands are implicitly provided by the top elements of the stack. This zero-operand approach eliminates the need for destination specifiers or register indices, reducing instruction complexity compared to register-based architectures. Immediate values, when required for operations like pushing constants onto the stack, may be embedded directly in the instruction for small values (e.g., 8-bit immediates) or handled via multi-word literals for larger constants. Such formats enable fixed-length instructions, which streamline hardware decoding while maintaining compact encoding.

Opcodes in stack machines are categorized to cover essential computational needs, with arithmetic and logical operations forming a primary group (e.g., ADD, SUB, AND, OR), typically encoded in 1 to 2 bytes. Control flow instructions handle branches and jumps, while load and store operations manage data movement between the stack and memory, often using the stack pointer to imply addresses. Stack manipulation opcodes, such as DUP (duplicate top element) or SWAP (exchange top two elements), further optimize common patterns without additional operands. These categories leverage the stack's last-in, first-out structure, where the encoding efficiency arises from treating stack depth as an implicit register file—no explicit addressing modes are needed for most operations, allowing opcodes to focus solely on the action.

Variations in instruction design emphasize zero-operand dominance for core operations, though some architectures incorporate limited immediates to handle small constants efficiently, avoiding separate load instructions. This approach supports postfix (Reverse Polish) notation in code generation, which enhances density by eliminating parentheses and operator precedence rules—for instance, the expression (A + B) * C translates directly to PUSH A, PUSH B, ADD, PUSH C, MUL without infix parsing overhead. Overall, these designs yield higher code density than register machines, as instructions remain short and uniform. However, trade-offs include simpler decoder hardware due to the lack of operand decoding, balanced against the potential for deeper stack usage in complex expressions, which can increase memory pressure.
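
This encoding style can be sketched end to end: one-byte opcodes, with PUSH followed by an 8-bit immediate, assembled and then executed by a small loop. The opcode numbers are invented for illustration; the example shows (A + B) * C compiling to PUSH A, PUSH B, ADD, PUSH C, MUL in eight bytes.

PUSH, ADD, MUL = 0x01, 0x10, 0x11   # hypothetical opcode assignments

def assemble(ops):
    code = bytearray()
    for op, *imm in ops:
        code.append(op)
        if op == PUSH:
            code.append(imm[0] & 0xFF)   # an 8-bit immediate follows PUSH
    return bytes(code)

def execute(code):
    stack, pc = [], 0
    while pc < len(code):
        op = code[pc]; pc += 1
        if op == PUSH:
            stack.append(code[pc]); pc += 1
        elif op == ADD:
            stack.append(stack.pop() + stack.pop())
        elif op == MUL:
            stack.append(stack.pop() * stack.pop())
    return stack.pop()

# (A + B) * C with A=2, B=3, C=4:
code = assemble([(PUSH, 2), (PUSH, 3), (ADD,), (PUSH, 4), (MUL,)])
print(len(code), execute(code))   # 8 bytes of code, result 20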

Historical Context

Origins and Early Machines

The theoretical foundations of stack machines emerged in the mid-1950s amid efforts to simplify expression evaluation in programming languages and compilers, drawing on postfix (reverse Polish) notation to avoid the complexities of algebraic infix notation with parentheses. This notation, originally developed by Jan Łukasiewicz in the 1920s, gained traction in computing for its unambiguous, stack-friendly structure where operands precede operators, enabling straightforward push-pop operations.

In 1957, Friedrich L. Bauer and Klaus Samelson developed the stack principle as part of their work on ALGOL compilers, patenting it as a mechanism for handling nested procedure calls and arithmetic. Independently in the same year, Australian computer scientist Charles L. Hamblin proposed the stack concept—termed a "running accumulator"—as a mechanism for efficient postfix evaluation, describing it in his paper on addressless coding schemes based on mathematical notation. Hamblin's ideas emphasized stacks for arithmetic and logical operations without explicit addressing, influencing early compiler designs by reducing the need for complex register management.

These concepts intersected with work on recursive function evaluation, notably John McCarthy's 1960 design for the Lisp evaluator, which relied on stack-like structures to handle nested expressions and dynamic scoping in lambda calculus interpretations. McCarthy's approach, outlined in his seminal paper on recursive symbolic functions, demonstrated how stacks could support the applicative-order evaluation of lambda terms, providing a computational model for list processing and function application without traditional registers. This theoretical linkage highlighted stacks' suitability for handling recursion and higher-order functions, bridging pure mathematics and practical machine design. Key innovators at Burroughs Corporation, including research director Lyle R. Johnson, drew on such ideas to prioritize hardware support for high-level languages like ALGOL, aiming to eliminate low-level addressing burdens.

The first hardware realization came with the Burroughs B5000 in 1961, marking the debut of a commercial stack machine featuring a hardware-implemented operand stack integrated into main memory, along with a tagged architecture for data type enforcement and bounds checking. The B5000's design used a deep stack for expression evaluation and procedure calls, with the top two elements cached in CPU registers, enabling zero-operand instructions that operated directly on stack tops. Early academic prototypes in the late 1950s, such as the IPL-VI system at Rand Corporation and the ALCOR project in Europe, experimented with stack-based evaluation for ALGOL parsing and parameter passing, validating theoretical concepts on limited hardware. Influenced by Hamblin's work, the English Electric KDF-9 (1964) implemented dual stacks—one for operands and one for control—serving as a transitional prototype toward full stack architectures.

By the 1970s, stack principles extended to minicomputers for their simplicity in code generation and resource management, as seen in Hewlett-Packard's HP 3000 series introduced in 1972. The HP 3000 employed a 16-bit stack-based processor without general-purpose registers, using multiple stacks for operands, locals, and control to support multitasking and virtual memory, which streamlined implementation of languages like COBOL and SPL. This evolution underscored stacks' role in making computing more accessible beyond mainframes, prioritizing conceptual elegance over raw speed in mid-sized systems.

Commercial Stack Machines

The Burroughs Corporation developed some of the earliest commercial stack machines, beginning with the B5500 introduced in 1964 as a refinement of the 1961 B5000 design. This system employed a pure stack architecture where arithmetic and logical operations were performed exclusively on operands at the top of the stack, with hardware automatically handling push and pop actions to reduce instruction count and code density. Descriptors provided tagged data representation for type checking and bounds protection, enabling direct support for high-level languages such as ALGOL 60 and COBOL without assembly-level intervention.

The B6700, released in 1971, extended this foundation with a tagged architecture that integrated hardware garbage collection, assigning each process a private evaluation stack shared with a garbage-collected heap for dynamic memory allocation. Stack elements included status bits to indicate valid data positions, and the design emphasized expression evaluation alongside procedure activation, where return addresses and parameters were interleaved on the stack. This hardware-software synergy supported multitasking and virtual memory, marking a significant advancement in commercial mainframe reliability for business and scientific applications.

Several other vendors produced commercial machines incorporating stack elements during the 1960s and 1970s. The CDC 6600 supercomputer from 1964 utilized a partial stack for instruction prefetching, consisting of eight 60-bit registers (I0-I7) functioning as a push-up stack to buffer instruction loops and minimize central memory accesses for short sequences.

Key to the commercial viability of these machines, particularly Burroughs systems, was integrated operating system support; the Master Control Program (MCP) employed stack frames to encapsulate procedure activations, local variables, and control information, automating context switching and resource allocation without explicit programmer management of registers or memory.

Commercial adoption of pure stack hardware peaked in the 1970s but declined sharply by the 1980s, as reduced instruction set computing (RISC) architectures favoring explicit register operations gained prominence for their pipelining efficiency and compiler optimizations. Burroughs MCP-based systems represented the last major hardware implementations, persisting in production through the 1990s for legacy enterprise workloads, with many decommissioned post-2000 as vendors shifted to emulated or hybrid platforms.

Implementations

Virtual Stack Machines

Virtual stack machines are software-based emulations of stack architectures that operate within host environments, such as operating systems or runtime systems, to execute bytecode or intermediate representations without dedicated hardware support. These machines typically simulate stack operations through software constructs like arrays or linked lists managed by loops in the host processor's instruction set, enabling the push, pop, and manipulation of operands in a virtual operand stack. This design promotes portability across diverse hardware platforms by abstracting low-level details, while also enhancing security through mechanisms like type safety verification that prevent invalid operations at runtime.

The Java Virtual Machine (JVM), introduced in 1995 as part of the Java platform, exemplifies a prominent virtual stack machine that processes stack-based bytecode instructions. Each method frame in the JVM maintains a dedicated operand stack for temporary values, with operations like iadd (integer addition) consuming two operands from the stack and pushing the result back. Bytecode verification occurs at load time to ensure type safety and stack discipline, rejecting code that could overflow the stack or violate type constraints. This approach allows Java applications to run portably on any host with a compatible JVM implementation.

Similarly, the .NET Common Language Runtime (CLR), released in 2002 with the .NET Framework, employs a stack-based evaluation stack in its Common Intermediate Language (CIL) for managed code execution. CIL instructions, such as add.ovf for overflow-checked addition, operate on this virtual stack to handle operand passing and computation results, with just-in-time (JIT) compilation translating them to native code on the host CPU. The CLR's typed stack model supports secure execution by enforcing type contracts and bounds checking during verification.

Forth virtual machines provide interactive stack interpreters particularly suited for embedded systems, where the host CPU emulates a data stack and return stack through software routines. These interpreters execute Forth words—primitive or composed operations—by manipulating the stacks in an immediate-execution mode, allowing real-time interaction and extension via user-defined definitions. Their lightweight emulation via simple loops makes them ideal for resource-constrained environments like microcontrollers.

The evolution of virtual stack machines surged in the post-1990s era with the rise of managed runtime environments for scripting and application development. Early implementations, such as Python's CPython evaluator introduced in 1991 but maturing through the 1990s, adopted a stack-based virtual machine to interpret bytecode, using an evaluation stack for expression computation and control flow. This period saw broader adoption for portability in cross-platform scripting, culminating in standardized virtual machines that balanced performance with safety.

WebAssembly (Wasm), standardized in 2017 by the W3C, represents a modern virtual stack machine optimized for browser execution and beyond, employing a structured stack model for computations while using linear memory—a contiguous byte array—for data storage and access. Wasm modules encapsulate functions, memories, and tables, with linking mechanisms allowing instantiation and import/export of components across modules for composable execution. Its stack-oriented instructions, verified for safety, enable secure, high-performance code from languages like C++ or Rust in web environments.

Hybrid and Software-Based Machines

Hybrid stack machines integrate stack-based principles with register-oriented architectures to leverage the simplicity of stacks for specific operations, such as function calls and control flow, while retaining the performance benefits of registers for general computation. In these designs, a dedicated hardware stack pointer manages the call stack for subroutine invocation and return, but the majority of operands are handled via registers, creating a balanced approach that avoids the limitations of pure stack machines.

A prominent example is the x86 architecture, which employs a hardware stack pointer (RSP in x86-64) to implement the call stack for function management, pushing return addresses and local variables onto the stack during procedure calls while relying heavily on general-purpose registers for operand processing. This hybrid model supports efficient context switching in operating systems, where the stack pointer facilitates rapid saving and restoring of execution state. Similarly, the PDP-11 used register R6 as a stack pointer for interrupts, traps, and subroutine calls, with R5 serving as a frame pointer to delineate stack frames containing local variables and parameters, blending stack discipline with its eight general registers.

In CISC architectures like the VAX, software-emulated stacks complement the register-heavy design, particularly for expression evaluation and procedure linkage; the VAX instruction set includes stack manipulation operations that are often implemented via microcode or software routines to handle operand flows not natively suited to its 16 general registers. This emulation allows flexible stack-based computation within a broader register framework, enhancing compatibility with high-level languages that favor stack-like semantics for arithmetic expressions.

Modern hybrid implementations extend these concepts into embedded and secure environments. In ARM's TrustZone technology, stack isolation enforces separation between secure and non-secure worlds, where dedicated secure stacks prevent unauthorized access to sensitive execution contexts, providing hardware-enforced partitioning in mobile and IoT devices. For instance, ARMv8-M's stack sealing extension, introduced in 2016, protects against stack overflows by restricting non-secure access to secure stack regions, a critical feature in resource-constrained systems. In graphics processing units (GPUs), the Single Instruction, Multiple Threads (SIMT) execution model employs stack-like structures for control flow reconvergence; warps of threads share a program counter but use a reconvergence stack to track divergent paths, enabling efficient parallel execution akin to operand stack management in traditional stack machines.

Post-2010 developments in embedded architectures, such as RISC-V, incorporate stack operations through software conventions and optional extensions that enhance stack efficiency without dedicated hardware stacks; the RISC-V ABI designates a stack pointer (x2) for frame management, while compressed instruction extensions (like C, ratified in 2019) reduce code size for stack-intensive routines in mobile and low-power devices.

The call stack, focused on subroutine management via activation records (including return addresses and locals), differs from the operand stack used for temporary computation in pure stack machines; hybrids typically dedicate the call stack to control flow while relegating operand handling to registers, avoiding contention and improving locality. This distinction is evident in OS kernels, where the call stack enables recursive and interrupt-driven execution.

Hybrid designs offer advantages in balancing stack simplicity—reducing instruction complexity for nested operations—with register speed for frequent data access, resulting in denser code and better pipelining in mixed workloads; they are prevalent in kernels for their support of modular, reentrant code without full stack overhead. Security implications include stack canaries, randomized values inserted between buffers and return addresses in hybrid call stacks to detect overflows, a technique widely adopted since the late 1990s and refined in modern compilers for architectures like x86 and ARM to mitigate buffer overflow exploits.

Theoretical and Practical Comparisons

Versus Register Machines: Instructions and Values

Stack machines and register machines differ fundamentally in their instruction sets and the management of temporary values, leading to distinct design trade-offs in code generation and execution semantics. In stack machines, instructions are typically zero-operand operations that implicitly operate on the top elements of the operand stack, such as an ADD instruction that pops two values, adds them, and pushes the result back onto the stack. This design simplifies instruction encoding by eliminating the need to specify operand locations, as the stack's last-in, first-out (LIFO) structure dictates the order of operations. In contrast, register machines employ multi-operand instructions that explicitly name source and destination registers, for example, ADD R1, R2, R3, which adds the contents of R2 and R3 and stores the result in R1. This explicit addressing allows for greater flexibility in value manipulation but requires additional bits in the instruction format to encode register identifiers. Temporary values in stack machines are managed through stack depth rather than named locations, meaning local variables and intermediates are positioned relative to the stack top without explicit naming. This approach avoids the need for register allocation during compilation but can lead to inefficiencies when reusing values, as the stack must be manipulated to access deeper elements, often requiring additional push and pop operations. Register machines, however, store temporary values in explicitly named registers, enabling direct reuse without memory accesses or stack adjustments, which reduces the overall number of load and store instructions in the code. For instance, in handling common subexpressions like repeated use of a constant or intermediate result, stack machines must duplicate values by pushing copies onto the stack, incurring extra operations and potential space overhead. Registers facilitate free reuse of such values, as they remain accessible until overwritten, promoting more efficient code for expressions with shared computations. These differences also impact instruction encoding and code density. Stack machine code tends to be denser for arithmetic expressions due to the compact zero-operand format, often resulting in shorter programs—studies indicate stack-based programs can be 2.5 to 8 times smaller than those for conventional register-based CISC architectures. However, stack machines become more verbose when addressing memory or variables, as explicit load and store instructions are needed to manage stack depth. Register machines require more bits per instruction for operand specifiers, leading to larger code sizes—register-based virtual machine bytecode is typically about 25% larger than stack-based equivalents—but this is offset by fewer instructions overall for complex operations. A concrete example illustrates these trade-offs in compiling the expression a + b * c. In a stack machine, the sequence might be: push a, push b, push c, mul (pops b and c, pushes product), add (pops a and product, pushes sum). This requires five instructions, with implicit operand handling via the stack. For a register machine with three registers, it could be: load a into R1, load b into R2, load c into R3, mul R2 and R3 (storing in R2), add R1 and R2 (storing in R1). Here, five instructions are used, but explicit register naming allows direct reuse of intermediates like the product in R2 without duplication. 
Theoretically, stack and register machines are equivalent in computational power: each is Turing complete, and either can simulate the other with appropriate instruction sequences. In practice, however, the two interact differently with the von Neumann bottleneck: a stack machine's operand traffic is concentrated at the top of the stack, producing predictable, largely sequential memory access patterns, whereas a register machine's transfers between registers and memory can be scattered across arbitrary addresses.

Versus Register Machines: Performance and Optimization

Stack machines have advantages in pipelining because operands are implicit in the top-of-stack (TOS), eliminating the dedicated operand-fetch stage common in register machines. This permits simple two-stage pipelines (instruction fetch and execute), compared with the three or four stages of register-based designs such as RISC, reducing structural and data hazards. Because operands are always taken from the TOS, stack designs avoid many of the false dependencies, the write-after-read (WAR) and write-after-write (WAW) hazards, that register machines must resolve with register renaming to keep overlapping execution correct.

Out-of-order execution, conversely, is more readily supported in register machines, where explicit operand specifiers let the hardware identify true data dependencies and speculate aggressively on independent instructions. Stack machines impose stricter ordering through implicit dependencies on the stack pointer and the TOS, limiting reordering opportunities and potentially increasing stalls in dynamic schedulers. Register architectures are therefore better suited to workloads with irregular control flow, while stacks favor predictable, in-order execution.

Many modern stack-based virtual machines mitigate these limitations by internally mapping stack operations to registers during just-in-time (JIT) compilation. The HotSpot Java Virtual Machine (JVM), for instance, translates stack bytecode into register-allocated native code, using linear-scan allocation to exploit the host processor's registers and reduce memory accesses. This "hiding" of registers lets stack VMs approach the performance of native register code, with JIT optimizations such as copy propagation eliminating a significant portion of redundant instructions.

Interrupt handling is notably simpler in stack machines: because the working state already resides on the stack, little explicit context needs to be saved, and interrupt response can complete in as few as 4 clock cycles. Register machines must instead spill multiple general-purpose registers to memory, incurring higher latency and complexity, especially in pipelined designs where the pipeline may also need to be flushed.

For code optimization, stack machines excel at straight-line code, such as expression evaluation in algebraic languages like ALGOL, where the stack naturally mirrors recursive descent without variable-allocation overhead. Register machines permit better optimization of loops and branches through graph-coloring register allocation, which minimizes spills and enables better instruction scheduling. Historical benchmarks illustrate these dynamics: on one numerical problem (ALGOL on the B5500, PL/I on the IBM 360/40), the stack-based Burroughs B5500 finished in 7 seconds against the register-oriented 360/40's 22 seconds, despite the B5500's slower 6 µs memory cycle versus the 360's 2.5 µs. More recent studies of JIT-compiled virtual machines report that register-based designs run about 18% faster on average by executing fewer instructions, with advanced optimizations narrowing the gap to around 9% for stack machines in some scenarios. Modern stack-based VMs such as WebAssembly likewise rely on JIT compilation to reach near-native performance in web and serverless environments.
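The core of this stack-to-register mapping can be sketched as a compile-time simulation of the operand stack: the translator records which virtual register holds each stack slot, so each zero-operand bytecode becomes a three-address instruction. A minimal Python sketch, assuming a toy bytecode of (opcode, operand...) tuples rather than any real VM's format:

def stack_to_register(bytecode):
    """Translate toy stack bytecode into three-address register code."""
    stack = []       # compile-time model of the operand stack
    out = []         # generated register instructions
    next_reg = 0     # virtual register counter

    def fresh():
        nonlocal next_reg
        next_reg += 1
        return f"r{next_reg}"

    for op, *args in bytecode:
        if op == "push":                   # load a constant or variable
            reg = fresh()
            out.append(("load", reg, args[0]))
            stack.append(reg)
        elif op in ("add", "sub", "mul"):  # binary ops pop two, push one
            rhs, lhs = stack.pop(), stack.pop()
            reg = fresh()
            out.append((op, reg, lhs, rhs))
            stack.append(reg)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return out

# a + b * c:  push a, push b, push c, mul, add
print(stack_to_register([("push", "a"), ("push", "b"),
                         ("push", "c"), ("mul",), ("add",)]))
# [('load', 'r1', 'a'), ('load', 'r2', 'b'), ('load', 'r3', 'c'),
#  ('mul', 'r4', 'r2', 'r3'), ('add', 'r5', 'r1', 'r4')]

A real JIT additionally maps these virtual registers onto the finite set of hardware registers (for example with the linear-scan allocation mentioned above), but the stack-elimination step is the same in spirit.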
In 2020s serverless computing, stack-based VMs such as those underlying Java and WebAssembly benefit from the same JIT techniques in environments like AWS Lambda, where tiered compilation (a fast first tier such as HotSpot's C1 for quick startup, an optimizing tier such as C2 for throughput) achieves near-native performance by allocating registers dynamically, supporting high-scale function invocations with minimal cold-start overhead.

Applications in Interpreters and Compilers

Stack machines play a central role in the design of interpreters for stack-oriented languages, in which execution proceeds by pushing operands onto a stack, operating on the top elements, and popping results. In Forth, a language developed in the 1970s, the interpreter evaluates expressions in reverse Polish notation (RPN) on a data stack, enabling interactive compilation and execution without explicit variable declarations for temporaries; its evaluation loop fetches, decodes, and executes primitives or calls to defined words, a model well suited to real-time embedded systems. Similarly, PostScript, introduced by Adobe in 1984, employs a stack-based interpreter for rendering graphics and documents: operators such as add consume two stack items and push the result, allowing concise description of complex page layouts.

In compilers, stack machines serve as intermediate representations (IRs) that simplify code generation from parse trees, since a postfix traversal maps naturally onto pushes and binary operations. The p-code system for Pascal compilers, originating in the 1970s and popularized by UCSD Pascal, targets a stack-based virtual machine whose portable bytecode an interpreter executes via a fetch-decode-execute cycle, enabling cross-platform deployment without retargeting the entire compiler. The resulting code is compact, since instructions refer only to stack positions rather than named registers, though stack underflow must be managed carefully.

Modern examples include Perl 5's opcode interpreter, which uses a push/pop stack (the "PP stack") to evaluate expression trees during its runops loop, handling temporaries for operations such as arithmetic and control flow. Lua's virtual machine relied on a pure stack model for bytecode execution prior to version 5.0, while later versions use stack frames within a register-based VM to manage activation records and locals.

Stack-based IRs offer advantages in code generation and portability. Emitting bytecode from a parse tree is straightforward, as recursive descent produces RPN sequences that map directly onto stack pushes and pops, deferring the need for register allocation in early phases. Portable bytecode, as in p-code or PostScript, lets a single compiler output run on diverse hardware via a target-specific interpreter, minimizing backend complexity. In just-in-time (JIT) compilers such as V8 for JavaScript, stack-like temporaries appear in intermediate stages before register allocation, aiding initial bytecode generation from abstract syntax trees. The Ethereum Virtual Machine (EVM), specified in 2014, exemplifies the model in blockchain systems: it executes smart contracts as stack-based bytecode on a 1024-item stack of 256-bit words, ensuring deterministic state transitions for transactions.
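The fetch-decode-execute loop shared by these interpreters can be illustrated with a minimal Python sketch; the opcode set and tuple encoding are invented for the example, not taken from any of the systems above:

def run(program, env):
    """Interpret toy stack bytecode: a list of (opcode, operand...) tuples."""
    stack = []
    pc = 0
    while pc < len(program):
        op, *args = program[pc]   # fetch and decode the next instruction
        pc += 1
        if op == "push":          # execute: push a literal
            stack.append(args[0])
        elif op == "load":        # execute: push a variable's value
            stack.append(env[args[0]])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()            # the result is left on top of the stack

# a + b * c  with a=2, b=3, c=4
print(run([("load", "a"), ("load", "b"), ("load", "c"),
           ("mul",), ("add",)], {"a": 2, "b": 3, "c": 4}))   # prints 14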
Despite these benefits, stack machines in interpreters and compilers face performance limitations, which are often addressed by optimizations that transform stack code into register-based equivalents. Peephole optimization scans short instruction sequences to eliminate redundant pushes and pops or to reorder operations, converting stack-relative accesses into direct register loads that execute faster on hardware register machines. For instance, the pattern push A; push B; add; pop C can be rewritten as load reg1, A; load reg2, B; add reg1, reg2; store C, reg1, reducing memory traffic, as sketched below. Such rewriting is common in Forth compilers and bytecode VMs, where the initial stack IR is optimized after generation to approach register-machine efficiency without a full redesign.
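A minimal Python sketch of that rewrite, matching the four-instruction pattern in a linear scan (the register names reg1 and reg2 are illustrative placeholders):

def peephole(code):
    """Rewrite push X; push Y; add; pop Z windows into register code."""
    out, i = [], 0
    while i < len(code):
        w = code[i:i + 4]
        if (len(w) == 4 and w[0][0] == "push" and w[1][0] == "push"
                and w[2] == ("add",) and w[3][0] == "pop"):
            a, b, dst = w[0][1], w[1][1], w[3][1]
            out += [("load", "reg1", a),
                    ("load", "reg2", b),
                    ("add", "reg1", "reg2"),   # reg1 = reg1 + reg2
                    ("store", dst, "reg1")]
            i += 4                             # skip the matched window
        else:
            out.append(code[i])
            i += 1
    return out

print(peephole([("push", "A"), ("push", "B"), ("add",), ("pop", "C")]))
# [('load', 'reg1', 'A'), ('load', 'reg2', 'B'),
#  ('add', 'reg1', 'reg2'), ('store', 'C', 'reg1')]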
