X86 assembly language
x86 assembly language is a family of low-level programming languages that are used to produce object code for the x86 class of processors. These languages provide backward compatibility with CPUs dating back to the Intel 8008 microprocessor, introduced in April 1972.[1][2] As assembly languages, they are closely tied to the architecture's machine code instructions, allowing for precise control over hardware.
In x86 assembly languages, mnemonics are used to represent fundamental CPU instructions, making the code more human-readable compared to raw machine code. Each machine code instruction is an opcode which, in assembly, is replaced with a mnemonic.[3] Each mnemonic corresponds to a basic operation performed by the processor, such as arithmetic calculations, data movement, or control flow decisions. Assembly languages are most commonly used in applications where performance and efficiency are critical. This includes real-time embedded systems, operating-system kernels, and device drivers, all of which may require direct manipulation of hardware resources.
Additionally, compilers for high-level programming languages sometimes generate assembly code as an intermediate step during the compilation process. This allows for optimization at the assembly level before producing the final machine code that the processor executes.
Mnemonics and opcodes
Each instruction in the x86 assembly language is represented by a mnemonic which, often combined with one or more operands, translates into one or more bytes known as an opcode. For example, the NOP instruction translates to the opcode 0x90, and the HLT instruction translates to 0xF4.[3] There are potential opcodes without documented mnemonics, which different processors may interpret differently. Using such opcodes can cause a program to behave inconsistently or even generate exceptions on some processors.
Syntax
x86 assembly language has two primary syntax branches: Intel syntax and AT&T syntax.[4] Intel syntax is dominant in the DOS and Windows environments, while AT&T syntax is dominant in Unix-like systems, as Unix was originally developed at AT&T Bell Labs.[5] Below is a summary of the main differences between Intel syntax and AT&T syntax:
| | AT&T | Intel |
|---|---|---|
| Parameter order | movl $5, %eax | mov eax, 5 |
| Parameter size | addl $0x24, %esp | add esp, 24h |
| | movslq %ecx, %rax | movsxd rax, ecx |
| | paddd %xmm1, %xmm2 | paddd xmm2, xmm1 |
| | | Width-based names may still appear in instructions when they define a different operation. |
| Sigils | Immediate values prefixed with a "$", registers prefixed with a "%".[4] | The assembler automatically detects the type of symbols; i.e., whether they are registers, constants or something else. |
| Effective addresses | movl offset(%ebx, %ecx, 4), %eax | mov eax, [ebx + ecx*4 + offset] |
| | General syntax: displacement(base, index, scale). | |
Many x86 assemblers use Intel syntax, including FASM, MASM, NASM, TASM, and YASM. The GNU Assembler, which originally used AT&T syntax, has supported both syntaxes since version 2.10 via the .intel_syntax directive.[4][6][7] A quirk in the AT&T syntax for x86 is that x87 floating-point operands are reversed, an inherited bug from the original AT&T assembler.[8]
The AT&T syntax is nearly universal across other architectures (retaining the same operand order for the mov instruction); it was originally designed for PDP-11 assembly and was inherited by Unix-like systems. In contrast, the Intel syntax is specific to the x86 architecture and is the one used in the x86 platform's official documentation. The Intel 8080, which predates the x86 architecture, also uses the "destination-first" order for its mov instruction.[9]
Reserved words
In most x86 assembly languages, the reserved words consist of two parts: mnemonics that translate to opcodes, and directives (or "pseudo-ops") that access features in the assembler program beyond the simple translation of opcodes. For a list of the former, see x86 instruction listings. The latter are highly assembler-dependent, with no standard among Intel-syntax assemblers.[10] AT&T-syntax assemblers share a common way of naming directives (all directives start with a dot, like .ascii),[11] and a number of basic directives such as .ascii and .string are broadly supported.[12][13]
Registers
x86 processors feature a set of registers that serve as storage for binary data and addresses during program execution. These registers are categorized into general-purpose registers, segment registers, the instruction pointer, the FLAGS register, and various extension registers introduced in later processor models. Each register has specific functions in addition to its general capabilities:[3]
General-purpose registers
These registers have conventional roles, but usage is not strictly enforced. Programs are generally free to use them for other purposes.
- AX (Accumulator register): Primarily used in arithmetic, logic, and data transfer operations. It is favored by instructions that perform multiplication and division, and by string load and store operations. Immediate ALU operations and exchanges with AX can be encoded more compactly.
- BX (Base register): Base pointer for memory access. It can hold the base address of data structures and is useful in indexed addressing modes. It is used with XLAT.
- CX (Count register): Serves as a counter in loop, string, and shift/rotate instructions. Iterative operations often use CX to determine the number of times a loop or operation should execute.
- DX (Data register): Used in conjunction with AX for multiplication and division operations that produce results larger than 16 bits. It also holds I/O port addresses for IN and OUT instructions.
- SP (Stack pointer): Points to the top of the stack in memory. It is automatically updated during PUSH and POP operations.
- BP (Base pointer): Points to the base of the current stack frame. It is primarily used to access function parameters and local variables within the call stack.
- SI (Source index): Used as a pointer to the source in string and memory array operations. Instructions like MOVS (move string) use SI to read data from memory. Like BX, it can be used for indexing. It can be added to BP or BX for double indexing.
- DI (Destination index): Serves as a pointer to the destination in string and memory array operations. It works alongside SI in instructions that copy or compare data, writing results to memory. Like BX, it can be used for indexing. It can be added to BP or BX for double indexing.
In addition to the general-purpose registers, there are:
- Instruction Pointer (IP): Holds the offset address of the next instruction to be executed within the code segment (CS). It points to the first byte of the next instruction. While the IP register cannot be read directly by programmers, its value changes through control flow instructions such as jumps, calls, and interrupts, which alter the flow of execution.
- FLAGS register: Contains a set of status, control, and system flags that reflect the outcome of operations and control the processor's operations.
- Segment registers (CS, DS, ES, SS): Determine where a 64 KiB segment starts (FS and GS were added in the 80386 and later).
- Extension registers (MMX, 3DNow!, SSE, etc.) (Pentium and later only).
The x86 registers can be used by most instructions. For example, in Intel syntax:
mov ax, 1234h ; copies the value 1234hex (4660d) into register AX
mov bx, ax ; copies the value of the AX register into the BX register
Segmented addressing
The x86 architecture in real and virtual 8086 mode uses a process known as segmentation to address memory, not the flat memory model used in many other environments. Segmentation involves composing a memory address from two parts, a segment and an offset; the segment points to the beginning of a 64 KiB (64×2^10 bytes) group of addresses and the offset determines how far from this beginning address the desired address is. In segmented addressing, two registers are required for a complete memory address: one to hold the segment, the other to hold the offset. In order to translate back into a flat address, the segment value is shifted four bits left (equivalent to multiplication by 2^4, or 16) and then added to the offset to form the full address. This allows breaking the 64 KiB barrier through clever choice of addresses, though it makes programming considerably more complex.
In real mode, for example, if DS contains the hexadecimal number 0xDEAD and DX contains the number 0xCAFE, they would together point to the memory address 0xDEAD * 0x10 + 0xCAFE == 0xEB5CE. Combining a 16-bit segment with a 16-bit offset in this way yields a 20-bit address, so the CPU can address up to 1,048,576 bytes (1 MiB) in real mode.
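The segment arithmetic described above can be modelled in a few lines. The following Python sketch is illustrative only: the function name real_mode_address is hypothetical, and the model does not wrap the result to 20 bits as an 8086 would.

```python
def real_mode_address(segment: int, offset: int) -> int:
    """Return the linear address formed from a 16-bit segment and offset."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # Shifting left by four bits is the same as multiplying by 16.
    return (segment << 4) + offset


# The example from the text: DEAD:CAFE.
print(hex(real_mode_address(0xDEAD, 0xCAFE)))  # 0xeb5ce

# A different segment:offset pair that names the same byte.
print(hex(real_mode_address(0xEB5C, 0x000E)))  # 0xeb5ce
```

Because segments overlap every 16 bytes, many segment:offset pairs refer to the same physical address, as the second call shows.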
The original IBM PC restricted programs to 640 KB but an expanded memory specification was used to implement a bank switching scheme that fell out of use when later operating systems, such as Windows, used the larger address ranges of newer processors and implemented their own virtual memory schemes.
Protected mode, starting with the Intel 80286, was utilized by OS/2. Several shortcomings, such as the inability to access the BIOS and the inability to switch back to real mode without resetting the processor, prevented widespread usage.[14] The 80286 was also still limited to addressing memory in 16-bit segments, meaning only 2^16 bytes (64 kilobytes) could be accessed at a time. To access the extended functionality of the 80286, the operating system would set the processor into protected mode, enabling 24-bit addressing and thus 2^24 bytes of memory (16 megabytes).
In protected mode, the segment selector can be broken down into three parts: a 13-bit index, a Table Indicator bit that determines whether the entry is in the GDT or LDT and a 2-bit Requested Privilege Level; see x86 memory segmentation.
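As a rough illustration of the three selector fields, the following Python sketch splits a 16-bit selector value. The function name decode_selector and the dictionary keys are hypothetical, chosen only for this example.

```python
def decode_selector(selector: int) -> dict:
    """Split a 16-bit protected-mode segment selector into its fields."""
    if not 0 <= selector <= 0xFFFF:
        raise ValueError("selector must fit in 16 bits")
    return {
        "index": selector >> 3,                         # upper 13 bits
        "table": "LDT" if selector & 0b100 else "GDT",  # Table Indicator bit
        "rpl": selector & 0b11,                         # Requested Privilege Level
    }


print(decode_selector(0x23))  # {'index': 4, 'table': 'GDT', 'rpl': 3}
```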
When referring to an address with a segment and an offset, the notation segment:offset is used, so in the above example the flat address 0xEB5CE can be written as 0xDEAD:0xCAFE or as a segment and offset register pair, DS:DX.
There are some special combinations of segment registers and general registers that point to important addresses:
- CS:IP (CS is Code Segment, IP is Instruction Pointer) points to the address where the processor will fetch the next byte of code.
- SS:SP (SS is Stack Segment, SP is Stack Pointer) points to the address of the top of the stack, i.e. the most recently pushed byte.
- SS:BP (SS is Stack Segment, BP is Stack Frame Pointer) points to the address of the top of the stack frame, i.e. the base of the data area in the call stack for the currently active subprogram.
- DS:SI (DS is Data Segment, SI is Source Index) is often used to point to string data that is about to be copied to ES:DI.
- ES:DI (ES is Extra Segment, DI is Destination Index) is typically used to point to the destination for a string copy, as mentioned above.
The Intel 80386 featured three operating modes: real mode, protected mode and virtual 8086 mode. The protected mode, which debuted in the 80286, was extended to allow the 80386 to address up to 4 GB of memory. The all-new virtual 8086 mode (VM86) made it possible to run one or more real mode programs in a protected environment which largely emulated real mode, though some programs were not compatible (typically as a result of memory addressing tricks or using unspecified opcodes).
The 32-bit flat memory model of the 80386's extended protected mode may be the most important feature change for the x86 processor family until AMD released x86-64 in 2003, as it helped drive large scale adoption of Windows 3.1 (which relied on protected mode) since Windows could now run many applications at once, including DOS applications, by using virtual memory and simple multitasking.
Execution modes
The x86 processors support five modes of operation for x86 code: real mode, protected mode, long mode, virtual 8086 mode, and system management mode (SMM), in which some instructions are available and others are not. A 16-bit subset of instructions is available on the 16-bit x86 processors, which are the 8086, 8088, 80186, 80188, and 80286. These instructions are available in real mode on all x86 processors; in 16-bit protected mode (80286 onwards), additional instructions relating to protected mode are available. On the 80386 and later, 32-bit instructions (including later extensions) are also available in all modes, including real mode; on these CPUs, V86 mode and 32-bit protected mode are added, with additional instructions provided in these modes to manage their features. SMM, with some of its own special instructions, is available on some Intel i386SL, i486 and later CPUs. Finally, in long mode (AMD Opteron onwards), 64-bit instructions, and more registers, are also available. The instruction set is similar in each mode but memory addressing and word size vary, requiring different programming strategies.
The modes in which x86 code can be executed are:
- Real mode (16-bit)
  - 20-bit segmented memory address space (meaning that only 1 MB of memory can be addressed, or slightly more through the high memory area (HMA) on the 80286 and later), direct software access to peripheral hardware, and no concept of memory protection or multitasking at the hardware level. Computers that use BIOS start up in this mode.
- Protected mode (16-bit and 32-bit)
  - Expands addressable physical memory to 16 MB and addressable virtual memory to 1 GB. Provides privilege levels and protected memory, which prevents programs from corrupting one another. 16-bit protected mode (used during the end of the DOS era) used a complex, multi-segmented memory model. 32-bit protected mode uses a simple, flat memory model.
- Long mode (64-bit)
  - Mostly an extension of the 32-bit (protected mode) instruction set, but unlike the 16-to-32-bit transition, many instructions were dropped in the 64-bit mode. Pioneered by AMD.
- Virtual 8086 mode (16-bit)
  - A special hybrid operating mode that allows real mode programs and operating systems to run while under the control of a protected mode supervisor operating system.
- System Management Mode (16-bit)
  - Handles system-wide functions like power management, system hardware control, and proprietary OEM designed code. It is intended for use only by system firmware. All normal execution, including the operating system, is suspended. An alternate software system (which usually resides in the computer's firmware, or a hardware-assisted debugger) is then executed with high privileges.
Switching modes
The processor runs in real mode immediately after power on, so an operating system kernel, or other program, must explicitly switch to another mode if it wishes to run in anything but real mode. Switching modes is accomplished by modifying certain bits of the processor's control registers after some preparation, and some additional setup may be required after the switch.
Examples
With a computer running legacy BIOS, the BIOS and the boot loader run in real mode. The 64-bit operating system kernel checks and switches the CPU into long mode and then starts new kernel-mode threads running 64-bit code.
With a computer running UEFI, the UEFI firmware (except CSM and legacy Option ROM), the UEFI boot loader and the UEFI operating system kernel all run in Long mode.
Instruction types
In general, the features of the modern x86 instruction set are:
- A compact encoding
- Variable length and alignment independent (encoded as little endian, as is all data in the x86 architecture)
- Mainly one-address and two-address instructions, that is to say, the first operand is also the destination.
- Memory operands as both source and destination are supported (frequently used to read/write stack elements addressed using small immediate offsets).
- Both general and implicit register usage; although all seven (counting ebp) general registers in 32-bit mode, and all fifteen (counting rbp) general registers in 64-bit mode, can be freely used as accumulators or for addressing, most of them are also implicitly used by certain (more or less) special instructions; affected registers must therefore be temporarily preserved (normally stacked), if active during such instruction sequences.
- Produces conditional flags implicitly through most integer ALU instructions.
- Supports various addressing modes including immediate, offset, and scaled index but not PC-relative, except jumps (introduced as an improvement in the x86-64 architecture).
- Includes floating-point instructions that operate on a stack of registers.
- Contains special support for atomic read-modify-write instructions (xchg, cmpxchg/cmpxchg8b, xadd, and integer instructions which combine with the lock prefix).
- SIMD instructions (instructions which perform the same operation in parallel on many operands encoded in adjacent cells of wider registers).
Stack instructions
The x86 architecture has hardware support for an execution stack mechanism. Instructions such as push, pop, call and ret are used with the properly set up stack to pass parameters, to allocate space for local data, and to save and restore call-return points. The ret size instruction is very useful for implementing space efficient (and fast) calling conventions where the callee is responsible for reclaiming stack space occupied by parameters.
When setting up a stack frame to hold local data of a recursive procedure there are several choices; the high level enter instruction (introduced with the 80186) takes a procedure-nesting-depth argument as well as a local size argument, and may be faster than more explicit manipulation of the registers (such as push bp ; mov bp, sp ; sub sp, size). Whether it is faster or slower depends on the particular x86-processor implementation as well as the calling convention used by the compiler, programmer or particular program code; most x86 code is intended to run on x86-processors from several manufacturers and on different technological generations of processors, which implies highly varying microarchitectures and microcode solutions as well as varying gate- and transistor-level design choices.
The full range of addressing modes (including immediate and base+offset), even for instructions such as push and pop, makes direct usage of the stack for integer, floating-point and address data simple, and keeps the ABI specifications and mechanisms relatively simple compared to some RISC architectures (which require more explicit call stack details).
Integer ALU instructions
x86 assembly has the standard mathematical operations, add, sub, neg, imul and idiv (for signed integers), with mul and div (for unsigned integers); the logical operators and, or, xor, not; bitshift arithmetic and logical, sal/sar (for signed integers), shl/shr (for unsigned integers); rotate with and without carry, rcl/rcr, rol/ror; and a complement of BCD arithmetic instructions, aaa, aad, daa and others.
Floating-point instructions
x86 assembly language includes instructions for a stack-based floating-point unit (FPU). The FPU was an optional separate coprocessor for the 8086 through the 80386, an on-chip option for the 80486 series, and has been a standard feature in every Intel x86 CPU since the Pentium. The FPU instructions include addition, subtraction, negation, multiplication, division, remainder, square roots, integer truncation, fraction truncation, and scale by power of two. The operations also include conversion instructions, which can load or store a value from memory in any of the following formats: binary-coded decimal, 32-bit integer, 64-bit integer, 32-bit floating-point, 64-bit floating-point or 80-bit floating-point (upon loading, the value is converted to the currently used floating-point mode). x86 also includes a number of transcendental functions, including sine, cosine, tangent, arctangent, exponentiation with the base 2 and logarithms to bases 2, 10, or e.
The stack register to stack register format of the instructions is usually fop st, st(n) or fop st(n), st, where st is equivalent to st(0), and st(n) is one of the 8 stack registers (st(0), st(1), ..., st(7)). Like the integers, the first operand is both the first source operand and the destination operand. fsubr and fdivr should be singled out as first swapping the source operands before performing the subtraction or division. The addition, subtraction, multiplication, division, store and comparison instructions include instruction modes that pop the top of the stack after their operation is complete. So, for example, faddp st(1), st performs the calculation st(1) = st(1) + st(0), then removes st(0) from the top of stack, thus making what was the result in st(1) the top of the stack in st(0).
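The pop-after-operation behaviour of faddp st(1), st can be pictured with a simplified model. The Python sketch below is illustrative only; it represents the register stack as a list whose element 0 is st(0) and ignores the fixed eight-register limit, tag bits and rounding of a real FPU. The function name faddp is borrowed from the mnemonic purely for readability.

```python
def faddp(stack: list, n: int = 1) -> None:
    """Model 'faddp st(n), st': st(n) = st(n) + st(0), then pop st(0).

    stack[0] plays the role of st(0), stack[1] of st(1), and so on.
    """
    if not 1 <= n < len(stack):
        raise IndexError("st(n) must name an occupied register other than st(0)")
    stack[n] = stack[n] + stack[0]
    stack.pop(0)  # the old st(1) now becomes st(0)


regs = [1.5, 2.0, 8.0]  # st(0), st(1), st(2)
faddp(regs, 1)
print(regs)  # [3.5, 8.0]
```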
SIMD instructions
Modern x86 CPUs contain SIMD instructions, which largely perform the same operation in parallel on many values encoded in a wide SIMD register. Various instruction technologies support different operations on different register sets, but taken as complete whole (from MMX to SSE4.2) they include general computations on integer or floating-point arithmetic (addition, subtraction, multiplication, shift, minimization, maximization, comparison, division or square root). So for example, paddw mm0, mm1 performs 4 parallel 16-bit (indicated by the w) integer adds (indicated by the padd) of mm0 values to mm1 and stores the result in mm0. Streaming SIMD Extensions or SSE also includes a floating-point mode in which only the very first value of the registers is actually modified (expanded in SSE2). Some other unusual instructions have been added including a sum of absolute differences (used for motion estimation in video compression, such as is done in MPEG) and a 16-bit multiply accumulation instruction (useful for software-based alpha-blending and digital filtering). SSE (since SSE3) and 3DNow! extensions include addition and subtraction instructions for treating paired floating-point values like complex numbers.
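The lane-wise behaviour of paddw can be sketched as follows. This Python model is an illustration, not a description of the hardware: it represents a 64-bit MMX register as a list of four 16-bit lanes, and the function name paddw is reused from the mnemonic for readability.

```python
def paddw(dst: list, src: list) -> list:
    """Add four 16-bit lanes independently, wrapping modulo 2**16."""
    if len(dst) != 4 or len(src) != 4:
        raise ValueError("a 64-bit register holds exactly four 16-bit lanes")
    # Each lane is added on its own; a carry never spills into its neighbour.
    return [(a + b) & 0xFFFF for a, b in zip(dst, src)]


mm0 = [1, 2, 0xFFFF, 40000]
mm1 = [10, 20, 1, 40000]
print(paddw(mm0, mm1))  # [11, 22, 0, 14464]
```

The third and fourth lanes show the wraparound: results that do not fit in 16 bits are truncated rather than carried into the adjacent lane.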
These instruction sets also include numerous fixed sub-word instructions for shuffling, inserting and extracting the values around within the registers. In addition there are instructions for moving data between the integer registers and XMM (used in SSE)/FPU (used in MMX) registers.
Memory instructions
The x86 processor also includes complex addressing modes for addressing memory with an immediate offset, a register, a register with an offset, a scaled register with or without an offset, and a register with an optional offset and another scaled register. So for example, one can encode mov eax, [Table + ebx + esi*4] as a single instruction which loads 32 bits of data from the address computed as (Table + ebx + esi * 4) offset from the ds selector, and stores it to the eax register. In general x86 processors can load and use memory matched to the size of any register it is operating on. (The SIMD instructions also include half-load instructions.)
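The offset computed by such an addressing mode can be written out explicitly. The Python sketch below is a minimal model under stated assumptions: the function name effective_address is hypothetical, the value 0x1000 used for Table is made up for the example, and segment base addition is left out.

```python
def effective_address(base: int, index: int, scale: int,
                      displacement: int, bits: int = 32) -> int:
    """Compute base + index*scale + displacement, truncated to the address size."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 allows only scale factors 1, 2, 4 and 8")
    return (base + index * scale + displacement) & ((1 << bits) - 1)


# mov eax, [Table + ebx + esi*4] with Table = 0x1000 (assumed), ebx = 0x20, esi = 3
print(hex(effective_address(base=0x20, index=3, scale=4, displacement=0x1000)))  # 0x102c
```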
Most 2-operand x86 instructions, including integer ALU instructions, use a standard "addressing mode byte"[15] often called the MOD-REG-R/M byte.[16][17][18] Many 32-bit x86 instructions also have a SIB addressing mode byte that follows the MOD-REG-R/M byte.[19][20][21][22][23]
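The bit layout of these two bytes can be illustrated by splitting them into fields. The Python sketch below is a simplified aid, not a decoder: the helper names split_modrm and split_sib are hypothetical, and the special cases of the encoding (such as mod/rm combinations that select a displacement-only address) are ignored.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_modrm(byte: int) -> tuple:
    """Return the (mod, reg, rm) fields: 2 bits, 3 bits, 3 bits."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def split_sib(byte: int) -> tuple:
    """Return (scale factor, index register number, base register number)."""
    return 1 << ((byte >> 6) & 0b11), (byte >> 3) & 0b111, byte & 0b111


# 89 D8 encodes 'mov eax, ebx': mod=3 (register direct), reg=ebx, rm=eax.
mod, reg, rm = split_modrm(0xD8)
print(mod, REG32[reg], REG32[rm])  # 3 ebx eax

# A SIB byte of B3 describes [ebx + esi*4].
scale, index, base = split_sib(0xB3)
print(scale, REG32[index], REG32[base])  # 4 esi ebx
```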
In principle, because the instruction opcode is separate from the addressing mode byte, those instructions are orthogonal because any of those opcodes can be mixed-and-matched with any addressing mode. However, the x86 instruction set is generally considered non-orthogonal because most dyadic operations cannot operate memory to memory, other opcodes have some fixed addressing mode (they have no addressing mode byte), and every register has a preferred use.[23][24]
The x86 instruction set includes string load, store, move, scan and compare instructions (lods, stos, movs, scas and cmps) which perform each operation to a specified size (b for 8-bit byte, w for 16-bit word, d for 32-bit double word) and then increment or decrement (depending on DF, the direction flag) the implicit address register (si for lods, di for stos and scas, and both for movs and cmps). For the load, store and scan operations, the implicit target/source/comparison register is the al, ax or eax register (depending on size). The implicit segment registers used are ds for si and es for di. The cx or ecx register is used as a decrementing counter, and the operation stops when the counter reaches zero or, for scans and comparisons, when equality or inequality is detected. Unfortunately, over the years the performance of some of these instructions became neglected and in certain cases it is possible to get faster results by coding using more elemental instructions. Intel and AMD have refreshed some of the instructions though, and as of 2025 some have very respectable performance.
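The way a repeated string move updates its implicit registers can be modelled in software. The Python sketch below is an illustration under simplifying assumptions: memory is a single flat bytearray (so the ds and es segments are ignored), and the function name rep_movsb is taken from the instruction it imitates.

```python
def rep_movsb(memory: bytearray, si: int, di: int, cx: int, df: int = 0) -> tuple:
    """Model 'rep movsb': copy cx bytes from [si] to [di].

    Returns the final (si, di, cx), as the real registers would be left.
    """
    step = -1 if df else 1  # DF=1 walks downwards through memory
    while cx != 0:
        memory[di] = memory[si]
        si += step
        di += step
        cx -= 1
    return si, di, cx


mem = bytearray(b"HELLO.....")
print(rep_movsb(mem, si=0, di=5, cx=5))  # (5, 10, 0)
print(mem)  # bytearray(b'HELLOHELLO')
```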
The stack is a region of memory and an associated stack pointer, which points to the last item pushed on the stack. The stack pointer is decremented before items are added, push, and incremented after things are removed, pop. In 16-bit mode, this implicit stack pointer is addressed as SS:[SP], in 32-bit mode it is SS:[ESP], and in 64-bit mode it is [RSP]. The stack pointer points to the last value that was stored, under the assumption that its size will match the operating mode of the processor (i.e., 16, 32, or 64 bits) to match the default width of the push/pop/call/ret instructions. Also included are the instructions enter and leave which reserve and remove data from the top of the stack while setting up a stack frame pointer in bp/ebp/rbp. However, direct setting, or addition and subtraction to the sp/esp/rsp register is also supported, so the enter/leave instructions are generally unnecessary.
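The decrement-before-store and load-before-increment ordering can be seen in a small model. The Python sketch below assumes 16-bit mode and a made-up 256-byte stack region; the class name Stack16 is hypothetical.

```python
class Stack16:
    """Toy model of a 16-bit stack that grows towards lower addresses."""

    def __init__(self, size: int = 0x100):
        self.mem = bytearray(size)
        self.sp = size  # empty stack: SP sits just past the end of the region

    def push(self, value: int) -> None:
        self.sp -= 2  # the stack pointer is decremented first
        self.mem[self.sp:self.sp + 2] = (value & 0xFFFF).to_bytes(2, "little")

    def pop(self) -> int:
        value = int.from_bytes(self.mem[self.sp:self.sp + 2], "little")
        self.sp += 2  # the stack pointer is incremented afterwards
        return value


s = Stack16()
s.push(0x1234)
print(hex(s.sp))     # 0xfe
print(hex(s.pop()))  # 0x1234
print(hex(s.sp))     # 0x100
```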
This code is the beginning of a function typical for a high-level language when compiler optimisation is turned off for ease of debugging:
push rbp ; Save the calling function’s stack frame pointer (rbp register)
mov rbp, rsp ; Make a new stack frame below our caller’s stack
sub rsp, 32 ; Reserve 32 bytes of stack space for this function’s local variables.
; Local variables will be below rbp and can be referenced relative to rbp,
; again best for ease of debugging, but for best performance rbp will not
; be used at all, and local variables would be referenced relative to rsp
; because, apart from the code saving, rbp then is free for other uses.
… … ; However, if rbp is altered here, its value should be preserved for the caller.
mov [rbp-8], rdx ; Example of writing to a local variable (by its memory location) from register rdx
...is functionally equivalent to just:
enter 32, 0
Other instructions for manipulating the stack include pushfd (32-bit) / pushfq (64-bit) and popfd/popfq for storing and retrieving the EFLAGS (32-bit) / RFLAGS (64-bit) register.
Values for a SIMD load or store are assumed to be packed in adjacent positions for the SIMD register and will align them in sequential little-endian order. Some SSE load and store instructions require 16-byte alignment to function properly. The SIMD instruction sets also include "prefetch" instructions which perform the load but do not target any register, used for cache loading. The SSE instruction sets also include non-temporal store instructions which will perform stores straight to memory without performing a cache allocate if the destination is not already cached (otherwise it will behave like a regular store.)
Most generic integer and floating-point (but no SIMD) instructions can use one parameter as a complex address as the second source parameter. Integer instructions can also accept one memory parameter as a destination operand.
Program flow
The x86 assembly has an unconditional jump operation, jmp, which can take an immediate address, a register or an indirect address as a parameter (note that most RISC processors only support a link register or short immediate displacement for jumping).
Also supported are several conditional jumps, including jz (jump on zero), jnz (jump on non-zero), jg (jump on greater than, signed), jl (jump on less than, signed), ja (jump on above/greater than, unsigned), jb (jump on below/less than, unsigned). These conditional operations are based on the state of specific bits in the (E)FLAGS register. Many arithmetic and logic operations set, clear or complement these flags depending on their result. The comparison cmp (compare) and test instructions set the flags as if they had performed a subtraction or a bitwise AND operation, respectively, without altering the values of the operands. There are also instructions such as clc (clear carry flag) and cmc (complement carry flag) which work on the flags directly. Floating point comparisons are performed via fcom or ficom instructions which eventually have to be converted to integer flags.
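How cmp sets the flags, and how the conditional jumps read them, can be sketched in software. The Python model below is illustrative: the helper names cmp_flags and jump_taken are hypothetical, only four flags are modelled, and 32-bit operands are assumed by default.

```python
def cmp_flags(a: int, b: int, bits: int = 32) -> dict:
    """Return the flags that 'cmp a, b' would produce (it computes a - b)."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    result = (a - b) & mask
    return {
        "ZF": result == 0,
        "CF": a < b,                                  # unsigned borrow
        "SF": bool(result & sign),
        "OF": bool((a ^ b) & (a ^ result) & sign),    # signed overflow
    }


def jump_taken(mnemonic: str, f: dict) -> bool:
    """Evaluate the condition tested by a few conditional jumps."""
    conditions = {
        "jz": f["ZF"],
        "jnz": not f["ZF"],
        "jb": f["CF"],
        "ja": not f["CF"] and not f["ZF"],
        "jl": f["SF"] != f["OF"],
        "jg": not f["ZF"] and f["SF"] == f["OF"],
    }
    return conditions[mnemonic]


# 0xFFFFFFFF is -1 when read as signed but 4294967295 when read as unsigned.
flags = cmp_flags(0xFFFFFFFF, 1)
print(jump_taken("jl", flags))  # True
print(jump_taken("ja", flags))  # True
```

The same bit pattern is therefore "less than 1" for the signed jump and "above 1" for the unsigned one, which is why separate mnemonics exist.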
Each jump operation has three different forms, depending on the size of the operand. A short jump uses an 8-bit signed operand, which is an offset relative to the address of the instruction that follows the jump. A near jump is similar to a short jump but uses a 16-bit signed operand (in real mode or 16-bit protected mode) or a 32-bit signed operand (in 32-bit protected mode only). A far jump is one that uses the full segment base:offset value as an absolute address. There are also indirect and indexed forms of each of these.
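The target of a short jump can be worked out by hand. x86 encodes the displacement relative to the address of the following instruction, so the two-byte length of the jump itself enters the calculation. The Python sketch below is illustrative; the function name short_jump_target is hypothetical, and wrapping of the instruction pointer is not modelled.

```python
def short_jump_target(instr_address: int, rel8: int, instr_length: int = 2) -> int:
    """Return the destination of a short jump located at instr_address."""
    if not 0 <= rel8 <= 0xFF:
        raise ValueError("rel8 is a single byte")
    displacement = rel8 - 0x100 if rel8 & 0x80 else rel8  # sign-extend 8 bits
    return instr_address + instr_length + displacement


# 'EB FE' at address 0x100 jumps back to itself (an endless loop).
print(hex(short_jump_target(0x100, 0xFE)))  # 0x100
# 'EB 05' at address 0x100 skips five bytes past the jump.
print(hex(short_jump_target(0x100, 0x05)))  # 0x107
```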
In addition to the simple jump operations, there are the call (call a subroutine) and ret (return from subroutine) instructions. Before transferring control to the subroutine, call pushes the segment offset address of the instruction following the call onto the stack; ret pops this value off the stack, and jumps to it, effectively returning the flow of control to that part of the program. In the case of a far call, the segment base is pushed before the offset; far ret pops the offset and then the segment base to return.
There are also two related instructions, int (interrupt) and iret (return from interrupt). The int instruction saves the current (E)FLAGS register value on the stack, then performs a far call, except that instead of an address, it uses an interrupt vector, an index into a table of interrupt handler addresses. Typically, the interrupt handler saves all other CPU registers it uses, unless they are used to return the result of an operation to the calling program (in software interrupts). The matching return from interrupt instruction is iret, which restores the flags after returning. Software interrupts of the type described above are used by some operating systems for system calls, and can also be used in debugging hardware interrupt handlers. Hardware interrupts are triggered by external hardware events, and their handlers must preserve all register values as the state of the currently executing program is unknown. In protected mode, interrupts may be set up by the OS to trigger a task switch, which will automatically save all registers of the active task.
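In real mode, the table of handler addresses is the interrupt vector table at the start of memory, where each vector occupies four bytes: a 16-bit offset followed by a 16-bit segment. The lookup can be sketched as below. The Python code is a model only; the function name read_vector is hypothetical and the handler address F000:1234 is invented for the example.

```python
def read_vector(memory: bytes, vector: int) -> tuple:
    """Return the (segment, offset) stored for an interrupt vector in real mode."""
    if not 0 <= vector <= 0xFF:
        raise ValueError("there are 256 interrupt vectors")
    base = vector * 4  # four bytes per table entry
    offset = int.from_bytes(memory[base:base + 2], "little")
    segment = int.from_bytes(memory[base + 2:base + 4], "little")
    return segment, offset


ivt = bytearray(1024)                             # 256 entries of 4 bytes
ivt[0x84:0x88] = bytes([0x34, 0x12, 0x00, 0xF0])  # made-up handler for int 21h
segment, offset = read_vector(ivt, 0x21)
print(hex(segment), hex(offset))  # 0xf000 0x1234
```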
Examples
The following examples use the so-called Intel-syntax flavor as used by the assemblers Microsoft MASM, NASM and many others. (Note: There is also an alternative AT&T-syntax flavor where the order of source and destination operands are swapped, among many other differences.)[25]
"Hello world!" program for MS-DOS in MASM-style assembly
This program uses the software interrupt instruction int 21h to call the MS-DOS operating system for output to the display; other samples use libc's C printf() routine to write to stdout. Note that the first example uses 16-bit mode as on an Intel 8086. The second example is Intel 386 code in 32-bit mode. Modern code will be in 64-bit mode.[26]
.model small
.stack 100h
.data
msg db 'Hello world!$'
.code
start:
mov ah, 09h ; Sets 8-bit register ‘ah’, the high byte of register ax, to 9, to
; select a sub-function number of an MS-DOS routine called below
; via the software interrupt int 21h to display a message
lea dx, msg ; Takes the address of msg, stores the address in 16-bit register dx
int 21h ; Various MS-DOS routines are callable by the software interrupt 21h
; Our required sub-function was set in register ah above
mov ax, 4C00h ; Sets register ax to the sub-function number for MS-DOS’s software
; interrupt int 21h for the service ‘terminate program’.
int 21h ; Calling this MS-DOS service never returns, as it ends the program.
end start
"Hello world!" program for Windows in MASM and NASM style assembly
| MASM | NASM | Description |
|---|---|---|
; requires /coff switch on 6.15 and earlier versions
.386
.model small,c
.stack 1000h
|
; Image base = 0x00400000
%define RVA(x) (x-0x00400000)
|
Preamble. MASM requires defining the address model and stack size. |
.data
msg db "Hello world!",0
|
section .data
msg db "Hello world!", 0
|
Data section. We use the db (define byte) pseudo-op to define a string. |
.code
includelib libcmt.lib
includelib libvcruntime.lib
includelib libucrt.lib
includelib legacy_stdio_definitions.lib
extrn printf:near
extrn exit:near
public main
main proc
push offset msg
call printf
push 0
call exit
main endp
end
|
section .text
push dword msg
call dword [printf]
push byte +0
call dword [exit]
ret
section .idata
dd RVA(msvcrt_LookupTable)
dd -1
dd 0
dd RVA(msvcrt_string)
dd RVA(msvcrt_imports)
times 5 dd 0 ; ends the descriptor table
msvcrt_string dd "msvcrt.dll", 0
msvcrt_LookupTable:
dd RVA(msvcrt_printf)
dd RVA(msvcrt_exit)
dd 0
msvcrt_imports:
printf dd RVA(msvcrt_printf)
exit dd RVA(msvcrt_exit)
dd 0
msvcrt_printf:
dw 1
dw "printf", 0
msvcrt_exit:
dw 2
dw "exit", 0
dd 0
|
The code (.text section) and the import table. In NASM the import table is manually constructed, while in the MASM example directives are used to simplify the process. |
"Hello world!" program for Linux in AT&T and NASM assembly
| AT&T (GNU as) | Intel (NASM) | Description |
|---|---|---|
.data
|
section .data
|
Like in the Windows example, .data is the section for initialized data.
|
str: .ascii "Hello, world!\n"
|
str: db 'Hello world!', 0Ah
|
Define a string of text containing "Hello, world!" and then a new line (\n, which is 0x0A). Bind the label "str" to the address of the defined string.
|
str_len = . - str
|
str_len: equ $ - str
|
Calculate the length of str. . means "here" in gas and $ means the same in nasm. By subtracting "str" from "here", one gets the length of the previously defined string.
|
.text
|
section .text
|
Like in the Windows example, .text is the section for program code.
|
.globl _start
|
global _start
|
Export the _start symbol to the global scope so that it can be "seen" by the linker. |
_start:
|
_start:
|
Define a label called _start, to which we will write our subroutine. The name _start, by Linux convention, defines the entry point.
|
movl $4, %eax
movl $1, %ebx
movl $str, %ecx
movl $str_len, %edx
|
mov eax, 4
mov ebx, 1
mov ecx, str
mov edx, str_len
|
Prepare a system call. EAX=4 requests the "sys_write" call on Linux x86. EBX=1 means "stdout" for sys_write. ECX holds the address of the string to write, and EDX holds the number of bytes to write. This is equivalent to the libc-wrapped version write(1, str, str_len).
|
int $0x80
|
int 80h
|
On x86, the system interrupt "80h" is used for invoking a system call according to the values of eax, ebx, ecx, and edx. |
movl $1, %eax
movl $0, %ebx
int $0x80
|
mov eax, 1
mov ebx, 0
int 80h
|
Load another system call, then call it with INT 80h: EAX=1 is sys_exit, and EBX for sys_exit holds the return value. A return value of 0 means a normal exit. In C syntax, _exit(0);.
|
Note for NASM:
; This program runs in 32-bit protected mode.
; build: nasm -f elf -F stabs name.asm
; link: ld -o name name.o
;
; In 64-bit long mode you can use 64-bit registers (e.g. rax instead of eax, rbx instead of ebx, etc.)
; Also change "-f elf " for "-f elf64" in build command.
; For 64-bit long mode, "lea rcx, str" would be the address of the message, note 64-bit register rcx.
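For comparison, the description column states that the assembly sequence is equivalent to the libc-wrapped call write(1, str, str_len). The sketch below makes the same system call from Python through the standard `os` module; the names `MSG`, `MSG_LEN` and `write_hello` are invented for this illustration.

```python
import os

MSG = b"Hello, world!\n"      # same bytes as the str label in the AT&T column
MSG_LEN = len(MSG)            # plays the role of str_len (". - str" / "$ - str")


def write_hello(fd=1):
    """Perform one write() system call on fd (1 is stdout) and return the byte count."""
    return os.write(fd, MSG)
```

Calling `write_hello()` and then `os._exit(0)` would mirror the two system calls in the table, sys_write followed by sys_exit.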
"Hello world!" program for Linux in NASM style assembly using the C standard library
;
; This program runs in 32-bit protected mode.
; gcc links the standard-C library by default
; build: nasm -f elf -F stabs name.asm
; link: gcc -o name name.o
;
; In 64-bit long mode you can use 64-bit registers (e.g. rax instead of eax, rbx instead of ebx, etc.)
; Also change "-f elf " for "-f elf64" in build command.
;
global main ; ‘main’ must be defined, as the program is linked
; against the C Standard Library
extern printf ; declares the use of external symbol, as printf
; printf is declared in a different object-module.
; The linker resolves this symbol later.
segment .data ; section for initialized data
string db 'Hello world!', 0Ah, 0 ; message string ending with a newline char (10
; decimal) and the zero byte ‘NUL’ terminator
; ‘string’ now refers to the starting address
; at which 'Hello world!' is stored.
segment .text
main:
push string ; Push the address of ‘string’ onto the stack.
; This reduces esp by 4 bytes before storing
; the 4-byte address ‘string’ into memory at
; the new esp, the new bottom of the stack.
; This will be an argument to printf()
call printf ; calls the C printf() function.
add esp, 4 ; Increases the stack-pointer by 4 to put it back
; to where it was before the ‘push’, which
; reduced it by 4 bytes.
ret ; Return to our caller.
Because the C runtime is used, we define a main() function as the C runtime expects. Instead of calling exit, we simply return from the main function to have the runtime perform the clean-up.
"Hello world!" program for 64-bit mode Linux in NASM style assembly
This example is in modern 64-bit mode.
; build: nasm -f elf64 -F dwarf hello.asm
; link: ld -o hello hello.o
DEFAULT REL ; use RIP-relative addressing modes by default, so [foo] = [rel foo]
SECTION .rodata ; read-only data should go in the .rodata section on GNU/Linux, like .rdata on Windows
Hello: db "Hello world!", 10 ; Ending with a byte 10 = newline (ASCII LF)
len_Hello: equ $-Hello ; Get NASM to calculate the length as an assembly-time constant
; the ‘$’ symbol means ‘here’. write() takes a length so that
; a zero-terminated C-style string isn't needed.
; It would be for C puts()
SECTION .text
global _start
_start:
mov eax, 1 ; __NR_write syscall number from Linux asm/unistd_64.h (x86_64)
mov edi, 1 ; int fd = STDOUT_FILENO
lea rsi, [rel Hello] ; x86-64 uses RIP-relative LEA to put static addresses into regs
mov rdx, len_Hello ; size_t count = len_Hello
syscall ; write(1, Hello, len_Hello); call into the kernel to actually do the system call
;; return value in RAX. RCX and R11 are also overwritten by syscall
mov eax, 60 ; __NR_exit call number (x86_64) is stored in register eax.
xor edi, edi ; This zeros edi and also rdi.
; This xor-self trick is the preferred common idiom for zeroing
; a register, and is always by far the fastest method.
; When a 32-bit value is stored into eg edx, the high bits 63:32 are
; automatically zeroed too in every case. This saves you having to set
; the bits with an extra instruction, as this is a case very commonly
; needed, for an entire 64-bit register to be filled with a 32-bit value.
; This sets our routine’s exit status = 0 (exit normally)
syscall ; _exit(0)
Running it under strace verifies that no extra system calls are made in the process. The printf version would make many more system calls to initialize libc and do dynamic linking. But this is a static executable because we linked using ld without -pie or any shared libraries; the only instructions that run in user-space are the ones you provide.
$ strace ./hello > /dev/null # without a redirect, your program's stdout is mixed with strace's logging on stderr. Which is normally fine
execve("./hello", ["./hello"], 0x7ffc8b0b3570 /* 51 vars */) = 0
write(1, "Hello world!\n", 13) = 13
exit(0) = ?
+++ exited with 0 +++
Using the flags register
Flags are heavily used for comparisons in the x86 architecture. When a comparison is made between two values, the CPU sets the relevant flag or flags. Following this, conditional jump instructions can be used to check the flags and branch to code that should run, e.g.:
cmp eax, ebx
jne do_something
; ...
do_something:
; do something here
Aside from compare instructions, many arithmetic and other instructions set bits in the flags register; examples include sub, test and add. Common combinations such as cmp + conditional jump are internally ‘fused’ (‘macro fusion’) into one single micro-instruction (μ-op) on many modern processors and are fast provided the processor correctly predicts which way the conditional jump will go, jump vs continue.
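To make the relationship between a comparison and the conditional jumps concrete, here is a small Python model of the flags produced by a 32-bit cmp. It is a sketch for illustration only; the helper names (`cmp32`, `jne_taken`, `jl_taken`, `jb_taken`) are invented here, and only four of the status flags are modelled.

```python
MASK32 = 0xFFFFFFFF


def cmp32(a, b):
    """Return the flags that `cmp a, b` (computing a - b) sets for 32-bit operands."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    sign = lambda x: x >> 31                 # top bit of a 32-bit value
    return {
        "ZF": int(result == 0),              # operands were equal
        "SF": sign(result),                  # sign bit of the result
        "CF": int(a < b),                    # unsigned borrow
        "OF": int(sign(a) != sign(b) and sign(result) != sign(a)),  # signed overflow
    }


def jne_taken(flags):
    return flags["ZF"] == 0                  # jne/jnz: jump if not equal


def jl_taken(flags):
    return flags["SF"] != flags["OF"]        # jl: signed less-than


def jb_taken(flags):
    return flags["CF"] == 1                  # jb: unsigned below
```

The same flags answer both signed and unsigned questions: comparing 0x80000000 with 1 makes `jl_taken` true (the first operand is negative when read as signed) while `jb_taken` stays false (it is larger when read as unsigned).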
The flags register is also used in the x86 architecture to turn on and off certain features or execution modes. For example, to disable all maskable interrupts, you can use the instruction:
cli
The flags register can also be directly accessed. The low 8 bits of the flag register can be loaded into ah using the lahf instruction. The entire flags register can also be moved on and off the stack using the instructions pushfd/pushfq, popfd/popfq, int (including into) and iret.
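As an illustration of what lahf places in ah, the following Python sketch packs and unpacks the low byte of the flags register using the documented bit positions (CF bit 0, PF bit 2, AF bit 4, ZF bit 6, SF bit 7, with bit 1 always reading as 1). The function names are made up for this example.

```python
def lahf(cf=0, pf=0, af=0, zf=0, sf=0):
    """Build the byte that lahf would load into AH from the given flag values."""
    return (sf << 7) | (zf << 6) | (af << 4) | (pf << 2) | (1 << 1) | cf


def decode_ah(ah):
    """Split an AH value produced by lahf back into individual flags."""
    return {
        "CF": ah & 1,
        "PF": (ah >> 2) & 1,
        "AF": (ah >> 4) & 1,
        "ZF": (ah >> 6) & 1,
        "SF": (ah >> 7) & 1,
    }


print(hex(lahf(zf=1)))
print(decode_ah(0x83))
```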
The x87 floating point maths subsystem also has its own independent ‘flags’-type register, the FPU status word. In the 1990s it was an awkward and slow procedure to access the flag bits in this register, but on modern processors there are ‘compare two floating point values’ instructions (such as fcomi) that set the normal flags directly, so they can be used with the normal conditional jump/branch instructions without any intervening steps.
Using the instruction pointer register
The instruction pointer is called ip in 16-bit mode, eip in 32-bit mode, and rip in 64-bit mode. The instruction pointer register points to the address of the next instruction that the processor will attempt to execute. It cannot be directly accessed in 16-bit or 32-bit mode, but a sequence like the following can be written to put the address of next_line into eax (32-bit code):
call next_line
next_line:
pop eax
Writing to the instruction pointer is simple: a jmp instruction stores the given target address into the instruction pointer, so, for example, the following instruction will put the contents of rax into rip (64-bit code):
jmp rax
In 64-bit mode, instructions can reference data relative to the instruction pointer, so there is less need to copy the value of the instruction pointer to another register.
References
[edit]- ^ "Intel 8008 (i8008) microprocessor family". www.cpu-world.com. Retrieved 2021-03-25.
- ^ "Intel 8008". CPU MUSEUM - MUSEUM OF MICROPROCESSORS & DIE PHOTOGRAPHY. Retrieved 2021-03-25.
- ^ a b c "Intel 8008 OPCODES". www.pastraiser.com. Retrieved 2021-03-25.
- ^ a b c d e Narayam, Ram (2007-10-17). "Linux assemblers: A comparison of GAS and NASM". IBM. Archived from the original on October 3, 2013. Retrieved 2008-07-02.
- ^ "The Creation of Unix". Archived from the original on April 2, 2014.
- ^ Hyde, Randall. "Which Assembler is the Best?". Archived from the original on 2007-10-18. Retrieved 2008-05-18.
- ^ "GNU Assembler News, v2.1 supports Intel syntax". 2008-04-04. Retrieved 2008-07-02.[permanent dead link]
- ^ "i386-Bugs (Using as)". Binutils documentation. Retrieved 15 January 2020.
- ^ "Intel 8080 Assembly Language Programming Manual" (PDF). Retrieved 12 May 2023.
- ^ "NASM - The Netwide Assembler". www.nasm.us.
- ^ "Statements (Using as)". sourceware.org.
- ^ "Pseudo Ops (Using as) :: Assembler Directives". sourceware.org.
- ^ "Assembler Directives - x86 Assembly Language Reference Manual". docs.oracle.com.
- ^ Mueller, Scott (March 24, 2006). "P2 (286) Second-Generation Processors". Upgrading and Repairing PCs, 17th Edition (Book) (17 ed.). Que. ISBN 0-7897-3404-4. Retrieved 2017-12-06.
- ^ Curtis Meadow. "Encoding of 8086 Instructions".
- ^ Igor Kholodov. "6. Encoding x86 Instruction Operands, MOD-REG-R/M Byte".
- ^ "Encoding x86 Instructions".
- ^ Michael Abrash. "Zen of Assembly Language: Volume I, Knowledge". "Chapter 7: Memory Addressing". Section "mod-reg-rm Addressing" Archived 2022-03-04 at the Wayback Machine.
- ^ Intel 80386 Reference Programmer's Manual. "17.2.1 ModR/M and SIB Bytes"
- ^ "X86-64 Instruction Encoding: ModR/M and SIB bytes"
- ^ "Figure 2-1. Intel 64 and IA-32 Architectures Instruction Format".
- ^ "x86 Addressing Under the Hood".
- ^ a b Stephen McCamant. "Manual and Automated Binary Reverse Engineering".
- ^ "X86 Instruction Wishlist".
- ^ Peter Cordes (18 December 2011). "NASM (Intel) versus AT&T Syntax: what are the advantages?". Stack Overflow.
- ^ "I just started Assembly". daniweb.com. 2008.
Further reading
Books
- Jorgensen, Ed (May 2018). x86-64 Assembly Language Programming with Ubuntu (PDF) (1.0.97 ed.). p. 367.
X86 assembly language
x86 assembly code is written in either Intel syntax (e.g., mov eax, ebx), as used in official Intel documentation, or AT&T syntax, common in Unix-like systems and GAS (e.g., movl %ebx, %eax), which appends size suffixes to mnemonics and prefixes registers with percent signs.[1] Despite its complexity due to backward compatibility and irregular instruction encodings, x86 assembly remains vital for embedded systems, reverse engineering, and high-performance computing where higher-level languages fall short.[1]
Overview
History and Evolution
The x86 assembly language originated with the Intel 8086 microprocessor, introduced in 1978 as a 16-bit complex instruction set computing (CISC) architecture designed to support advanced applications and serve as a template for future processors.[7] Developed in just 18 months, the 8086 featured microcode implementation and became the foundation for the x86 family, powering the IBM PC released in 1981, which used the closely related 8088 variant and established widespread software and hardware compatibility standards.[7] This integration into the IBM PC ecosystem ensured the persistence of x86 despite the rise of reduced instruction set computing (RISC) alternatives, as backward compatibility drove industry adoption and locked in a vast software base.[8]

The architecture evolved significantly with the Intel 80286 in 1982, which introduced protected mode to enable multitasking and memory protection, enhancing system reliability for emerging multi-user environments.[9] This was followed by the Intel 80386 in 1985, marking the shift to 32-bit processing with support for virtual memory and a flat memory model, allowing larger address spaces and improved efficiency for operating systems like Windows.[10] The Pentium series, launched in 1993, advanced the design with superscalar execution for parallel instruction processing, dropping the "86" suffix while maintaining compatibility to sustain the PC market's growth.[10]

A pivotal extension occurred in 2003 with the introduction of 64-bit addressing via AMD's AMD64 architecture, which Intel adopted as Intel 64 in 2004, enabling larger memory capacities and enhanced performance for data-intensive applications without breaking legacy support.[11] Key instruction set extensions further propelled x86's relevance: MMX in 1996 added multimedia acceleration to the Pentium MMX; SSE in 1999 with Pentium III introduced SIMD for vector operations; AVX in 2011 expanded vector widths to 256 bits for high-performance computing; and AVX-512 in 2016 provided 512-bit vectors optimized for AI and machine learning workloads.[10][12] In 2023, Intel announced AVX10, a converged instruction set incorporating AVX-512 features to simplify implementation across processors. These developments maintained x86's dominance by balancing innovation with the enduring IBM PC compatibility legacy.[7][13]
Key Characteristics and Usage
x86 assembly language is rooted in the Complex Instruction Set Computing (CISC) architecture, which supports a diverse array of instructions designed to perform complex operations in a single command, contrasting with the simpler, fixed-length instructions typical of Reduced Instruction Set Computing (RISC) designs.[1] This CISC approach enables x86 instructions to vary in length from 1 to 15 bytes, allowing for flexible encoding that optimizes for both common and specialized tasks while maintaining high code density.[14] A hallmark of the x86 architecture is its strong emphasis on backward compatibility, supporting execution in 16-bit, 32-bit, and 64-bit modes through mechanisms like compatibility mode in x86-64, which permits unmodified legacy applications to run alongside modern 64-bit software without requiring emulation.[1]

In practice, x86 assembly is primarily employed in domains demanding precise control and efficiency, such as kernel development where it facilitates low-level system calls and interrupt handling, device drivers for direct hardware interaction, and embedded systems constrained by resource limitations.[15] It also plays a key role in performance-critical applications like game engines, where optimized routines enhance rendering and physics simulations, and in compiler optimization through inline assembly embedded in higher-level languages like C/C++ to bypass generated code inefficiencies.[16]

Despite these strengths, x86 assembly presents challenges due to its inherent complexity, including variable instruction lengths and intricate addressing modes that can lead to programming errors and difficult debugging.[17] However, it offers significant advantages in code density, reducing program size compared to equivalent RISC implementations, and provides unparalleled direct hardware control, enabling fine-tuned access to CPU registers, memory, and peripherals for maximal performance.[18][15]

As of 2025, x86 remains the dominant architecture in desktops and servers, holding the majority market share powered by Intel and AMD processors.[19][20] Its relevance persists in security research, where assembly-level analysis uncovers vulnerabilities in low-level code, and in just-in-time (JIT) compilers for JavaScript engines like V8 and SpiderMonkey, which generate optimized x86 machine code to accelerate web applications while posing novel attack surfaces studied in defenses against JIT spraying and code reuse exploits.[21][22]
Syntax and Notation
Syntax Variants
x86 assembly language supports multiple syntax variants, each tailored to different assemblers and development environments, primarily differing in operand ordering, notation for registers and memory, and directive usage. The most prominent variants are Intel syntax, used by assemblers like Microsoft's MASM, and AT&T syntax, employed by the GNU Assembler (GAS).[23][24]

Intel syntax, as implemented in MASM, places the destination operand before the source (e.g., mov rax, rbx). Registers are denoted without prefixes (e.g., rax), memory addresses use square brackets (e.g., [rcx + r10 * 2 + 100h]), and data sizes are specified via qualifiers like DWORD PTR when ambiguous (e.g., mov eax, DWORD PTR [ecx]). Directives include .data for initialized data sections and .code for executable code, with EQU for defining constants (e.g., myvar EQU 100). Comments begin with a semicolon (;). This syntax is prevalent in Windows development tools due to its integration with Microsoft ecosystems.[23]
In contrast, AT&T syntax in GAS reverses the operand order, placing sources before destinations (e.g., movl %esi, %ebx), and requires explicit size suffixes on mnemonics (e.g., movl for 32-bit, movb for 8-bit). Registers are prefixed with % (e.g., %eax), immediates with $ (e.g., movb $10, %al), and memory operands use parentheses with an offset-base format (e.g., 4(%esp)). Directives such as .data and .text organize sections, and comments start with #. This variant originated from Unix systems and emphasizes explicitness to avoid ambiguity in operand types.[24]
Other assemblers introduce portable or specialized variants of Intel syntax. NASM employs a clean, portable Intel-like syntax with destination-first ordering (e.g., mov eax, ebx), mandatory square brackets for memory (e.g., [ebx + esi * 4 + 10]), and no register prefixes. It uses section .data and section .text for segments, EQU for constants (e.g., MAX EQU 100), and ; for comments. NASM's design prioritizes cross-platform compatibility and modularity.[25]
FASM adopts a flat-model-focused Intel syntax, also destination-first (e.g., mov eax, [ebx]), with square brackets for memory and size operators like dword (e.g., mov eax, dword [100h]). Equates use = (e.g., x = 1), sections are defined via section directives similar to NASM, and comments use ;. FASM emphasizes optimization and self-assembly, supporting multiple passes for code size reduction without high-level MASM constructs like PROC.[26]
Converting between these variants presents challenges, such as reversing operand orders, adding/removing prefixes like % for registers in AT&T, adjusting memory notation from parentheses to brackets, and harmonizing directives (e.g., .data vs. section .data). Tools like syntax converters or manual rewriting are often required, as automated translation can introduce errors in complex addressing or macros.[24][25]
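A deliberately tiny converter illustrates the mechanical part of such a translation, namely operand reversal, the % and $ prefixes, and the size suffix. It is a sketch only: the function name `intel_to_att` is invented here, and it handles just two-operand instructions with a 32-bit register destination and a register or decimal-immediate source; memory operands, other sizes, and directives are out of scope.

```python
import re

REGS32 = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}


def intel_to_att(line):
    """Convert 'op dst, src' (32-bit register or decimal immediate source) to AT&T form."""
    m = re.fullmatch(r"\s*(\w+)\s+(\w+)\s*,\s*(\w+)\s*", line)
    if not m:
        raise ValueError("unsupported instruction form: " + line)
    op, dst, src = (tok.lower() for tok in m.groups())
    if dst not in REGS32:
        raise ValueError("destination must be a 32-bit register")
    if src in REGS32:
        src_att = "%" + src              # registers take a % prefix
    elif src.isdigit():
        src_att = "$" + src              # immediates take a $ prefix
    else:
        raise ValueError("unsupported source operand: " + src)
    # AT&T order is source first; the 'l' suffix marks a 32-bit operation.
    return f"{op}l {src_att}, %{dst}"


print(intel_to_att("mov eax, ebx"))
print(intel_to_att("add ecx, 10"))
```

Anything beyond these simple forms, such as converting [ebx + esi*4 + 10h] to 0x10(%ebx,%esi,4), needs a real parser, which is why manual review remains common.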
; Example in Intel/MASM syntax
.data
msg db "Hello", 0
.code
mov rax, offset msg ; Destination first, no % prefix
# Example in AT&T/GAS syntax
.data
msg: .ascii "Hello\0"
.text
movq $msg, %rax # Source first, % prefix, $ for immediate
; Example in NASM syntax
section .data
msg db 'Hello', 0
section .text
mov rax, msg ; Square brackets for memory if needed
; Example in FASM syntax (ELF64 object output assumed)
format ELF64
section '.data' writeable
msg db 'Hello',0
section '.text' executable
mov rax, msg ; Destination first, no register prefix
Mnemonics and Opcodes
In x86 assembly language, mnemonics serve as human-readable symbolic representations of machine instructions, such as MOV for data movement or ADD for arithmetic addition, which directly correspond to specific binary opcodes executed by the processor.[27] These opcodes are fixed binary values that define the operation, with examples including 0x89 for MOV from register to register or 0x01 for ADD from register to memory.[27] The mapping ensures that assemblers translate mnemonic-based source code into the processor's native binary format, maintaining compatibility across Intel 64 and IA-32 architectures.[27]
x86 instructions employ a variable-length encoding scheme, typically ranging from 1 to 15 bytes, comprising optional prefixes, one or more opcode bytes, a ModR/M byte (if required for operand specification), an optional Scale-Index-Base (SIB) byte, displacement fields, and immediate data.[27] The ModR/M byte, an 8-bit field, encodes addressing modes and operand selection using three subfields: Mod (2 bits for mode), Reg/Opcode (3 bits for register or extension), and R/M (3 bits for register or memory base).[27] This flexible structure allows efficient encoding of diverse operand types, from register-to-register operations to complex memory accesses.[27]
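The three ModR/M subfields can be packed and unpacked with simple shifts and masks. The following Python sketch shows the bit layout (Mod in bits 7:6, Reg/Opcode in bits 5:3, R/M in bits 2:0); the function names are invented for this illustration.

```python
def encode_modrm(mod, reg, rm):
    """Pack the Mod (2 bits), Reg/Opcode (3 bits) and R/M (3 bits) fields into one byte."""
    if not (0 <= mod < 4 and 0 <= reg < 8 and 0 <= rm < 8):
        raise ValueError("field out of range")
    return (mod << 6) | (reg << 3) | rm


def decode_modrm(byte):
    """Return (mod, reg, rm) for a ModR/M byte."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


# Mod=11 selects register-direct operands; reg=0 is EAX and rm=3 is EBX.
print(hex(encode_modrm(0b11, 0, 3)))
print(decode_modrm(0xC3))
```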
Opcode organization relies on hierarchical tables: primary opcodes use a single byte (e.g., 0x00 to 0xFF for basic operations like ADD), while secondary opcodes are reached through the escape byte 0x0F, giving two-byte opcodes (e.g., 0F 01 for system instructions).[27] Further extensions include three-byte formats like 0F 38 or 0F 3A for advanced instructions (e.g., 0F 38 01 for packed horizontal addition).[27] Modern extensions differentiate legacy encodings from enhanced ones; for instance, the REX prefix (0x40 to 0x4F) in 64-bit mode selects a 64-bit operand size and extends the register fields so that the additional registers (R8-R15) can be encoded.[27] Similarly, the VEX prefix (a 2-byte form starting with 0xC5 or a 3-byte form starting with 0xC4) supports AVX vector instructions by embedding legacy prefixes and specifying vector lengths.[27]
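The REX prefix is a single byte of the form 0100WRXB, so its four flag bits can be read off with shifts. The Python sketch below decodes them; the function name `decode_rex` is an invention for this example.

```python
def decode_rex(byte):
    """Split a REX prefix byte (0x40-0x4F) into its W, R, X and B bits."""
    if byte & 0xF0 != 0x40:
        raise ValueError("not a REX prefix")
    return {
        "W": (byte >> 3) & 1,   # 1 = 64-bit operand size
        "R": (byte >> 2) & 1,   # extends the ModR/M reg field
        "X": (byte >> 1) & 1,   # extends the SIB index field
        "B": byte & 1,          # extends the ModR/M r/m or SIB base field
    }


# 48 89 C3 encodes mov rbx, rax: REX.W turns the 32-bit MOV (89 /r) into a 64-bit one.
print(decode_rex(0x48))
```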
Prefixes modify instruction behavior and contribute to variable length: the LOCK prefix (0xF0) ensures atomic operations on memory for multiprocessing synchronization, while REP (0xF3) or REPNE (0xF2) repeats string operations until a condition is met.[27] These elements allow instructions to adapt dynamically, such as a simple MOV r32, imm32 expanding to 5 bytes with opcode B8 plus the immediate value.[28]
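The 5-byte MOV r32, imm32 form mentioned above can be reproduced in a few lines. This Python sketch adds the register number to the base opcode B8 and appends the immediate in little-endian order; the register table follows the standard x86 numbering, while the function name is made up for the example.

```python
import struct

# Standard x86 register numbers 0-7 for the 32-bit general-purpose registers.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def encode_mov_r32_imm32(reg, imm):
    """Encode `mov reg, imm` as opcode B8+rd followed by a little-endian imm32."""
    opcode = 0xB8 + REG32.index(reg)
    return bytes([opcode]) + struct.pack("<I", imm & 0xFFFFFFFF)


print(encode_mov_r32_imm32("eax", 4).hex(" "))
print(encode_mov_r32_imm32("ebx", 1).hex(" "))
```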
Vendor-specific extensions introduce additional opcode spaces; AMD's 3DNow! uses a secondary escape sequence of 0x0F 0x0F followed by a ModR/M byte and an 8-bit immediate opcode (imm8) to encode SIMD floating-point operations, such as 0F 0F /r 9E for packed floating-point addition (PFADD).[29] This format reserves the imm8 for up to 256 unique operations, distinguishing it from Intel's SSE/AVX paths, though AMD now recommends migrating to standard vector extensions for broader compatibility.[29]
Disassembly tools like objdump from the GNU Binutils suite reverse this process, displaying both hexadecimal opcodes and corresponding mnemonics from object files or executables, as in objdump -d binary outputting lines like 89 c3: mov %eax,%ebx alongside the raw bytes.[30] This aids in verifying encodings and debugging low-level code.[30]
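What a disassembler does for one instruction form can be sketched in Python. The example below handles only the two-byte register-to-register case of opcode 89 (MOV r/m32, r32) and prints AT&T-style text similar to the objdump line quoted above; it is an illustrative fragment with an invented function name, not a description of how objdump is implemented.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def disasm_mov_rr(code):
    """Disassemble `89 /r` with Mod=11 (register to register) into AT&T syntax."""
    if len(code) != 2 or code[0] != 0x89 or (code[1] >> 6) != 0b11:
        raise ValueError("only the 2-byte register form of opcode 89 is handled")
    src = (code[1] >> 3) & 7      # reg field: source register
    dst = code[1] & 7             # r/m field: destination register
    return f"mov %{REG32[src]},%{REG32[dst]}"


print(disasm_mov_rr(bytes.fromhex("89c3")))
```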
Reserved Words and Directives
In x86 assembly language, reserved words are identifiers that the assembler treats as fixed and that the programmer cannot redefine, including register names and certain symbols. These reservations ensure consistent interpretation during assembly; attempting to redefine one, for example by using a register name as a variable, typically causes a syntax error.[31][32] Register names like EAX, ESP, and their variants (e.g., AH, AL, AX for the 8-bit and 16-bit portions) are prime examples of reserved words across assemblers, as they map directly to hardware registers. In Microsoft Macro Assembler (MASM), the list includes EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, and segment registers like CS, DS, SS, ES, FS, GS, all of which are reserved under all CPU modes to maintain compatibility. Similarly, in the Netwide Assembler (NASM), registers such as RAX (in 64-bit mode) and their low/high byte variants are reserved; the legacy high-byte registers such as AH cannot be used in an instruction that requires a REX prefix. Misuse, such as defining EAX as a label, results in immediate assembly failure, so programmers need to avoid keyword conflicts.[31][32]

Directives, also known as assembler pseudo-instructions, are non-executable commands that guide the assembly process, such as defining data, managing memory layout, or structuring code, and they vary slightly between assemblers like MASM and NASM. For data definition, common directives include DB (define byte), DW (define word, 2 bytes), and DD (define doubleword, 4 bytes), which allocate and initialize memory with specified values; for example, DB 42 emits one byte with the value 42, while DD 0x12345678 emits four bytes holding a 32-bit integer. These directives are available in MASM, NASM, and similar Intel-syntax assemblers and are essential for embedding constants or arrays without runtime overhead.[31][32]
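The bytes that these data directives place in the output can be previewed with Python's `struct` module. The helper names `db`, `dw` and `dd` below simply mirror the directive names for this sketch; x86 stores multi-byte values in little-endian order, which is what the `<` format prefix selects.

```python
import struct


def db(*values):
    """Bytes emitted by DB: one byte per value."""
    return struct.pack(f"<{len(values)}B", *values)


def dw(*values):
    """Bytes emitted by DW: two little-endian bytes per value."""
    return struct.pack(f"<{len(values)}H", *values)


def dd(*values):
    """Bytes emitted by DD: four little-endian bytes per value."""
    return struct.pack(f"<{len(values)}I", *values)


print(db(42).hex(" "))
print(dd(0x12345678).hex(" "))
```

The second line of output shows the least significant byte first, which is how the doubleword appears in a hex dump of the assembled file.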
Segment and layout directives control how code and data are organized in the output file. In MASM, SEGMENT defines a memory segment, such as DATA SEGMENT to group variables (the simplified .DATA directive opens the default data segment), and ASSUME specifies register-segment associations, like ASSUME DS:DATA, to tell the assembler which segment each segment register is expected to address. NASM uses SECTION (or SEGMENT) similarly to switch between sections like .text for code or .bss for uninitialized data, with ORG setting the absolute origin address in flat binary outputs, e.g., ORG 0x1000 to start code at a specific offset. External source files are incorporated with INCLUDE in MASM (e.g., INCLUDE macros.inc) and with the %include preprocessor directive in NASM (e.g., %include "macros.inc"), which helps modularize assembly code. Improper use, such as mismatched ASSUME declarations, can cause assembly errors or incorrect memory references during execution.[31][32]
Program structure directives mark the boundaries of code units. In MASM, PROC declares a procedure, e.g., main PROC, paired with ENDP to close it, enabling structured programming with local labels, while END signals the program's termination and optionally specifies an entry point like END main. NASM lacks native PROC/ENDP but uses %define for macro definitions, e.g., %define MAX 100, which acts as a text substitution for constants or simple macros without procedure semantics. These directives ensure proper scoping; for instance, omitting ENDP in MASM leads to unresolved symbol errors at assembly time. Assembler-specific variations, such as MASM's DUP for repeating data definitions (e.g., array DW 10 DUP(0)), highlight the need to consult variant-specific documentation to avoid portability issues.[31][32]
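As a small illustration of what MASM's DUP operator expands to, the following Python sketch builds the bytes for a repeated word definition such as array DW 10 DUP(0). The helper name `dw_dup` is hypothetical and exists only for this example.

```python
import struct


def dw_dup(count, value):
    """Bytes produced by `DW count DUP(value)`: the 16-bit value repeated count times."""
    return struct.pack("<H", value & 0xFFFF) * count


array = dw_dup(10, 0)
print(len(array))
```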
Processor Architecture
Registers
The x86 architecture features a diverse set of registers that form the core of its register file, enabling efficient data manipulation, memory addressing, and control of processor state across various operating modes. These registers have evolved from the original 16-bit design of the Intel 8086 to support 32-bit and 64-bit extensions, with additional specialized registers introduced through SIMD and other enhancements. The general-purpose, segment, control, and debug registers provide the foundational hardware for assembly programming, while the flags register captures execution status for conditional operations. The x87 floating-point unit (FPU) includes eight 80-bit floating-point registers organized as a stack (ST0 through ST7), along with control (FCW), status (FSW), tag (FTW), instruction pointer (FIP), data pointer (FDP), and opcode (FOp) registers for managing floating-point operations and exceptions.[1]

General-purpose registers (GPRs) serve as the primary storage for operands, addresses, and computation results in x86 assembly. In the original 16-bit 8086 architecture, there are eight 16-bit GPRs: AX, BX, CX, DX, SI, DI, BP, and SP. The first four can also be accessed as 8-bit sub-registers for the high and low bytes (e.g., AH and AL for AX). These were extended to 32-bit registers in the 80386 processor (EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP), allowing larger data handling while maintaining backward compatibility through the lower 16- and 8-bit portions. In 64-bit mode (Intel 64), these expand to 64-bit registers (RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP) plus eight additional ones (R8 through R15), requiring the REX prefix for access to the new registers and full 64-bit widths; all GPRs support low-byte subregister access (e.g., AL, R8B), though the REX prefix is required for certain subregisters like SPL, BPL, SIL, DIL and the new registers R8B–R15B. The ESP/RSP register specifically functions as the stack pointer, while EBP/RBP acts as the base pointer for stack frames.[1]

| Register Group | 16-bit | 32-bit | 64-bit | Key Roles |
|---|---|---|---|---|
| Accumulator | AX | EAX | RAX | Arithmetic, I/O operations |
| Base | BX | EBX | RBX | Base addressing, data storage |
| Counter | CX | ECX | RCX | Loop counters, shifts |
| Data | DX | EDX | RDX | I/O port addressing, multiplication/division |
| Source Index | SI | ESI | RSI | String source addressing |
| Destination Index | DI | EDI | RDI | String destination addressing |
| Base Pointer | BP | EBP | RBP | Stack frame base |
| Stack Pointer | SP | ESP | RSP | Stack top management |
| Additional (64-bit only) | - | - | R8–R15 | General data and addressing |
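How the narrower register names alias parts of the same 64-bit register can be modelled in Python. The sketch below is a toy model with an invented class name; it also encodes the standard x86-64 rule that writing a 32-bit register zeroes bits 63:32, while 8-bit and 16-bit writes leave the remaining bits unchanged.

```python
MASK64 = (1 << 64) - 1


class GPR:
    """Toy model of RAX and its aliases EAX, AX, AH and AL."""

    def __init__(self, value=0):
        self.rax = value & MASK64

    def write_eax(self, v):
        self.rax = v & 0xFFFFFFFF                      # upper 32 bits become zero

    def write_ax(self, v):
        self.rax = (self.rax & ~0xFFFF & MASK64) | (v & 0xFFFF)

    def write_al(self, v):
        self.rax = (self.rax & ~0xFF & MASK64) | (v & 0xFF)

    def write_ah(self, v):
        self.rax = (self.rax & ~0xFF00 & MASK64) | ((v & 0xFF) << 8)

    def read_ah(self):
        return (self.rax >> 8) & 0xFF


r = GPR(0x1122334455667788)
r.write_al(0xAA)          # only bits 7:0 change
r.write_eax(1)            # bits 63:32 are cleared as well
print(hex(r.rax))
```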
Memory Addressing
In x86 assembly language, memory addressing modes determine how operands are specified for instructions, allowing access to registers, immediate values, or memory locations. These modes provide flexibility in forming effective addresses, which are computed as offsets within a segment or linear addresses in flat models. The primary modes include immediate, register, direct, register indirect, and more complex forms combining base registers, indices, scales, and displacements.[33]
Immediate addressing embeds a constant value directly in the instruction, used for operations like loading a literal into a register. For example, mov eax, 5 places the value 5 into the EAX register without referencing memory. Register addressing operates solely on processor registers, such as mov eax, ebx, which copies the contents of EBX to EAX. These modes are efficient as they avoid memory access.[33]
Direct addressing specifies a fixed memory address in the instruction, as in mov eax, [100h], where the contents at address 100h are loaded into EAX. Register indirect addressing uses a register to hold the memory address, for instance mov eax, [ebx], dereferencing the value in EBX as the location. Unlike some architectures like ARM, x86 does not support automatic pre- or post-increment in these indirect modes; increments require separate instructions such as INC.[33]
The most versatile mode is the base-plus-index-plus-scale-plus-displacement form, which computes the effective address as base register + (index register × scale) + displacement. Here, the base and index are general-purpose registers (e.g., EBX and ESI), the scale is 1, 2, 4, or 8 for array access, and the displacement is an optional constant. An example is mov eax, [ebx + esi*4 + 10h], useful for traversing data structures like arrays. In 64-bit mode, this mode supports 64-bit registers but limits displacements to 32 bits, sign-extended during calculation.[33]
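The effective-address formula above can be written out as a short calculation. This is a minimal sketch in Python (chosen so it can be executed as-is); the function name `effective_address` and its parameters are illustrative, and the result is truncated to the address size as an assumption about 32-bit addressing.

```python
def effective_address(base=0, index=0, scale=1, disp=0, addr_bits=32):
    """Compute base + index*scale + disp, truncated to the address size."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    if not -(1 << 31) <= disp < (1 << 31):
        raise ValueError("displacement must fit in a signed 32-bit value")
    return (base + index * scale + disp) & ((1 << addr_bits) - 1)

# mov eax, [ebx + esi*4 + 10h] with EBX = 0x1000 and ESI = 3
ea = effective_address(base=0x1000, index=3, scale=4, disp=0x10)

# A negative displacement near address zero wraps modulo 2**32.
wrapped = effective_address(base=0x4, disp=-8)
```

With a scale of 4 the index register acts as an element number for an array of doublewords, so `ea` is the address of element 3 in a table that starts 0x10 bytes past EBX.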
RIP-relative addressing, available only in 64-bit mode, forms addresses relative to the instruction pointer (RIP) plus a 32-bit signed displacement, enabling position-independent code without absolute addresses. For example, mov eax, [rip + offset] loads from a location offset from the current instruction. This mode enhances portability in shared libraries.[33]
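As an illustration of the arithmetic, the sketch below models how a RIP-relative operand is resolved. It relies on the fact that RIP already points at the next instruction when the displacement is added, so the instruction length matters. The function names are hypothetical, and the 6-byte length assumes the `8B 05 disp32` encoding of `mov eax, [rip + disp32]`.

```python
import struct

def rip_relative_target(insn_addr, insn_len, disp32):
    """Address referenced by [rip + disp32] for an instruction at insn_addr."""
    next_rip = insn_addr + insn_len      # RIP points past the current instruction
    return (next_rip + disp32) & ((1 << 64) - 1)

def encode_disp32(insn_addr, insn_len, target):
    """Little-endian disp32 that reaches target from this instruction."""
    disp = target - (insn_addr + insn_len)
    return struct.pack("<i", disp)

# A 6-byte mov eax, [rip + 0x2000] located at 0x401000
target = rip_relative_target(0x401000, 6, 0x2000)

# Displacement needed to reach data placed before the code
disp_bytes = encode_disp32(0x401000, 6, 0x400F00)
```

Because only the distance between instruction and data is encoded, the same bytes work wherever the image is loaded, which is what makes the mode suitable for position-independent code.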
When operand sizes are ambiguous, especially for memory references, explicit size specifiers disambiguate the instruction. Directives like BYTE PTR for 8-bit, WORD PTR for 16-bit, or DWORD PTR for 32-bit ensure correct interpretation, as in mov byte ptr [esi], 5. Failure to specify can lead to assembler errors or unintended sizes.[33]
x86 supports both flat and segmented addressing models. In the flat model, prevalent in 64-bit mode, addresses are linear without segment bases (defaults to zero), simplifying access to a continuous address space. Segmented addressing, used in IA-32 real or protected modes, combines segment selectors with offsets but is detailed separately; the addressing modes here form the offset component in both cases.[33]
Segmented Memory Model
The segmented memory model in x86 architecture divides the memory address space into variable-sized segments to facilitate addressing beyond the limitations of early processors. In real mode, which is the default execution mode upon processor reset and emulates the 8086 environment, memory addressing employs a 20-bit physical address space calculated using a segment:offset pair.[33] The segment register, such as CS for code or DS for data, holds a 16-bit value that is shifted left by 4 bits (multiplied by 16) and added to a 16-bit offset to yield the physical address, allowing access to up to 1 MB of memory while each segment is limited to 64 KB.[33] For instance, the address of the next instruction is formed from the code segment CS and the instruction pointer IP as CS * 16 + IP.[33]
In protected mode, introduced with the Intel 80286 and expanded in subsequent processors, the segmented model evolves to support memory protection, larger address spaces, and multitasking through descriptor tables.[33] The Global Descriptor Table (GDT) provides system-wide segment definitions, while the Local Descriptor Table (LDT) allows task-specific segments, both loaded into memory and referenced by the GDTR and LDTR registers, respectively.[33] Each segment descriptor is an 8-byte structure containing a base address (32 bits wide in 32-bit mode), a limit defining the segment size (expandable via the granularity bit to 4 GB), and access rights including privilege levels (0-3 for ring protection), type (code, data, stack), and attributes like readability or writability.[33]
Segment registers in protected mode hold 16-bit selectors that index into the GDT or LDT to retrieve the corresponding descriptor, enabling dynamic segment relocation and protection checks.[33] A selector comprises a 13-bit index, a 1-bit Table Indicator (TI) to distinguish GDT (TI=0) from LDT (TI=1), and a 2-bit Requestor Privilege Level (RPL) for access validation against the descriptor's privilege.[33] Once a selector has been loaded, the processor uses the descriptor's base and limit to compute the linear address as base + offset, with violations triggering exceptions like general-protection (#GP) for out-of-limit accesses or privilege mismatches.[33]
In 32-bit and 64-bit modes, modern operating systems typically adopt a flat memory model that minimizes segmentation's complexity by using a single, continuous address space spanning 0 to 4 GB in 32-bit protected mode or 0 to 2^64 bytes in 64-bit long mode.[33] This is achieved by configuring segment descriptors with a base address of 0 and a limit of 4 GB (or unlimited in 64-bit), effectively ignoring segmentation for most operations while retaining the mechanism for compatibility.[33] Exceptions include the FS and GS segments, which can have non-zero bases to support thread-local storage (TLS) and other OS-specific uses without altering the flat addressing for code, data, and stack.[33]
The segmented model's legacy from real mode introduces challenges, such as wraparound behavior in which offsets are taken modulo 64 KB and wrap back to 0, potentially causing unintended overlaps between segments and complicating legacy code porting.[33] These issues persist for backward compatibility with 8086 software, requiring careful handling in emulators or mode transitions to avoid faults like invalid memory accesses.[33]
Operating Modes
Real Mode
Real mode, also known as real-address mode, is the default operating mode for x86 processors upon power-on reset or boot, providing backward compatibility with the original Intel 8086 architecture.[33] In this environment, the processor uses a segmented memory model with 16-bit segment registers and 16-bit offsets to form 20-bit physical addresses, limiting the addressable memory space to 1 MB (from 0x00000 to 0xFFFFF).[33] The physical address is calculated by shifting the 16-bit segment value left by 4 bits (multiplying by 16) and adding the 16-bit offset, with no memory protection mechanisms in place, allowing unrestricted access to the full address space at privilege level 0.
Interrupt handling in real mode relies on the Interrupt Vector Table (IVT), a fixed structure located at physical address 0000:0000 (the first 1 KB of memory), containing 256 four-byte entries that point to interrupt service routines.[33] This setup enables direct invocation of BIOS and DOS services through software interrupts, as seen in traditional MS-DOS programming where applications interact with hardware via standardized interrupt vectors such as INT 21h for DOS functions and INT 13h for disk operations.[34]
Real mode imposes several key limitations suited to early 16-bit systems. It supports no native multitasking, as there are no privilege rings or task switching mechanisms, and all code executes with equal access to memory and hardware ports.
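The segment:offset arithmetic and the layout of the interrupt vector table can be checked with a few lines of code. The following is a sketch in Python with a hypothetical helper `real_mode_address`; the `a20_enabled` switch is an assumption added to show the 1 MB wraparound of the original 8086 next to the behavior with the A20 line enabled.

```python
def real_mode_address(segment, offset, a20_enabled=True):
    """Physical address for segment:offset in real mode."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    # Without A20 the 21st address bit is dropped and addresses wrap at 1 MB.
    return linear if a20_enabled else linear & 0xFFFFF

boot = real_mode_address(0x07C0, 0x0000)
alias = real_mode_address(0x0000, 0x7C00)    # a different pair, same byte
top = real_mode_address(0xFFFF, 0x0010)
wrapped = real_mode_address(0xFFFF, 0x0010, a20_enabled=False)

# Each IVT entry is 4 bytes (offset, then segment), so vector n lives at n * 4.
ivt_entry_21h = 0x21 * 4
```

The two pairs 07C0:0000 and 0000:7C00 name the same physical byte, which shows how many distinct segment:offset combinations overlap in real mode.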
Segments are capped at 64 KB in size and must align on 16-byte boundaries, restricting code and data blocks while permitting direct I/O operations without mediation, which facilitates low-level hardware control but risks system instability.[33]
In contemporary systems, real mode persists primarily for compatibility in bootloaders, such as the initial stage of GNU GRUB on x86 platforms, where it loads the core image and modules before transitioning to higher modes.[35] It also remains relevant for embedded applications running under legacy MS-DOS environments, enabling direct hardware manipulation in resource-constrained settings like industrial controllers or vintage software emulation.[34]
To exit real mode and enter protected mode, software must first initialize a Global Descriptor Table (GDT) and then set the Protection Enable (PE) bit in the CR0 register, either with a MOV to CR0 or with the older LMSW (Load Machine Status Word) instruction, enabling memory protection and expanded addressing.
Protected Mode
Protected mode was introduced in 16-bit form with the Intel 80286 and extended to 32 bits with the Intel 80386 processor; it enables advanced memory management, protection mechanisms, and support for multitasking.[36] It is activated from real mode by setting the Protection Enable (PE) bit (bit 0) in the CR0 control register using a MOV CR0 instruction, followed by a far jump (or another far control transfer) to load a code segment selector from the Global Descriptor Table (GDT).[36] The GDT, loaded into the GDTR register via the LGDT instruction, contains segment descriptors that define up to 4 GB of linear address space through base addresses, limits (up to 4 GB per segment when the granularity bit is set), and access rights.[36] This segmentation allows logical addresses (segment selector + offset) to be translated into linear addresses, providing a foundation for protected execution.[36]
A key feature of protected mode is its hierarchical protection rings, which enforce privilege levels to isolate code execution and prevent unauthorized access to system resources.[36] There are four rings (0 to 3), with Ring 0 designated for the most privileged kernel-mode code and Ring 3 for least-privileged user-mode applications.[36] The Current Privilege Level (CPL), encoded in bits 0-1 of the CS and SS segment registers, determines the executing ring, while the Descriptor Privilege Level (DPL) in segment descriptors and the Requested Privilege Level (RPL) in selectors govern access checks.[36] Privilege transitions, such as from Ring 3 to Ring 0, are controlled through mechanisms like call gates, interrupt gates, and task gates, which validate levels before allowing sensitive operations like system calls.[36]
Virtual memory in protected mode is implemented via paging, which maps linear addresses to physical addresses for abstraction and isolation.[36] Paging is enabled by setting the PG bit (bit 31) in CR0, after which the CR3 register points to the base of a page directory containing 1024 entries, each referencing a page table with another 1024 entries for 4 KB pages.[36] A linear address is divided into three parts: a directory index (bits 31-22), a table index (bits 21-12), and a page offset (bits 11-0), enabling up to 4 GB of virtual address space per process.[36] The Translation Lookaside Buffer (TLB), a hardware cache, stores recent address translations to accelerate paging operations and reduce latency.[36]
Multitasking support in protected mode relies on the Task State Segment (TSS) for context switching between tasks and the Interrupt Descriptor Table (IDT) for handling interrupts.[36] The TSS, described in the GDT as a system segment, stores the complete state of a task, including general-purpose registers, segment registers, and stack pointers for each privilege level (0-2), and is loaded into the task register via the LTR instruction.[36] Task switches occur via the CALL, JMP, IRET, or exception/interrupt mechanisms, saving the current task state to its TSS and loading the new one.[36] The IDT, loaded via LIDT into the IDTR register, contains up to 256 interrupt vectors, each as a gate descriptor (task, interrupt, or trap gate) that directs control to handlers, often in Ring 0, with privilege checks enforced.[36]
In practice, operating systems such as Windows and Linux utilize protected mode with a flat memory model, where segment registers are set to cover the entire 4 GB linear address space (base 0, limit 4 GB), minimizing segmentation overhead while relying on paging for process isolation and virtual memory management.[36][37] This approach allows each process to have its own page directory for isolated virtual address spaces, enabling secure multitasking without complex segment usage.[36]
Long Mode
Long Mode, also known as 64-bit mode within the x86-64 architecture, represents the core extension introduced by AMD to enable full 64-bit processing on x86 processors, first implemented in the AMD Opteron in 2003. This mode expands the address space to 64-bit virtual addresses, though current implementations use 48-bit effective addressing with higher bits sign-extended for canonical form, allowing access to up to 256 terabytes of virtual memory per process. General-purpose registers are widened to 64 bits (e.g., RAX, RBX), and eight additional 64-bit registers (R8 through R15) are provided to support more efficient 64-bit computation without legacy 32-bit constraints. RIP-relative addressing further enhances this mode by permitting memory operands to be offset from the instruction pointer (RIP), facilitating position-independent code commonly used in modern shared libraries and executables.
Long Mode operates in two sub-modes to balance new capabilities with legacy support: 64-bit mode for native execution of 64-bit instructions and applications, and compatibility mode, which allows unmodified 32-bit and 16-bit protected-mode code to run under a 64-bit operating system by emulating the protected-mode environment (e.g., default address size of 32 bits or 16 bits). Canonical addressing enforces validity by requiring all virtual addresses to lie within the signed range from -2^47 to 2^47 - 1, where bits 63 through 48 must replicate the sign of bit 47; non-canonical addresses trigger general-protection faults to prevent invalid memory access.
In 64-bit mode, the segmented memory model is simplified to a flat address space, with most segment registers (CS, DS, ES, SS) ignored and treated as having base address 0 and limit 2^64 - 1, eliminating the need for segment descriptors in user code.
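The canonical-address rule described above amounts to a sign-extension check, which the following sketch makes concrete. It is written in Python for easy experimentation; the function names are hypothetical, and the `va_bits` parameter is an assumption that lets the same check cover 57-bit addressing as well as the 48-bit case.

```python
def is_canonical(addr, va_bits=48):
    """True if bits 63..va_bits all equal bit va_bits-1 (bit 47 by default)."""
    addr &= (1 << 64) - 1
    upper = addr >> (va_bits - 1)        # bit 47 and everything above it
    return upper == 0 or upper == (1 << (64 - va_bits + 1)) - 1

def sign_extend(addr, va_bits=48):
    """Turn a va_bits-wide address into its canonical 64-bit form."""
    addr &= (1 << va_bits) - 1
    if addr >> (va_bits - 1):
        addr |= ((1 << 64) - 1) ^ ((1 << va_bits) - 1)
    return addr

low_half_top = is_canonical(0x00007FFFFFFFFFFF)     # highest lower-half address
high_half_base = is_canonical(0xFFFF800000000000)   # lowest upper-half address
in_the_hole = is_canonical(0x0000800000000000)      # first non-canonical address
```

The two valid ranges sit at the bottom and top of the 64-bit space, with an unusable gap between them; operating systems commonly place user space in the lower range and the kernel in the upper one.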
Exceptions are the FS and GS segments, which remain functional for thread-local storage and can specify 64-bit base addresses loaded via model-specific registers such as FS_BASE (MSR C000_0100h) and GS_BASE (MSR C000_0101h).
Paging is required for all operations in Long Mode and builds on the Physical Address Extension (PAE). Four-level page tables map 48-bit virtual addresses, and optional five-level page tables (supported since 2017 in Intel processors and widely adopted by 2025) map 57-bit virtual addresses, in both cases onto physical addresses of up to 52 bits, with support for 4 KB, 2 MB, and 1 GB page sizes.
Adoption of Long Mode accelerated with major operating systems: the Linux kernel introduced x86-64 support in version 2.6.0, released on December 17, 2003, enabling widespread use in distributions by 2004. Microsoft followed with Windows XP Professional x64 Edition, released on April 25, 2005, marking the first consumer x86-64 version of Windows and building on earlier server editions from 2003.[38]
Mode Transitions
Mode transitions in x86 assembly language involve precise sequences of instructions to switch between operating modes, ensuring compatibility with the processor's state and avoiding exceptions. These transitions are critical for bootloaders and operating system kernels, as they enable access to advanced features like protected memory and 64-bit addressing while maintaining backward compatibility. The process typically requires configuring control registers, loading descriptor tables, and executing jumps to update the processor's execution environment.[39]
The transition from real mode to protected mode begins with enabling the A20 address line to access memory above 1 MB, followed by loading the Global Descriptor Table (GDT) using the LGDT instruction to specify its base address and limit. Interrupts are disabled (CLI) to prevent interference, and the protection enable (PE) bit in CR0 is set to 1 via MOV CR0, eax (with the appropriate value in EAX). A far jump (JMP FAR), or another far transfer such as a far return (RETF), is then executed to load a valid 32-bit code segment selector into CS from the GDT, flushing the instruction prefetch queue and switching the processor to protected mode. Finally, other segment registers (DS, SS, ES, FS, GS) are loaded with appropriate selectors, and the Interrupt Descriptor Table (IDT) is loaded using LIDT. This sequence allows the use of segmented memory and privilege levels.[39]
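The GDT that LGDT points at is just an array of 8-byte descriptors, and building one by hand shows where the base, limit, and access bits end up. The sketch below packs a minimal flat GDT in Python; the helper names are hypothetical, and the access bytes 0x9A and 0x92 and the flag nibble 0xC are the values conventionally used for flat ring-0 code and data segments, chosen here as an assumption about what a simple bootloader would want.

```python
import struct

def make_descriptor(base, limit, access, flags):
    """Pack a legacy 8-byte segment descriptor.

    limit is the 20-bit limit field, access the access byte (P, DPL, S, type)
    and flags the upper nibble (G, D/B, L, AVL).
    """
    value = limit & 0xFFFF                     # limit bits 15..0
    value |= (base & 0xFFFFFF) << 16           # base bits 23..0
    value |= (access & 0xFF) << 40             # access byte
    value |= ((limit >> 16) & 0xF) << 48       # limit bits 19..16
    value |= (flags & 0xF) << 52               # G, D/B, L, AVL
    value |= ((base >> 24) & 0xFF) << 56       # base bits 31..24
    return struct.pack("<Q", value)

def make_selector(index, ti=0, rpl=0):
    """Selector = 13-bit index, table indicator, requested privilege level."""
    return (index << 3) | (ti << 2) | rpl

gdt = [
    bytes(8),                                  # mandatory null descriptor
    make_descriptor(0, 0xFFFFF, 0x9A, 0xC),    # flat ring-0 code, 4 GB
    make_descriptor(0, 0xFFFFF, 0x92, 0xC),    # flat ring-0 data, 4 GB
]
CODE_SEL = make_selector(1)
DATA_SEL = make_selector(2)
gdtr_limit = len(gdt) * 8 - 1                  # limit field is size minus one
```

The selectors 0x08 and 0x10 produced here are the values a far jump and the subsequent segment-register loads would use with this table.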
Switching from protected mode to long mode (IA-32e mode) requires first enabling Physical Address Extension (PAE) by setting the PAE bit in CR4 to 1. The CR3 register is loaded with the physical address of the top-level paging structure, the Page Map Level 4 (PML4) table, whose entries point to page-directory-pointer tables. The long mode enable (LME) bit in the Extended Feature Enable Register (EFER) is set to 1 using a model-specific register write (WRMSR). Paging is then enabled by setting the PG bit in CR0 to 1, which activates long mode, and a far jump is performed to a 64-bit code segment selector (with the L bit set in the GDT descriptor) to enter the 64-bit submode. These steps establish four-level paging and make 64-bit features such as RIP-relative addressing available.[39]
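To make the four-level structure concrete, the sketch below splits a 48-bit virtual address into the table indices the processor walks, and lists the control bits named in the steps above. It is an illustrative Python model: the dictionary keys and constant names are hypothetical labels, and 4 KB pages are assumed.

```python
def split_virtual_address(va):
    """Split a canonical 48-bit address into 4-level paging indices (4 KB pages)."""
    return {
        "pml4": (va >> 39) & 0x1FF,    # bits 47..39 index the PML4 table
        "pdpt": (va >> 30) & 0x1FF,    # bits 38..30 index the PDPT
        "pd": (va >> 21) & 0x1FF,      # bits 29..21 index the page directory
        "pt": (va >> 12) & 0x1FF,      # bits 20..12 index the page table
        "offset": va & 0xFFF,          # bits 11..0 are the byte within the page
    }

# Control bits touched during the switch, in the order they are set.
CR4_PAE = 1 << 5
EFER_MSR = 0xC0000080
EFER_LME = 1 << 8
CR0_PE = 1 << 0
CR0_PG = 1 << 31

parts = split_virtual_address(0xFFFF800000201ABC)
```

Each index is 9 bits wide because every table holds 512 eight-byte entries, filling exactly one 4 KB page.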
Transitioning from 64-bit mode to 32-bit compatibility mode within long mode occurs by loading a code segment descriptor with the L bit cleared (indicating 32-bit operation) via a far return (RETF) or interrupt return (IRET) instruction, using a selector from the GDT or LDT that points to a compatibility-mode code segment. A SYSRET executed with a 32-bit operand size likewise returns to compatibility-mode code. This allows legacy 32-bit code to execute without leaving long mode, preserving the paging and segment structures.[39]
Invalid mode transitions can trigger exceptions, such as a general-protection fault (#GP) from malformed GDT entries or a page fault (#PF) from invalid paging setups, potentially escalating to a double fault (#DF) if the handler fails. A triple fault results when the double-fault handler itself causes an exception (e.g., due to an invalid IDT entry or stack overflow), leading to a processor shutdown and system reset with no software recovery possible. In real-mode transitions, failing to enable the A20 line risks address wraparound, corrupting data access above 1 MB.[39]
Initial mode handling is managed by firmware: traditional BIOS initializes the processor in real mode, loading the boot sector at 0x7C00 and requiring bootloader intervention for transitions. UEFI firmware, in contrast, performs the switch itself and hands control to boot applications already in long mode on x86-64 systems, providing a PE/COFF loader and handling initial paging and descriptor setup before transferring control.[40]
Instruction Set
Data Movement Instructions
Data movement instructions in x86 assembly language facilitate the transfer of data between registers, memory locations, and immediate values, forming the foundation for data manipulation without performing arithmetic or logical operations. These instructions support various operand sizes, including bytes, words (16 bits), doublewords (32 bits), and quadwords (64 bits) in 64-bit mode, and adhere to the processor's addressing modes for efficient memory access. They are essential for initializing variables, passing parameters, and managing data flow in programs, with operations typically not affecting the processor's flags unless specified otherwise.[41]
The MOV instruction performs a general-purpose data transfer, copying the contents of the source operand to the destination operand while leaving the source unchanged. It supports transfers between registers (e.g., MOV EAX, EBX), from memory to registers or vice versa (e.g., MOV EAX, [EBX]), and from immediate values to registers or memory (e.g., MOV EAX, 42), but it does not allow memory-to-memory moves, loading an immediate directly into a segment register, or copying one segment register directly to another. In 64-bit mode, MOV operates on 64-bit registers like RAX, and it does not affect any flags. For example, MOV ECX, [EAX + 4] loads a 32-bit value from the memory address EAX + 4 into ECX, using base-plus-displacement addressing. MOV does not accept the LOCK prefix; naturally aligned loads and stores are already atomic, and atomic read-modify-write operations are provided by instructions such as XCHG.[41]
PUSH and POP instructions handle stack-based data movement, automatically adjusting the stack pointer (ESP in 32-bit mode or RSP in 64-bit mode) to push or pop values onto or from the stack. PUSH decrements the stack pointer by the operand size (e.g., 8 bytes for quadwords in 64-bit mode) and stores the source operand (register, memory, or immediate) at the new top of the stack, as in PUSH EAX, which saves the value of EAX before a subroutine call. Conversely, POP loads the value from the top of the stack into the destination operand (register or memory) and increments the stack pointer, restoring the saved value with POP EAX after the subroutine returns. These instructions do not affect flags and are crucial for function calls, local variable allocation, and interrupt handling, with PUSH supporting immediate values up to 32 bits even in 64-bit mode. In stack overflow scenarios, they rely on the operating system's stack limits for protection.[41]
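The order of operations in PUSH and POP (adjust the pointer, then store; load, then adjust the pointer) can be modelled as follows. This is a minimal Python sketch of a 32-bit stack with a hypothetical `Stack32` class; the starting ESP of 0x1000 and the little-endian byte layout are assumptions made for the demonstration.

```python
class Stack32:
    """Toy model of a 32-bit, downward-growing stack."""

    def __init__(self, esp=0x1000, size=0x1000):
        self.mem = bytearray(size)
        self.esp = esp

    def push(self, value):
        self.esp -= 4                              # ESP is decremented first
        data = (value & 0xFFFFFFFF).to_bytes(4, "little")
        self.mem[self.esp:self.esp + 4] = data     # then the value is stored

    def pop(self):
        value = int.from_bytes(self.mem[self.esp:self.esp + 4], "little")
        self.esp += 4                              # ESP is incremented afterwards
        return value

stack = Stack32()
stack.push(0x11111111)       # push eax
stack.push(0x22222222)       # push ebx
esp_after_pushes = stack.esp
first_pop = stack.pop()      # pop ebx
second_pop = stack.pop()     # pop eax
```

Values come back in the reverse of the order in which they were pushed, which is why register saves and restores around a call are written as mirror images.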
The XCHG instruction exchanges the contents of two operands, swapping a register with another register or with a memory location, which is particularly useful for implementing locks in multithreaded applications. For instance, XCHG EAX, EBX interchanges the values in EAX and EBX, while XCHG EAX, [MEM] swaps EAX with the memory at address MEM. It supports byte, word, doubleword, or quadword sizes. An XCHG with a memory operand is always performed atomically: the processor applies its locking protocol automatically, whether or not a LOCK prefix is present, so other processors cannot read or write the location during the exchange. XCHG does not affect flags and requires at least one operand to be a register, making it efficient for semaphore operations without additional synchronization primitives. In 64-bit mode, it operates on 64-bit registers like RAX.[41]
LEA (Load Effective Address) computes the effective address of a memory operand and stores it in a register without accessing the memory itself, enabling efficient address arithmetic such as scaling and indexing. An example is LEA EAX, [EBX + 4*ECX], which calculates the address EBX + 4*ECX and loads it into EAX, useful for pointer manipulation or array indexing. It supports all addressing modes, including displacement, base, index, and scale, but treats the operand as an address expression rather than dereferencing it. LEA does not affect flags and is available in 32-bit and 64-bit modes, where it can produce 64-bit addresses in registers like RAX. This instruction optimizes code by combining multiple ADD operations into a single instruction, though it cannot load segment registers.[41]
String movement instructions like MOVS and LODS enable efficient block transfers of data using dedicated index registers (SI/ESI/RSI for the source and DI/EDI/RDI for the destination, depending on the address size), with the direction determined by the DF (Direction Flag) in the EFLAGS register. MOVS copies a byte, word, doubleword, or quadword from the source string (at [RSI]) to the destination string (at [RDI]), then increments or decrements both pointers by the element size according to DF (forward if DF=0, backward if DF=1), as in MOVS DWORD PTR [EDI], DWORD PTR [ESI]. The REP prefix repeats the operation ECX/RCX times, decrementing the counter until zero, making it ideal for memcpy-like operations on large buffers. Similarly, LODS loads a string element from [RSI] into AL/AX/EAX/RAX and updates RSI; because each load overwrites the accumulator, LODS is normally used inside an explicit loop rather than with REP. The element size is given either by an explicit operand size (e.g., BYTE PTR) or by the suffixed mnemonics MOVSB, MOVSW, MOVSD, and MOVSQ. The source segment can be overridden, and in 64-bit mode the instructions handle quadword elements with 64-bit index registers. They do not affect the status flags, focusing purely on data relocation.[41]
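The interaction between REP, the counter, and the direction flag is easiest to see in a small simulation. The sketch below models REP MOVSB in Python over a byte array; the function name and the flat-memory model are illustrative assumptions. The second call shows why a backward copy (DF=1) is used when the destination overlaps the source at a higher address.

```python
def rep_movsb(mem, esi, edi, ecx, df=0):
    """Model REP MOVSB: copy ECX bytes, stepping by +1 (DF=0) or -1 (DF=1)."""
    step = -1 if df else 1
    while ecx != 0:
        mem[edi] = mem[esi]
        esi += step
        edi += step
        ecx -= 1
    return esi, edi, ecx

# Forward copy of six bytes into a separate buffer region (cld; rep movsb)
mem = bytearray(b"ABCDEF" + bytes(10))
regs = rep_movsb(mem, esi=0, edi=8, ecx=6)
forward_copy = bytes(mem[8:14])

# Overlapping move two bytes upward: start at the last byte with DF=1 (std)
buf = bytearray(b"ABCDEF\x00\x00")
rep_movsb(buf, esi=5, edi=7, ecx=6, df=1)
```

After the instruction finishes, the index registers point just past the last element processed and the counter is zero, which the returned tuple reflects.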
Arithmetic and Logic Instructions
The arithmetic and logic instructions in x86 assembly language form the core of integer computations performed by the arithmetic logic unit (ALU), operating on registers, memory, or immediate values while updating status flags in the EFLAGS register to indicate results such as zero, sign, carry, and overflow. These instructions support both unsigned and signed operations, with flag updates enabling conditional branching for error handling and flow control. Unlike data movement instructions, which merely transfer values, arithmetic and logic operations modify operands to produce new results, often with multi-byte handling for extended precision.[41]
Addition instructions include ADD, which adds the source operand to the destination operand and stores the result in the destination, setting the carry flag (CF) if there is a carry out of the most significant bit and the overflow flag (OF) for signed overflow. The ADC variant extends this by adding the carry flag from a previous operation, facilitating multi-precision arithmetic; for example, in 32-bit mode, ADC EAX, EBX adds EBX and CF to EAX, updating flags including auxiliary carry (AF) for BCD arithmetic. Both instructions affect parity (PF), sign (SF), and zero (ZF) flags based on the result, with operands sized from 8 to 64 bits depending on mode and prefixes.[41]
Subtraction mirrors addition with SUB, subtracting the source from the destination and storing the result in the destination, setting CF when a borrow occurs and OF on signed overflow. The SBB form subtracts the source and CF (as borrow) from the destination, essential for chained subtractions; for instance, SBB EAX, EBX computes EAX - EBX - CF and updates CF so that the borrow propagates to the next word of a multi-word subtraction. Both instructions set the status flags according to the arithmetic outcome, and both accept the LOCK prefix when the destination is a memory operand.[41]
Multiplication instructions handle unsigned and signed integers using the accumulator registers. MUL performs unsigned multiplication: for byte operands, it multiplies AL by the source and stores the 16-bit result in AX; for word, AX by source into DX:AX; and for doubleword, EAX by source into EDX:EAX, setting CF and OF if the high half is nonzero. The signed counterpart IMUL supports one, two, or three operands—for two-operand form, it multiplies source by destination (e.g., register or memory) and stores in destination, or for one-operand, accumulator by source into accumulator pair—setting CF and OF if the result does not fit in the destination (i.e., high bits are not sign-extended). In 64-bit mode, REX.W extends to RAX and RDX:RAX.[41]
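The split of a product across EDX:EAX, and the different overflow rule for two-operand IMUL, can be illustrated numerically. The following Python sketch uses hypothetical function names and models only the 32-bit forms.

```python
def mul32(eax, src):
    """Model MUL r/m32: unsigned EAX * src into EDX:EAX, returning the CF/OF bit."""
    product = (eax & 0xFFFFFFFF) * (src & 0xFFFFFFFF)
    edx, eax = product >> 32, product & 0xFFFFFFFF
    return edx, eax, int(edx != 0)       # CF = OF = 1 when the high half is nonzero

def to_signed(value, bits=32):
    """Interpret the low bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def imul32_two_operand(dst, src):
    """Model IMUL r32, r/m32: the result is truncated to 32 bits."""
    full = to_signed(dst) * to_signed(src)
    truncated = full & 0xFFFFFFFF
    overflow = int(to_signed(truncated) != full)   # CF = OF = 1 if bits were lost
    return truncated, overflow

big = mul32(0xFFFFFFFF, 2)                  # 4294967295 * 2
neg = imul32_two_operand(0xFFFFFFFF, 2)     # -1 * 2
lost = imul32_two_operand(0x40000000, 4)    # 2**30 * 4 does not fit
```

The same bit pattern 0xFFFFFFFF overflows under MUL but not under IMUL, because the signed interpretation of the product (-2) still fits in 32 bits.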
Division instructions divide the accumulator by the source, producing a quotient and a remainder and leaving the status flags undefined. DIV is unsigned: for byte, AX divided by source yields quotient in AL and remainder in AH; for word, DX:AX by source into AX (quotient) and DX (remainder); doubleword uses EDX:EAX similarly, raising a divide-error exception (#DE) on division by zero or quotient overflow. Signed division via IDIV follows the same register conventions but uses two's-complement arithmetic, also triggering #DE on invalid results like zero divisor or out-of-range quotient. Division is typically slower than multiplication, a legacy of the iterative algorithms in early implementations, though modern processors have narrowed the gap.[41]
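A common source of unexpected #DE exceptions is forgetting that the dividend is the full EDX:EAX pair, so a stale EDX can make the quotient too large for EAX. The sketch below models 32-bit DIV in Python; `div32` and the `DivideError` class are hypothetical stand-ins for the instruction and the processor exception.

```python
class DivideError(Exception):
    """Stands in for the #DE exception."""

def div32(edx, eax, src):
    """Model DIV r/m32: divide EDX:EAX by src, return (quotient, remainder)."""
    src &= 0xFFFFFFFF
    if src == 0:
        raise DivideError("division by zero")
    dividend = ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)
    quotient, remainder = divmod(dividend, src)
    if quotient > 0xFFFFFFFF:
        raise DivideError("quotient does not fit in EAX")
    return quotient, remainder           # new EAX, new EDX

simple = div32(0, 100, 7)       # xor edx, edx; mov eax, 100; div with divisor 7
wide = div32(1, 0, 2)           # 2**32 / 2 still fits in 32 bits
```

Clearing EDX before an unsigned division (or sign-extending with CDQ before IDIV) is the usual way to avoid the overflow case.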
Shift instructions manipulate bit positions for scaling, alignment, or extraction. SHL (or its synonym SAL) shifts the destination left by a count given in CL or as an immediate, filling with zeros and setting CF to the last bit shifted out; for single-bit shifts, OF indicates whether the sign bit changed. SHR shifts right logically, filling the high bits with zeros and setting CF to the last bit shifted out; for a single-bit shift, OF receives the original most significant bit. The arithmetic right shift SAR replicates the sign bit when filling, which divides a signed value by a power of two (rounding toward negative infinity), and clears OF on a single-bit shift. For counts greater than one, OF is undefined. The rotate variants ROL and ROR move bits circularly without loss, copying the bit that wraps around into CF; for example, ROL EAX, 1 rotates left, with CF receiving the original MSB. Shifts update SF, ZF, and PF and leave AF undefined, whereas rotates affect only CF and OF. The count is masked to 5 bits (6 bits for 64-bit operands), so a count of 32 applied to a 32-bit operand behaves as a count of 0.[41]
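The count masking and the carry-flag behavior are easy to get wrong, so a small model may help. The sketch below implements 32-bit SHL, SAR, and ROL in Python with hypothetical function names; each returns the result and the new CF, using `None` for CF when the masked count is zero and the flags are therefore left unchanged.

```python
def shl32(value, count):
    count &= 0x1F                        # count is masked to 5 bits
    if count == 0:
        return value & 0xFFFFFFFF, None  # flags unchanged
    wide = (value & 0xFFFFFFFF) << count
    return wide & 0xFFFFFFFF, (wide >> 32) & 1     # CF = last bit shifted out

def sar32(value, count):
    count &= 0x1F
    value &= 0xFFFFFFFF
    if count == 0:
        return value, None
    signed = value - (1 << 32) if value & 0x80000000 else value
    cf = (signed >> (count - 1)) & 1               # last bit shifted out
    return (signed >> count) & 0xFFFFFFFF, cf      # sign bit is replicated

def rol32(value, count):
    count &= 0x1F
    value &= 0xFFFFFFFF
    if count == 0:
        return value, None
    result = ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF
    return result, result & 1            # CF = bit rotated into the LSB

shifted = shl32(0x80000001, 1)
masked = shl32(1, 32)                    # count 32 is masked to 0
halved = sar32(0xFFFFFFF9, 1)            # -7 >> 1 gives -4, not -3
rotated = rol32(0x80000000, 1)
```

The `halved` case shows the rounding difference from IDIV, which would produce -3 for -7 divided by 2.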
Logical instructions perform bitwise operations, typically clearing CF and OF while setting other flags per result. AND computes the bitwise AND of source and destination, storing in destination and setting ZF if zero; it masks bits, useful for clearing flags or testing. OR performs bitwise OR, setting bits where either operand has a 1, and XOR exclusive-OR toggles differing bits—XOR EAX, EAX clears EAX to zero. NOT inverts all bits in the destination without flag changes, serving as a unary complement. TEST ANDs source and destination but discards the result, solely updating flags for conditional checks, such as TEST EAX, 1 to probe the least significant bit. These operate on any operand size and support memory access.[41]
Overflow handling relies on the OF flag, set by signed arithmetic instructions like ADD, SUB, IMUL when the result's sign differs from expected (e.g., positive + positive yielding negative). The JO instruction jumps if OF is 1, branching to an overflow handler, while JNO jumps if OF is 0 to continue normal execution; both use relative offsets (short or near) without modifying flags. For example, following ADD EAX, EBX, JO overflow_label detects signed overflow, ensuring program robustness in integer computations.[41]
; Example: Multi-precision addition with overflow check
ADD EAX, EBX ; Add low words, set flags
ADC EDX, ECX ; Add high words + carry
JO overflow_handler ; Jump if signed overflow
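The ADD/ADC pair in the example above can be traced with a model of the flag computation. The Python sketch below uses hypothetical helper names; it assumes 32-bit operands and computes only CF, ZF, SF, and OF, leaving out PF and AF for brevity.

```python
def add32(a, b, carry_in=0):
    """Model ADD (carry_in=0) or ADC (carry_in=CF) on 32-bit operands."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    total = a + b + carry_in
    result = total & 0xFFFFFFFF
    flags = {
        "CF": int(total > 0xFFFFFFFF),             # unsigned carry out
        "ZF": int(result == 0),
        "SF": result >> 31,
        # Signed overflow: both inputs share a sign that the result lacks.
        "OF": ((a ^ result) & (b ^ result)) >> 31,
    }
    return result, flags

def add64_via_halves(x, y):
    """Add two 64-bit values the way the ADD/ADC pair does."""
    lo, f = add32(x, y)                            # add eax, ebx
    hi, f = add32(x >> 32, y >> 32, f["CF"])       # adc edx, ecx
    return (hi << 32) | lo, f

total, final_flags = add64_via_halves(0x00000001FFFFFFFF, 1)
signed_overflow = add32(0x7FFFFFFF, 1)
unsigned_carry = add32(0xFFFFFFFF, 1)
```

Only the flags left by the final ADC describe the 64-bit result as a whole, which is why the JO test in the example follows the ADC rather than the ADD.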
Control Flow Instructions
Control flow instructions in x86 assembly language enable dynamic alteration of program execution by transferring control to different addresses, either unconditionally or based on processor flags set by prior arithmetic or logic operations. These instructions are essential for implementing conditional logic, procedure calls, loops, and interrupt handling in both IA-32 and Intel 64 architectures. They operate by modifying the instruction pointer (IP, EIP, or RIP) and, in some cases, the code segment register (CS), supporting both near transfers (within the same code segment) and far transfers (across segments in non-flat memory models like real or protected mode).[41]
Unconditional Transfers
Unconditional jumps, calls, and returns provide direct control flow changes without testing conditions. The JMP instruction transfers execution to a specified target address, either near (updating only IP/EIP/RIP) or far (also loading a new CS value in segmented modes). Near JMP supports immediate, register, or memory operands, while far JMP uses a pointer operand for segment:offset addressing. Neither variant affects flags. For example:
JMP rel32 ; Relative jump by 32-bit signed displacement
JMP FAR ptr16:32 ; Far jump to segment:offset
CALL pushes the return address onto the stack before transferring control, and RET pops it again, optionally releasing additional stack bytes:
CALL near_proc ; Near call, pushes EIP/RIP
RET 8 ; Near return, pops EIP/RIP and adds 8 to ESP/RSP
Conditional Branches
Conditional jump instructions (Jcc) branch to a target only if a specific flag condition holds, which is how if-then-else constructs are implemented. They take a signed relative displacement (8, 16, or 32 bits) and do not alter the flags themselves. Common variants include JZ (taken if ZF=1, for example after a CMP of equal operands) and JNZ (ZF=0); JC (CF=1, for example after an unsigned addition that carries out) and JNC (CF=0); and the signed comparisons JG (greater: ZF=0 and SF=OF) and JL (less: SF≠OF). For instance:
CMP EAX, EBX ; Sets flags based on EAX - EBX
JG greater ; Jump if EAX > EBX (signed)
JNZ not_equal ; Otherwise jump if EAX != EBX
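To make the flag conditions concrete, the following Python sketch models the flags that a 32-bit CMP produces and evaluates a few Jcc conditions against them. It is a simplified model written for this article, not processor documentation; the helper names cmp_flags and taken are assumptions.

```python
MASK32 = 0xFFFFFFFF

def cmp_flags(a, b):
    """Model the flags set by CMP a, b on 32-bit operands (computes a - b)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "ZF": result == 0,
        "CF": a < b,                                  # unsigned borrow
        "SF": bool(result >> 31),                     # sign bit of the result
        # Signed overflow: the operands differ in sign and the result's sign differs from a's.
        "OF": bool(((a ^ b) & (a ^ result)) >> 31),
    }

def taken(mnemonic, f):
    """Return True if the given conditional jump would be taken."""
    conditions = {
        "JZ": f["ZF"],
        "JNZ": not f["ZF"],
        "JC": f["CF"],
        "JNC": not f["CF"],
        "JG": (not f["ZF"]) and f["SF"] == f["OF"],
        "JL": f["SF"] != f["OF"],
    }
    return conditions[mnemonic]

# 0x7FFFFFFF versus 0xFFFFFFFF: greater as signed values, smaller as unsigned values.
flags = cmp_flags(0x7FFFFFFF, 0xFFFFFFFF)
print(taken("JG", flags), taken("JC", flags))   # True True
```

The example shows why signed and unsigned comparisons need different jumps: the same CMP result satisfies JG (signed greater) while also setting the carry flag (unsigned below).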
Loops
Loop instructions combine a counter decrement with a conditional jump. The LOOP instruction decrements ECX (or RCX with a 64-bit address size, or CX with a 16-bit address size) and jumps to the target if the counter is then non-zero. It neither tests nor modifies the flags, takes an 8-bit signed relative displacement, and is available in both IA-32 and Intel 64 modes. Example:
MOV ECX, 10 ; Set loop count
loop_start:
; Loop body
LOOP loop_start ; Decrement ECX, jump if ECX != 0
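The REP family of prefixes applies the same counted repetition to string instructions, repeating the instruction and decrementing ECX until it reaches zero; REPE/REPZ and REPNE/REPNZ additionally stop when the zero flag no longer matches the prefix's condition:
REP MOVSB ; Copy ECX bytes from [ESI] to [EDI]
REPE CMPSB ; Compare bytes until mismatch or ECX=0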
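The register updates performed by the repeated string instructions can be sketched in Python. This is a simplified model that assumes the direction flag is clear (addresses increase) and uses a flat bytearray in place of segmented memory; the function names are chosen for this sketch.

```python
def rep_movsb(mem, esi, edi, ecx):
    """REP MOVSB with DF=0: copy ECX bytes from mem[esi] to mem[edi], low to high."""
    while ecx != 0:
        mem[edi] = mem[esi]
        esi, edi, ecx = esi + 1, edi + 1, ecx - 1
    return esi, edi, ecx

def repe_cmpsb(mem, esi, edi, ecx, zf):
    """REPE CMPSB with DF=0: compare while the bytes match and ECX is non-zero.

    zf is the incoming zero flag; it is returned unchanged when ECX starts at 0.
    """
    while ecx != 0:
        zf = mem[esi] == mem[edi]
        esi, edi, ecx = esi + 1, edi + 1, ecx - 1
        if not zf:
            break
    return esi, edi, ecx, zf

memory = bytearray(b"abcdabxd")
# Compare the 4 bytes at offset 0 with the 4 bytes at offset 4.
print(repe_cmpsb(memory, 0, 4, 4, False))   # (3, 7, 1, False)
```

Note that the pointers have already advanced past the mismatching byte when the loop stops, so code that needs the position of the mismatch subtracts one from ESI or EDI.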
Interrupts
Interrupt instructions raise software interrupts and return from their handlers. INT n pushes the current flags, CS, and EIP/RIP onto the stack and transfers control to the handler for vector n (0-255), given as an 8-bit immediate. In real mode the vector indexes the interrupt vector table and the interrupt flag (IF) is cleared; in protected and long mode it indexes the interrupt descriptor table (IDT), and IF is cleared only when the descriptor is an interrupt gate rather than a trap gate. A handler returns with IRET, which pops the saved instruction pointer, CS, and flags. Example:
INT 21h ; DOS interrupt (legacy)
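How the vector number selects a handler depends on the mode. In real mode each entry of the interrupt vector table is 4 bytes (a 16-bit offset followed by a 16-bit segment) starting at address 0, while IDT gate descriptors are 8 bytes in protected mode and 16 bytes in long mode. The Python sketch below only does this address arithmetic; the function names are assumptions for the example.

```python
def ivt_entry_address(vector):
    """Real mode: address of the 4-byte offset:segment entry for a vector."""
    if not 0 <= vector <= 255:
        raise ValueError("vector must be 0-255")
    return vector * 4

def real_mode_linear(segment, offset):
    """Real mode: a segment:offset pair addresses byte segment * 16 + offset."""
    return (segment << 4) + offset

def idt_entry_offset(vector, long_mode):
    """Protected/long mode: byte offset of the gate descriptor within the IDT."""
    if not 0 <= vector <= 255:
        raise ValueError("vector must be 0-255")
    return vector * (16 if long_mode else 8)

print(hex(ivt_entry_address(0x21)))        # 0x84: where real mode looks up INT 21h
print(hex(idt_entry_offset(0x80, False)))  # 0x400: protected-mode gate for INT 80h
```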
Stack Instructions
The stack in the x86 architecture is a last-in, first-out (LIFO) data structure used mainly for temporary storage during procedure calls, local variable allocation, and parameter passing. Stack instructions manage it by adjusting the stack pointer (SP, ESP, or RSP, depending on mode) and by building stack frames in function prologues and epilogues. The stack resides in the stack segment (SS) and is addressed implicitly, so these instructions need no explicit address calculation.[42]
The PUSH instruction decrements the stack pointer by the operand size (2, 4, or 8 bytes) and stores the source operand at the new top of the stack. For example, in 32-bit mode, PUSH EAX first subtracts 4 from ESP, then writes the value of EAX to memory at [ESP]. PUSH accepts an immediate value, a register, or a memory operand and does not affect the flags register. PUSHF (or PUSHFD/PUSHFQ) pushes the flags register, for example to preserve it across code that changes the flags. PUSHA/PUSHAD pushes all eight general-purpose registers with a single instruction in 16- and 32-bit modes; it is not available in 64-bit mode, where registers must be pushed individually.[42]
Conversely, the POP instruction loads the value at the top of the stack into the destination operand and then increments the stack pointer by the operand size. For instance, in 32-bit mode POP EAX reads the 4-byte value at [ESP] into EAX and adds 4 to ESP. POP accepts a register or memory destination but cannot pop into the CS segment register; a far RET is used for control transfers that reload CS. POPF (or POPFD/POPFQ) restores the flags register, and POPA/POPAD restores the general-purpose registers saved by PUSHA/PUSHAD (again only in 16- and 32-bit modes; the saved stack pointer value is discarded). Apart from POPF, these instructions do not modify the flags.[43]
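The order of operations in PUSH and POP (adjust then store, load then adjust) can be modelled in a few lines of Python. This sketch assumes 32-bit operands and a small private memory array; the class name Stack32 is hypothetical.

```python
import struct

class Stack32:
    """Toy model of a 32-bit descending stack."""

    def __init__(self, size=64):
        self.mem = bytearray(size)
        self.esp = size            # empty stack: ESP points just past the top of memory

    def push(self, value):
        self.esp -= 4              # PUSH decrements ESP first...
        struct.pack_into("<I", self.mem, self.esp, value & 0xFFFFFFFF)  # ...then stores

    def pop(self):
        (value,) = struct.unpack_from("<I", self.mem, self.esp)        # POP loads first...
        self.esp += 4              # ...then increments ESP
        return value

stack = Stack32()
stack.push(0x11223344)
print(stack.esp, stack.mem[60:64].hex())   # 60 44332211 (little-endian in memory)
print(hex(stack.pop()), stack.esp)         # 0x11223344 64
```

The popped value is not erased from memory; only ESP moves, which is why data below the stack pointer must be treated as undefined.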
For procedure management, the ENTER instruction establishes a stack frame: it pushes the frame pointer (EBP/RBP), sets the frame pointer to the new frame base, and allocates space for local variables. It takes two operands: the allocation size in bytes and a lexical nesting level (0-31), which supports languages with nested procedures, such as Pascal, by copying enclosing frame pointers into the new frame. The companion LEAVE instruction reverses this by restoring the stack pointer from the frame pointer (MOV ESP, EBP) and popping EBP, which deallocates the frame just before a RET. The pair is more compact than the manual PUSH/MOV/SUB and MOV/POP sequences, though compilers generally emit the manual prologue because ENTER is slower on most processors. For example, ENTER 8, 0 in 32-bit mode pushes EBP, sets EBP to ESP, and subtracts 8 from ESP, making room for two local doublewords.[44]
In 64-bit mode under the System V ABI (used on Linux and other Unix-like systems), the stack pointer (RSP) must be 16-byte aligned immediately before each CALL. Because CALL pushes an 8-byte return address, RSP is 8 bytes short of a 16-byte boundary on function entry (RSP + 8 is a multiple of 16), so a function that makes further calls must move RSP by an odd multiple of 8 bytes in total, for example with a single PUSH RBP or SUB RSP, 8. Misalignment can degrade performance or raise a fault in instructions that require aligned operands, such as MOVAPS.[45]
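Under the System V x86-64 ABI, RSP is a multiple of 16 just before a CALL, so on function entry the 8-byte return address leaves it 8 bytes off a 16-byte boundary. The arithmetic a prologue has to do can be sketched as follows; the function name stack_adjustment is an assumption, and the sketch considers only 8-byte register pushes and a single SUB RSP, n.

```python
def stack_adjustment(locals_size, pushed_regs=0):
    """Bytes to subtract from RSP so that it is 16-byte aligned before the next CALL.

    On entry the return address occupies 8 bytes; each pushed register adds 8 more.
    """
    occupied = 8 + 8 * pushed_regs
    padding = -(occupied + locals_size) % 16
    return locals_size + padding

print(stack_adjustment(0))       # 8: a leaf-to-caller function needs SUB RSP, 8
print(stack_adjustment(0, 1))    # 0: PUSH RBP alone restores alignment
print(stack_adjustment(24, 1))   # 32: 24 bytes of locals are rounded up
```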
Stack overflow occurs when a PUSH, CALL, or ENTER moves the stack pointer beyond the memory reserved for the stack. In segmented configurations, exceeding the stack segment limit raises a stack-segment fault (#SS); in the flat, paged memory models used by modern operating systems, the access typically lands on an unmapped guard page and raises a page fault (#PF). Popping more than was pushed does not fault by itself; it reads whatever lies above the stack and faults only if that memory is inaccessible. These conditions are detected through segment descriptors and page tables rather than EFLAGS bits such as overflow (OF) or carry (CF), which apply only to arithmetic. The operating system handles the fault, for example by growing the stack or terminating the process.
Floating-Point Instructions
The x87 floating-point unit (FPU) provides scalar floating-point operations in x86 assembly language. It originated as the separate 8087 coprocessor and has been integrated into the CPU since the 80486DX. It uses a stack-based architecture with eight 80-bit registers, ST(0) through ST(7), where ST(0) is the top of the stack (TOS). Each register holds a value in extended-precision format: a 1-bit sign, a 15-bit biased exponent, and a 64-bit significand with an explicit integer bit (1 for normalized numbers). The TOP field in bits 11-13 of the FPU status word identifies the physical register that is currently ST(0), so operands are addressed relative to the top of the stack. The tag word records the content type of each register (valid, zero, special, or empty) for use in stack-fault detection and exception handling.[46]
Basic arithmetic instructions operate on ST(0) and one other operand, either another stack register or a value in memory. FADD adds the source to the destination and stores the sum in the destination: FADD ST(0), ST(1) computes ST(0) + ST(1) and places it in ST(0), whereas FADD ST(1), ST(0) places the same sum in ST(1); with a memory source, the destination is always ST(0). FSUB, FMUL, and FDIV follow the same pattern for subtraction, multiplication, and division. Popping variants such as FADDP store the result in ST(i) and then pop the stack, so after FADDP ST(1), ST(0) the sum is the new ST(0). Memory operands may be single (32-bit) or double (64-bit) precision; they are converted to the FPU's internal 80-bit format before the operation, which reduces intermediate rounding error. Opcodes vary by operand type, such as D8 /0 for FADD with a 32-bit memory operand and DC C0+i for FADD ST(i), ST(0).[41]
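The 80-bit extended-precision layout (64-bit significand with an explicit integer bit, then a 15-bit exponent biased by 16383, then the sign bit) can be decoded by hand. The Python sketch below is illustrative only: it handles zero and normal numbers, rejects infinities and NaNs, and rounds to a Python float, so it cannot represent the full extended range. The function name is an assumption.

```python
import math

def decode_x87_extended(raw):
    """Decode 10 little-endian bytes of x87 extended precision into a float."""
    if len(raw) != 10:
        raise ValueError("extended precision values are 10 bytes long")
    significand = int.from_bytes(raw[:8], "little")   # bit 63 is the explicit integer bit
    sign_exp = int.from_bytes(raw[8:], "little")
    sign = -1.0 if sign_exp & 0x8000 else 1.0
    exponent = sign_exp & 0x7FFF
    if exponent == 0x7FFF:
        raise ValueError("infinity and NaN are not handled in this sketch")
    # The significand is an integer scaled by 2**-63; the exponent bias is 16383.
    return sign * math.ldexp(significand, exponent - 16383 - 63)

# 1.0: exponent 0x3FFF, significand 0x8000000000000000
print(decode_x87_extended(bytes.fromhex("0000000000000080ff3f")))   # 1.0
# -2.5: sign set, exponent 0x4000, significand 0xA000000000000000
print(decode_x87_extended(bytes.fromhex("00000000000000a000c0")))   # -2.5
```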
For storing results, the FST instruction copies ST(0) to a destination without altering the stack; for example, FST m64fp writes ST(0) to memory as a 64-bit double-precision value. The popping variant FSTP additionally pops the register stack, marking the old ST(0) as empty and incrementing TOP. Values are rounded to the IEEE 754 single or double format when stored to memory in those sizes, while internal computations retain extended precision.
Transcendental instructions compute specialized functions on the top of the stack. FSIN replaces ST(0) with its sine, taking the argument in radians; if the argument lies outside the range -2^63 to +2^63, the instruction leaves ST(0) unchanged and sets the C2 flag. FCOS does likewise for cosine. FPATAN computes the arctangent of ST(1)/ST(0), stores it in ST(1), and pops the stack, which is useful for angle computations; Intel specifies its accuracy as better than 1 ulp on Pentium and later processors.[41][46]
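The way TOP moves as values are loaded, combined, and popped is easier to see in a small model. The Python class below is a sketch of the register stack only: it uses Python floats instead of 80-bit values, represents an empty tag with None, and ignores exceptions. The class and method names are assumptions.

```python
class X87Stack:
    """Toy model of the x87 register stack: 8 physical registers plus TOP."""

    def __init__(self):
        self.regs = [None] * 8     # None stands for a register tagged empty
        self.top = 0

    def _phys(self, i):
        return (self.top + i) % 8  # ST(i) is relative to TOP

    def st(self, i):
        return self.regs[self._phys(i)]

    def fld(self, value):          # push: decrement TOP, then write ST(0)
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def _pop(self):                # pop: tag ST(0) empty, then increment TOP
        self.regs[self.top] = None
        self.top = (self.top + 1) % 8

    def fadd(self, dst, src):      # FADD ST(dst), ST(src)
        self.regs[self._phys(dst)] = self.st(dst) + self.st(src)

    def faddp(self, dst=1):        # FADDP ST(dst), ST(0)
        self.fadd(dst, 0)
        self._pop()

    def fstp(self):                # FSTP: return ST(0) and pop
        value = self.st(0)
        self._pop()
        return value

fpu = X87Stack()
fpu.fld(1.5)
fpu.fld(2.0)
fpu.faddp()                        # ST(1) = 1.5 + 2.0, then pop
print(fpu.st(0), fpu.top)          # 3.5 7
```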
Comparison instructions such as FCOM compare ST(0) with a source operand and report the result in condition codes C3, C2, and C0 of the status word: all three are 0 if ST(0) > source; C0=1 alone indicates ST(0) < source; C3=1 alone indicates equality; and all three are 1 if the operands are unordered (at least one is NaN). For instance, FCOM ST(1) compares ST(0) with ST(1) and raises an invalid-operation exception if either is NaN. To branch on the result, FSTSW AX copies the status word to AX and SAHF loads AH into the flags, mapping C0 to CF, C2 to PF, and C3 to ZF, so that the unsigned conditional jumps (JA, JB, JE) and JP can be used.
Control instructions manage FPU state: FINIT initializes the FPU by setting the control word to 037FH (all exceptions masked, round to nearest, 64-bit precision), clearing the status word, and tagging all registers as empty; FCLEX clears the exception flags in the status word after first checking for pending unmasked exceptions, while FNCLEX clears them without that check.[41]
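The path from an x87 comparison to a conditional jump goes through the status word: the condition codes sit at bits 14 (C3), 10 (C2), and 8 (C0), and after FSTSW AX followed by SAHF they land in ZF, PF, and CF respectively. A Python sketch of that mapping, with function names assumed for the example:

```python
import math

def fcom(st0, src):
    """Return (C3, C2, C0) as set by comparing ST(0) with a source operand."""
    if math.isnan(st0) or math.isnan(src):
        return (1, 1, 1)           # unordered
    if st0 > src:
        return (0, 0, 0)
    if st0 < src:
        return (0, 0, 1)
    return (1, 0, 0)               # equal

def status_word(c3, c2, c0):
    """Place the condition codes at their bit positions in the status word."""
    return (c3 << 14) | (c2 << 10) | (c0 << 8)

def sahf_flags(sw):
    """Model FSTSW AX followed by SAHF: AH bits 6, 2, 0 become ZF, PF, CF."""
    ah = (sw >> 8) & 0xFF
    return {"ZF": bool(ah & 0x40), "PF": bool(ah & 0x04), "CF": bool(ah & 0x01)}

flags = sahf_flags(status_word(*fcom(1.0, 2.0)))
print(flags)   # {'ZF': False, 'PF': False, 'CF': True}, so JB would be taken
```

Because "less than" arrives in CF rather than in SF and OF, floating-point results are tested with the unsigned jump mnemonics.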
| Instruction | Primary Operation | Key Flags/Effects | Example Usage |
|---|---|---|---|
| FADD | Addition | C1 indicates rounding direction for inexact results | FADD ST(0), ST(2) (ST(0) += ST(2)) |
| FSUB | Subtraction | As above | FSUB m32fp (ST(0) -= memory); FSUBR m32fp reverses the operands (ST(0) = memory - ST(0)) |
| FMUL | Multiplication | As above | FMULP ST(1), ST(0) (ST(1) *= ST(0), then pop) |
| FDIV | Division | As above | FDIV ST(3), ST(0) (ST(3) /= ST(0)) |
| FST | Store TOS | No stack pop | FST m64fp (store ST(0) to memory as a double) |
| FSIN | Sine | C2=1 if out-of-range | FSIN (ST(0) = sin(ST(0))) |
| FCOS | Cosine | C2=1 if the magnitude of ST(0) ≥ 2^63 | FCOS (ST(0) = cos(ST(0))) |
| FPATAN | Arctangent | Pops stack | FPATAN (ST(1) = atan(ST(1)/ST(0)), then pop) |
| FCOM | Compare | Sets C0/C2/C3 | FCOM m64fp (compare ST(0) with a double in memory) |
| FINIT | Initialize | Resets to default | FINIT (clear exceptions, empty stack) |
| FCLEX | Clear exceptions | Clears flags | FCLEX (reset after error) |
SIMD Instructions
SIMD (Single Instruction, Multiple Data) instructions in x86 assembly language apply one operation to several data elements at once, which improves performance for vectorizable computations. These extensions go beyond the scalar x87 floating-point instructions by introducing wider registers and operations on packed data types, both integer and floating-point. They have been introduced progressively since the late 1990s.[41]
The earliest SIMD extension, MMX (MultiMedia eXtension), introduced in 1997, operates on packed integers in 64-bit MMX registers (MM0 through MM7, which alias the x87 FPU registers). A register holds 8 packed bytes, 4 packed words, 2 packed doublewords, or a single quadword, with instructions such as PADDB (add packed bytes with wraparound; PADDSB and PADDUSB are the saturating forms), PMULHW (multiply packed words, keeping the high half of each product), and MOVQ (move quadword). MMX provides parallel integer arithmetic, logical operations, and pack/unpack operations for multimedia tasks such as image processing, but because of the register aliasing, EMMS must be executed after MMX code to mark the FPU registers empty before floating-point code runs. It laid the groundwork for later SIMD sets but is limited to 64-bit width.[41]
The first SIMD extension for floating-point, Streaming SIMD Extensions (SSE), uses 128-bit XMM registers (XMM0 through XMM7, extended to XMM15 in 64-bit mode) to hold packed data. SSE operates on vectors of four single-precision (32-bit) floating-point values; SSE2 later added double-precision and 128-bit integer vectors. Key instructions include MOVAPS for aligned moves of packed single-precision values and ADDPS for element-wise addition of such vectors. For example, the instruction ADDPS xmm1, xmm2 adds the packed single-precision values in xmm2 to those in xmm1, storing the result in xmm1. SSE instructions use the legacy (non-VEX) encoding, in which the destination register is also the first source operand.[41]
Advanced Vector Extensions (AVX) extend SIMD to 256-bit YMM registers (YMM0 through YMM15 in 64-bit mode), doubling the vector width. AVX instructions are encoded with a VEX prefix (2 or 3 bytes), which specifies the vector length and replaces the legacy SSE prefix and escape bytes. The VEX.vvvv field encodes an additional source register, which gives most instructions a non-destructive three-operand form. For instance, VADDPD ymm1, ymm2, ymm3 adds the four packed double-precision (64-bit) values in ymm2 and ymm3 and writes the sums to ymm1. AVX itself widens only floating-point operations; AVX2 extends integer instructions to 256 bits, such as VPACKSSDW ymm1, ymm2, ymm3, which packs signed doublewords into signed words with saturation, and VPSHUFB ymm1, ymm2, ymm3, which shuffles bytes according to a control mask within each 128-bit lane, enabling byte-level reordering.[41]
AVX-512 widens vectors to 512-bit ZMM registers (ZMM0 through ZMM31 in 64-bit mode), holding up to 16 single-precision or 8 double-precision elements per operation. It introduces the EVEX encoding (a 4-byte prefix) for features such as write masking, in which an opmask register (k1 through k7) selects the elements to be written, with {z} zeroing the unselected elements instead of preserving them, and broadcasting of a single memory element to all elements of a vector. The instruction VGATHERDPD gathers double-precision values from memory using 32-bit indices (e.g., VGATHERDPD zmm1 {k1}, vm32y), which supports sparse access to irregular data. AVX-512 also extends earlier instructions, so that VADDPD, for example, accepts ZMM operands with masking.[41]
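Write masking is easiest to understand element by element. The following Python sketch models a masked vector addition with merging and zeroing behaviour, and a masked gather, using lists in place of ZMM registers and an integer bit mask in place of a k register. It illustrates the semantics only; the function names are assumptions.

```python
def masked_add(dst, a, b, k, zeroing=False):
    """Element-wise a + b where bit i of k is set; other elements merge or zero."""
    result = []
    for i, (x, y) in enumerate(zip(a, b)):
        if (k >> i) & 1:
            result.append(x + y)
        else:
            result.append(0.0 if zeroing else dst[i])   # {z} zeroes, otherwise keep dst
    return result

def masked_gather(dst, memory, indices, k):
    """Load memory[indices[i]] into element i where bit i of k is set; keep dst otherwise."""
    return [memory[idx] if (k >> i) & 1 else dst[i] for i, idx in enumerate(indices)]

old = [9.0, 9.0, 9.0, 9.0]
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(masked_add(old, a, b, 0b0101))                # [11.0, 9.0, 33.0, 9.0]
print(masked_add(old, a, b, 0b0101, zeroing=True))  # [11.0, 0.0, 33.0, 0.0]
```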
These SIMD instructions are used mainly in multimedia applications, where parallel operations accelerate video encoding, image filtering, and audio processing; for instance, ADDPS can adjust several pixel values at once and PSHUFB can swap color channels. In machine learning, they speed up vectorized computations such as matrix addition (VADDPD) and gather operations (VGATHERDPD) on large datasets.[41]
| Extension | Register Width | Key Registers | Encoding | Example Vector Capacity (Single-Precision Float) |
|---|---|---|---|---|
| SSE | 128-bit | XMM0-XMM15 | Legacy SSE | 4 elements |
| AVX | 256-bit | YMM0-YMM15 | VEX | 8 elements |
| AVX-512 | 512-bit | ZMM0-ZMM31 | EVEX | 16 elements |
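Two ideas from this section lend themselves to a short numeric illustration: the number of elements that fit in a register of a given width, and the difference between wraparound and saturating packed addition on unsigned bytes. The Python sketch below models both with plain lists; the function names mirror the mnemonics but are otherwise assumptions of this example.

```python
def lanes(register_bits, element_bits):
    """Number of packed elements that fit in one vector register."""
    return register_bits // element_bits

def paddb(a, b):
    """Packed byte add with wraparound (modulo 256), as PADDB does."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]

def paddusb(a, b):
    """Packed unsigned byte add with saturation at 255, as PADDUSB does."""
    return [min(x + y, 0xFF) for x, y in zip(a, b)]

print(lanes(128, 32), lanes(256, 64), lanes(512, 32))   # 4 4 16
print(paddb([250, 1], [10, 2]))                          # [4, 3]
print(paddusb([250, 1], [10, 2]))                        # [255, 3]
```

Saturation is usually what image code wants: brightening a pixel value of 250 by 10 should clamp to white rather than wrap around to a near-black 4.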
Program Flow and Examples
Program Flow Control
In x86 assembly language, program flow control covers the mechanisms that structure execution beyond straight-line sequencing: subroutine management, asynchronous event handling, and conditional logic. These features enable modular programming, response to hardware events, and error recovery, in software ranging from operating systems to embedded applications. Procedures provide reusable code blocks, while interrupts and exceptions provide entry points for system-level events; both rely on stack-based control transfers.
Procedures are defined using assembler-specific directives and invoked via the CALL and RET instructions, which use the stack to preserve the return address. In the Microsoft Macro Assembler (MASM), a procedure is delimited by the PROC and ENDP directives, which mark its entry point and end. For instance, a simple procedure might be structured as follows:
MyProc PROC
; procedure body
ret
MyProc ENDP
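Conditional logic and loops are built from an instruction that sets the flags followed by a conditional jump. A counted loop, for example, can decrement a register and jump back while the result is non-zero:
mov ecx, 10 ; loop counter
loop_start:
; loop body
dec ecx ; sets ZF when ECX reaches zero
jnz loop_start ; jump if not zero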
Basic Hello World Programs
A basic "Hello World" program in x86 assembly demonstrates output and program termination for a particular operating environment. The examples below show how assembly code asks the system to print text, and how calling conventions, system calls, and linking requirements differ across platforms. The programs are kept minimal to focus on data declaration, register usage, and invocation of OS services.
For 16-bit MS-DOS using MASM syntax, the program uses DOS interrupt 21h with function 09h in AH to print a string terminated by '$', followed by function 4Ch to terminate the program. The .model small directive selects the small memory model, with one code segment and one data segment.[47]
; hello.asm - 16-bit MS-DOS Hello World in MASM
.model small
.stack 128
.data
Msg db 'Hello, World!', 13, 10, '$' ; Message with CR/LF and '$' terminator
.code
start:
mov ax, @data ; DS does not point at the data segment on entry
mov ds, ax
mov ah, 09h ; DOS function 09h: print '$'-terminated string at DS:DX
lea dx, Msg
int 21h
mov ax, 4C00h ; DOS function 4Ch: terminate with return code 0 in AL
int 21h
end start
Assemble and link with ml hello.asm to produce the executable hello.exe. This runs in real mode on MS-DOS or compatible emulators.[24]
In 32-bit Windows using MASM syntax, a graphical "Hello World" can invoke MessageBoxA from user32.dll to display the message in a dialog box, with ExitProcess from kernel32.dll for termination. The .model flat directive enables flat memory addressing, and the program follows the stdcall calling convention.
; hello.asm - 32-bit Windows Hello World in MASM with MessageBoxA
.386
.model flat, stdcall
option casemap:none
include windows.inc
include kernel32.inc
include user32.inc
includelib kernel32.lib
includelib user32.lib
.data
titleMsg db 'x86 Assembly', 0
msg db 'Hello, World!', 0
.code
Main:
push 0 ; MB_OK
push offset titleMsg ; Caption
push offset msg ; Text
push 0 ; HWND_DESKTOP
call MessageBoxA
push 0
call ExitProcess
end Main
Assemble with ml /c /coff hello.asm and link with link /subsystem:windows hello.obj user32.lib kernel32.lib /entry:Main /libpath:"C:\path\to\libs" to generate hello.exe.[48]
For 32-bit Linux using NASM syntax, the program uses system call 4 (sys_write) via INT 80h to output to stdout (file descriptor 1), with arguments in EBX (descriptor), ECX (buffer), and EDX (length), followed by system call 1 (sys_exit) with EBX as the exit code. No external libraries are required beyond the kernel.
; hello.asm - 32-bit Linux Hello World in NASM
SECTION .data
msg db 'Hello, World!', 10
msgLen equ $ - msg
SECTION .text
global _start
_start:
mov eax, 4 ; sys_write
mov ebx, 1 ; stdout
mov ecx, msg ; buffer
mov edx, msgLen ; length
int 80h
mov eax, 1 ; sys_exit
mov ebx, 0 ; exit code
int 80h
Assemble with nasm -f elf32 hello.asm -o hello.o and link with ld -m elf_i386 hello.o -o hello to produce the executable.[49]
In 64-bit Linux using NASM syntax, a higher-level approach links against libc and calls printf for formatted output. Under the x86-64 System V ABI the first integer argument is passed in RDI, AL holds the number of vector registers used by a variadic call, and the stack must be 16-byte aligned at each call. The program refers to the string with RIP-relative addressing and calls through the PLT, so it can be linked as a position-independent executable.
; hello.asm - 64-bit Linux Hello World in NASM with printf
extern printf
extern exit
SECTION .data
msg db 'Hello, World!', 10, 0
SECTION .text
global main
main:
sub rsp, 8 ; Realign the stack to 16 bytes before calling
lea rdi, [rel msg] ; Format string in RDI (RIP-relative)
xor eax, eax ; AL = 0: no vector register arguments
call printf wrt ..plt
xor edi, edi ; Exit status 0
call exit wrt ..plt
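Assemble with nasm -f elf64 hello.asm -o hello.o and link with gcc hello.o -o hello, which supplies the C runtime startup code that calls main and links libc dynamically. Linking directly with ld (ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 hello.o -lc -o hello) omits that startup code, so the program would then need to define its own _start entry point.[50]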
Advanced Usage Examples
Advanced usage of x86 assembly language often involves low-level manipulation of processor state and hardware, as in position-independent executables, dynamic code generation, and custom interrupt processing. These techniques use specific instructions to work with the flags, the instruction pointer, and system events, and need care to remain correct across processor generations.[41]
Flag manipulation is central to conditional control in performance-critical loops. The ADD instruction adds the source operand to the destination and stores the result in the destination, setting the carry flag (CF) if an unsigned addition carries out of the most significant bit and the zero flag (ZF) if the result is zero.[41] A following JC (jump if carry) branches if CF is set, which detects unsigned overflow without an extra comparison.[41] For instance, the following loop increments an accumulator until it overflows; because the accumulator starts at the maximum 32-bit value, the first ADD wraps it to zero and the jump is taken immediately:
mov eax, 0xFFFFFFFF ; Initialize accumulator to max unsigned 32-bit value
mov ecx, 10 ; Loop counter
loop_start:
add eax, 1 ; Increment; sets CF on unsigned overflow
jc overflow_handler ; Jump if carry (overflow)
dec ecx
jnz loop_start
jmp done ; No overflow: skip the handler
overflow_handler:
; Handle wrap-around
done:
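The effect of ADD on CF and ZF can be checked with a few lines of Python. This sketch models only 32-bit unsigned addition and the two flags discussed here; the function name add32 is an assumption.

```python
MASK32 = 0xFFFFFFFF

def add32(dst, src):
    """Model ADD on 32-bit operands; returns (result, CF, ZF)."""
    total = (dst & MASK32) + (src & MASK32)
    result = total & MASK32
    return result, total > MASK32, result == 0

print(add32(0xFFFFFFFF, 1))   # (0, True, True): wraps to zero with carry
print(add32(1, 2))            # (3, False, False)
```

With the accumulator initialized to 0xFFFFFFFF as in the assembly example, the very first addition sets CF, so JC is taken on the first iteration.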
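Position-independent code (PIC) must locate its data without relying on absolute addresses. In 64-bit mode, RIP-relative addressing makes this direct: in NASM syntax, lea rbx, [rel $] loads the address of the current instruction into RBX, giving the code a base address from which nearby data can be reached wherever the binary is loaded. (32-bit mode has no EIP-relative addressing, so 32-bit PIC typically obtains its own address with a CALL followed by a POP.)[41] An example in 64-bit PIC code that reaches data at a fixed offset from the current instruction: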
lea rbx, [rel $] ; Load current RIP-relative position into RBX
add rbx, data_offset ; Adjust to target data location (offset computed at link time)
mov rax, [rbx] ; Access data at runtime-independent address
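Self-modifying code overwrites instruction bytes at run time. After the store, the processor must not execute stale bytes that it has already fetched; executing a serializing instruction such as CPUID, or jumping to the modified code, ensures this. The following simplified sketch assumes that original_code is located at address 0x1000 in a writable page:
jmp modify_code ; Jump to modifier
original_code: nop ; Placeholder instruction at address 0x1000 (example)
modify_code:
mov byte [0x1000], 0x50 ; Patch NOP (0x90) to PUSH AX (0x50) - simplistic example
cpuid ; Serialize: discard prefetched instructions (clobbers EAX, EBX, ECX, EDX)
jmp 0x1000 ; Resume at modified code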
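A custom interrupt handler saves the registers it uses, services the device, signals end-of-interrupt to the interrupt controller, and returns with IRET. The following 32-bit protected-mode sketch handles the PC keyboard interrupt (IRQ 1) by reading a scancode from port 0x60 and acknowledging the 8259 PIC; key_buffer is a byte variable assumed to be defined elsewhere:
keyboard_handler:
pushad ; Save registers
in al, 0x60 ; Read scancode from keyboard controller
; Process scancode (e.g., map to ASCII)
mov [key_buffer], al ; Store in buffer
mov al, 0x20 ; EOI command for the PIC
out 0x20, al ; Acknowledge interrupt
popad
iret ; Return, restoring EIP, CS, and EFLAGS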
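In 64-bit Linux, the SYSCALL instruction replaces INT 80h as the system call entry. The call number is placed in RAX and the first three arguments in RDI, RSI, and RDX; the processor saves the return address in RCX and RFLAGS in R11, so both registers are overwritten. The fragment assumes that msg and len are a buffer label and its length defined elsewhere:
mov rax, 1 ; Syscall number: write
mov rdi, 1 ; File descriptor: stdout
mov rsi, msg ; Buffer address
mov rdx, len ; Length
syscall ; Invoke; RCX = saved RIP, R11 = saved RFLAGS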
