DOS MZ executable
| DOS MZ executable | |
|---|---|
| Filename extension | .exe |
| Internet media type | application/x-dosexec, application/x-msdos-program, application/x-ms-dos-executable |
| Magic number | 4D 5A (MZ in ASCII) |
| Type of format | Binary, executable |
| Extended to | New Executable, Linear Executable, Portable Executable |
The DOS MZ executable format is the executable file format used for .EXE files under the DOS and Windows operating systems.
The file can be identified by the ASCII string "MZ" (hexadecimal: 4D 5A) at the beginning of the file (the "magic number"). "MZ" are the initials of Mark Zbikowski, one of the leading developers of MS-DOS.[1]
The MZ DOS executable format is newer than the COM executable format and differs from it. The DOS executable header contains relocation information, which allows multiple segments to be loaded at arbitrary memory addresses, and it supports executables larger than 64 KB; however, programs in this format are still subject to the relatively low memory limits of DOS. These limits were later bypassed using DOS extenders.
Segment handling
The environment of an EXE program run by DOS is found in its Program Segment Prefix (PSP).
EXE files normally have separate segments for the code, data, and stack. Program execution begins at the CS:IP entry point given in the header (commonly address 0 of the code segment), and the stack pointer register is set to whatever value is contained in the header information (thus if the header specifies a 512-byte stack, the stack pointer is set to 200h). It is possible to not use a separate stack segment and simply use the code segment for the stack if desired.
When an EXE file is started, the DS (data segment) register points to the PSP segment and is not loaded with the actual segment address of the data segment; the programmer must set it, generally via the following instructions:
MOV AX, @DATA
MOV DS, AX
Termination
In the original DOS 1.x API, it was also necessary to have the CS register pointing to the segment with the PSP at program termination; this was done via the following instructions:
PUSH DS
XOR AX, AX
PUSH AX
Program termination would then be performed by a RETF instruction, which would retrieve the original segment address with the PSP from the stack and then jump to address 0, which contained an INT 20h instruction.
The DOS 2.x API introduced a new program termination function, INT 21h Function 4Ch, which does not require saving the PSP segment address at the start of the program, and Microsoft advised against the use of the older DOS 1.x method.
Compatibility
MZ DOS executables can be run from DOS and Windows 9x-based operating systems. 32-bit Windows NT-based operating systems can execute them using their built-in Virtual DOS machine (although some graphics modes are unsupported). 64-bit versions of Windows cannot execute them. Alternative ways to run these executables include DOSBox and DOSEMU.
MZ DOS executables can be created by linkers, like Digital Mars Optlink, MS linker, VALX or Open Watcom's WLINK; additionally, FASM can create them directly.
Further reading
- Paul, Matthias R. (2002-10-07) [2000]. "Re: Run a COM file". Newsgroup: alt.msdos.programmer. Retrieved 2017-09-03.
- Paul, Matthias (2002-10-07). "masm .com(PSP) related trouble". alt.lang.asm discussion group.
- Eager, Bob (2024-12-16). "Notes on the format of DOS .EXE files". PHOBOS reference material. Retrieved 2024-12-18.
References
[1] "Inside Windows: An In-Depth Look into the Win32 Portable Executable File Format". MSDN Magazine, February 2002. Archived 2018-07-11 at the Wayback Machine. "Every PE file begins with a small MS-DOS executable. ... The first bytes of a PE file begin with the traditional MS-DOS header, called an IMAGE_DOS_HEADER. The only two values of any importance are e_magic and e_lfanew. ... The e_magic field (a WORD) needs to be set to the value 0x5A4D. ... In ASCII representation, 0x5A4D is MZ, the initials of Mark Zbikowski, one of the original architects of MS-DOS."
The DOS MZ executable format is identified by the signature bytes 4D 5A (ASCII "MZ") at the file's beginning, named after Microsoft engineer Mark Zbikowski.[1][2] Introduced as the successor to the simpler COM format, it enables programs exceeding the 64 KB segment limit through support for dynamic memory allocation and code relocation, making it suitable for larger applications on early IBM PC-compatible hardware.[3][4]
Historically, the MZ format debuted with MS-DOS 1.0 in 1981, evolving from pre-release versions that briefly used a "ZM" signature before standardizing on "MZ," and it remained the primary executable type for MS-DOS through versions up to 7.0, as well as for real-mode compatibility in Windows 9x/Me.[5][4] In modern contexts, it persists as the MS-DOS stub in Portable Executable (PE) files for Windows NT-based systems: a minimal DOS program that displays an error message such as "This program cannot be run in DOS mode" when run in a DOS environment.[6] The stub lets the file be recognized as a valid MS-DOS executable, while the 32-bit offset stored at byte 60 (0x3C) of the MZ header locates the PE header used by Windows NT-based systems.[6][1]
The core structure of an MZ executable consists of a fixed 28-byte header followed by an optional relocation table, the program image, and potential overlays for additional data segments.[2][5] Key header fields include the bytes in the last page, the number of 512-byte pages in the file, the count of relocation entries, the header size in 16-byte paragraphs, minimum and maximum extra memory allocation in paragraphs, initial stack segment (SS) and pointer (SP) values, a checksum (often ignored in practice), the initial instruction pointer (IP) and code segment (CS), the relocation table offset, and an overlay number (typically 0 for the main program).[1][2] The relocation table, if present, contains offset-segment pairs that mark addresses to adjust during loading into memory, ensuring the program can run at various base addresses without fixed assumptions.[4] Beyond the standard header, extended fields may appear for compatibility with later formats, such as the offset to a New Executable (NE) or PE header.[5] This format's simplicity and efficiency defined early PC software distribution, influencing subsequent executable standards like the 32-bit PE format.[6][3]
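As a rough illustration of the extended-header pointer mentioned above, the following Python sketch reads the 32-bit offset at byte 0x3C and checks which signature it points at. The function name `extended_header_kind` is hypothetical, and the check is only a heuristic: a plain DOS program is free to keep arbitrary data at that position, so a match suggests rather than proves that a newer header is present.

```python
import struct

def extended_header_kind(data: bytes):
    """Return 'PE', 'NE', or 'LE' if a newer header seems present, else None.

    Hypothetical helper for illustration; it only inspects signatures.
    """
    # The pointer lives at 0x3C, so the file must hold at least 0x40 bytes.
    if len(data) < 0x40 or data[:2] not in (b"MZ", b"ZM"):
        return None
    (new_offset,) = struct.unpack_from("<I", data, 0x3C)
    # Reject offsets that are zero or point past the end of the file.
    if new_offset == 0 or new_offset + 4 > len(data):
        return None
    signature = data[new_offset:new_offset + 4]
    if signature == b"PE\0\0":
        return "PE"
    if signature[:2] == b"NE":
        return "NE"
    if signature[:2] == b"LE":
        return "LE"
    return None
```

A caller would typically read the whole file with `open(path, "rb").read()` and pass the bytes in; a result of `None` means the file should be treated as a plain DOS program.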
Introduction
Overview
The DOS MZ executable, also known as the MS-DOS EXE format, is the standard executable file format used for .EXE files in MS-DOS, enabling the creation of programs larger than the 64 KB limit imposed by the simpler COM format.[5][7] Introduced to address the constraints of early DOS executables, it allows for segmented memory management, supporting applications that exceed single-segment boundaries while maintaining compatibility with the real-mode x86 architecture.[4] As a relocatable binary format, the DOS MZ executable can be loaded at whatever address DOS chooses, because the loader adjusts segment addresses at load time, a key advancement over non-relocatable formats like COM.[4]
Files in this format begin with the magic number "MZ" (hexadecimal 4D 5A), serving as a signature for identification by loaders and tools.[5] The name "MZ" derives from Mark Zbikowski, a Microsoft developer who contributed to its design.[8] The format is associated with Internet media types such as application/x-dosexec, application/x-msdos-program, and application/x-ms-dos-executable.[9] It later influenced the development of advanced executable formats, including the New Executable (NE) and Portable Executable (PE).[4]
History
The DOS MZ executable format was introduced with MS-DOS 1.0 in August 1981, serving as a significant advancement over the earlier flat COM format, which was limited to 64 KB programs without support for segmentation or relocation. Early pre-release versions briefly used a "ZM" signature before standardizing on "MZ", with DOS 1.0 accepting both for backward compatibility.[5][10] This new format enabled the creation of larger executables by dividing code and data into multiple segments that could be loaded into memory dynamically, addressing the growing complexity of software for the IBM PC platform.[11] Developed by a Microsoft team led by engineer Mark Zbikowski, who designed the format and whose initials form its signature ("MZ", hexadecimal 4D 5A at bytes 0 and 1), the MZ executable was tailored for x86 real-mode operation under MS-DOS, ensuring compatibility with the limited memory addressing of early personal computers.[8][12]
Key milestones included built-in relocation support from the initial 1.x releases, allowing the loader to adjust segment addresses at load time to fit available memory.[4] With MS-DOS 2.0 in 1983, enhancements improved overall memory management, introducing dynamic allocation and deallocation that complemented the MZ format's segmented loading for more efficient use of expanded RAM on systems with hard drives and larger floppies.[13]
The MZ format's design influenced subsequent executable standards, paving the way for the New Executable (NE) format in 1985, which added dynamic linking for OS/2 and Windows 1.0 while retaining an MZ-compatible stub for backward compatibility.[14] It further evolved into the Linear Executable (LE) format in 1989, supporting 32-bit addressing for DOS extenders and OS/2 2.0 to handle protected-mode applications beyond real-mode constraints.[15] In the early PC software ecosystem, the MZ format became ubiquitous, serving as the output for major compilers and linkers such as Microsoft's LINK utility and Borland's Turbo linker, enabling the development and distribution of foundational applications like word processors and games throughout the 1980s.[5]
File Format
Header Structure
The DOS MZ executable format begins with a fixed 28-byte header that provides essential metadata for loading and executing the program under MS-DOS. This header, also known as the EXE header, is parsed by the operating system to determine the file's size, memory requirements, initial register states, and location of supporting data structures. The fixed header is always 28 bytes long, regardless of the file's overall size. The relocation table, if any entries are present, is located at the offset given in the header and usually follows the header directly; the program's load image begins at the offset given by the header paragraphs field.[16][5] The header's layout is as follows:

| Offset (hex) | Size (bytes) | Field Name | Description |
|---|---|---|---|
| 00 | 2 | Signature | The magic number 'MZ' (ASCII 4Dh 5Ah) or 'ZM' (5Ah 4Dh), identifying the file as a valid MZ executable. This signature honors Mark Zbikowski, a key Microsoft engineer involved in its design.[16][4] |
| 02 | 2 | Bytes in last page | The number of bytes in the final 512-byte page of the file, ranging from 1 to 511; a value of 0 indicates the last page is full (512 bytes). This field, combined with the total pages field, allows calculation of the exact file size.[16][5] |
| 04 | 2 | Total pages | The number of 512-byte pages in the file, including any partial last page. This value represents the file size in units of 512 bytes.[16][5] |
| 06 | 2 | Relocation entries count | The number of entries in the relocation table, which specifies offsets requiring segment address adjustments during loading. A value of 0 indicates no relocations are needed.[16][4] |
| 08 | 2 | Header paragraphs | The size of the header (including the relocation table) in paragraphs of 16 bytes each. This field indicates the starting offset of the loadable program image within the file. For example, a value of 4 means the header occupies the first 64 bytes.[16][5] |
| 0A | 2 | Minimum extra paragraphs | The minimum number of additional 16-byte paragraphs of memory the program requests beyond what is needed for the load image and program segment prefix (PSP). If insufficient memory is available, loading fails.[16][4] |
| 0C | 2 | Maximum extra paragraphs | The maximum number of additional 16-byte paragraphs the program can utilize beyond the minimum. DOS attempts to allocate as much as possible up to this limit from available memory. Historically, a value of FFFFh requests all remaining memory.[16][5] |
| 0E | 2 | Initial SS | The initial value for the stack segment register (SS), specified as a segment (paragraph) address relative to the start of the load image; the loader adds the actual load segment to it. This establishes the program's stack area in memory.[16][4] |
| 10 | 2 | Initial SP | The initial value for the stack pointer register (SP), defining the top of the stack upon program entry. Typically set to point to the end of the allocated stack space.[16][5] |
| 12 | 2 | Checksum | A simple integrity check computed as the one's complement of the sum of all 16-bit words in the file; the sum of all words including this checksum should equal zero. This field is often set to zero and not always verified by DOS.[16][4] |
| 14 | 2 | Initial IP | The initial value for the instruction pointer register (IP), indicating the offset within the code segment where execution begins.[16][5] |
| 16 | 2 | Initial CS | The initial value for the code segment register (CS), specified as a segment (paragraph) address relative to the start of the load image; the loader adds the actual load segment to it. This points to the base of the segment containing the entry point.[16][4] |
| 18 | 2 | Relocation table offset | The byte offset from the start of the file to the beginning of the relocation table. If no relocations exist, this may point to the load image or be unused.[16][5] |
| 1A | 2 | Overlay number | A value used for overlay management in segmented programs; 0 indicates the main executable module, while nonzero values denote overlays loaded on demand. This field is typically 0 for non-overlay programs.[16][4] |
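
The table above maps directly onto a fixed little-endian layout, so it can be decoded in a few lines. The following Python sketch is one possible way to do so; the names `MZHeader`, `parse_mz_header`, `file_size_from_header`, and `load_image_offset` are illustrative choices, not part of any standard tool, and the field names simply mirror the table.

```python
import struct
from typing import NamedTuple

class MZHeader(NamedTuple):
    """The 14 fields of the fixed 28-byte header, in file order."""
    signature: bytes
    bytes_in_last_page: int
    total_pages: int
    relocation_count: int
    header_paragraphs: int
    min_extra_paragraphs: int
    max_extra_paragraphs: int
    initial_ss: int
    initial_sp: int
    checksum: int
    initial_ip: int
    initial_cs: int
    relocation_table_offset: int
    overlay_number: int

# 2-byte signature followed by thirteen little-endian 16-bit words.
HEADER_FORMAT = "<2s13H"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 28 bytes

def parse_mz_header(data: bytes) -> MZHeader:
    """Decode the fixed header from the first 28 bytes of a file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("file is shorter than the 28-byte MZ header")
    header = MZHeader(*struct.unpack_from(HEADER_FORMAT, data, 0))
    if header.signature not in (b"MZ", b"ZM"):
        raise ValueError("missing MZ signature")
    return header

def file_size_from_header(header: MZHeader) -> int:
    """File size implied by the page fields; 0 means a full last page."""
    if header.bytes_in_last_page == 0:
        return header.total_pages * 512
    return (header.total_pages - 1) * 512 + header.bytes_in_last_page

def load_image_offset(header: MZHeader) -> int:
    """Byte offset in the file where the load image starts."""
    return header.header_paragraphs * 16
```

Comparing `file_size_from_header` with the actual file length is a simple way to spot appended data, such as overlays or a newer-format image stored after the DOS program.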
Segments and Relocation
The variable portions of a DOS MZ executable file follow the fixed 28-byte header: an optional relocation table and the load image, which holds the program's code, data, and stack segments. The header paragraphs field gives the combined size of the fixed header and relocation table in 16-byte units. The relocation table, if present, begins at the offset specified by the header's relocation table offset field (a 16-bit word at byte 24, or 0x18). It enables relocatability by listing the locations within the load image where segment addresses must be adjusted based on the actual memory location chosen by the DOS loader.[4][17]
The relocation table is an array of entries, with the number of entries given by the 16-bit word at byte 6 of the header (up to 65,535 possible, though practical limits are lower due to file size constraints). Each entry is 4 bytes long: the first 2 bytes hold a byte offset, and the second 2 bytes hold a segment value in paragraphs relative to the start of the load image. Each entry identifies a 16-bit word in the image that stores a segment address needing correction. Because the header size is expressed in paragraphs, the header region including the table ends on a 16-byte boundary, and linkers commonly pad it with null bytes to a 512-byte boundary. For example, an entry with offset 0x0010 and segment 0x0001 points to the location at (0x0001 * 16 + 0x0010) in the image, where the loader will add the base load segment to the existing 16-bit value to fix references, such as in instructions addressing data segments.[4][18][17]
Following the relocation table, the load image forms the core of the executable, divided into logical segments for code, data, and stack without dedicated segment headers or explicit boundaries in the file. Instead, segment positions are inferred from the header's initial CS and SS fields (offsets 0x16 and 0x0E, both relative to the load image start) and the overall image size, calculated from the file length minus the header and relocation table sizes. The entire image is loaded contiguously into memory in 16-byte paragraphs, with the code segment typically first, followed by data, and reserved space at the end for the stack. This organization allows the program to exceed the 64 KB limit of .COM files by supporting multiple segments, though each individual segment is constrained to 64 KB due to 16-bit addressing.[4][19][5]
During loading, the DOS loader processes the relocation table sequentially after copying the image to a base segment address (usually just above the 256-byte Program Segment Prefix). For each entry, it computes the target address in the loaded image as (entry segment * 16 + entry offset), then adds the base load segment value to the 16-bit word at that address, ensuring that all stored segment references (e.g., to data or stack segments) resolve correctly regardless of the load position. This fix-up occurs before control is transferred to the program's entry point, so the executable can run at any paragraph-aligned address within DOS's segmented memory model.[4][18][17]
A key limitation of this scheme is the 64 KB cap per segment, as offsets within segments are 16-bit. Programs too large for the available memory rely on overlays, where non-resident segments are swapped into memory by the application itself using DOS interrupts, rather than being loaded automatically. The relocation table itself is capped by available file space and header fields, often resulting in fewer than 2,048 entries in typical programs to avoid excessive overhead. No formal support exists for dynamic relocation beyond this static table, reflecting the format's design for real-mode 8086 processors.[5][19][18]
Execution Process
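The relocation fix-up described in the Segments and Relocation section can be sketched in Python as follows. This is a simplified model rather than the actual DOS loader: the load image is held in a `bytearray`, and `read_relocation_table` and `apply_relocations` are hypothetical helper names chosen for this example.

```python
import struct

def read_relocation_table(data: bytes, table_offset: int, count: int):
    """Return (offset, segment) pairs from the 4-byte table entries."""
    return [
        struct.unpack_from("<HH", data, table_offset + 4 * index)
        for index in range(count)
    ]

def apply_relocations(image: bytearray, relocations, base_segment: int) -> None:
    """Add base_segment to each 16-bit word named by a relocation entry.

    Positions are relative to the start of the load image, so the
    target of an entry is segment * 16 + offset within `image`.
    """
    for offset, segment in relocations:
        target = segment * 16 + offset
        (value,) = struct.unpack_from("<H", image, target)
        # Segment registers are 16 bits wide, so the sum wraps at 0x10000.
        struct.pack_into("<H", image, target, (value + base_segment) & 0xFFFF)
```

With the example entry from the text (offset 0x0010, segment 0x0001), the word at image position 0x20 is patched; if it held 0x0002 and the image were loaded at segment 0x1234, it would become 0x1236.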
Loading Mechanism
The MS-DOS loader begins executing an MZ executable when a program invokes interrupt 21h with AH=4Bh to load the file and transfer control to it. Upon receiving the file path and parameter block, the loader opens the file and reads the fixed 28-byte header (0x1C bytes, offsets 00h-1Bh) to verify the MZ signature and extract key parameters.[20] The header fields at offsets 02h-03h (bytes in last page) and 04h-05h (total pages) determine the file size as (pages - 1) × 512 + bytes in last page, or pages × 512 when the bytes-in-last-page field is 0. The load module size is then calculated by subtracting the header length (from offsets 08h-09h, in 16-byte paragraphs).[20]
Memory allocation begins with the creation of the Program Segment Prefix (PSP) at the start of the memory block allocated to the program. The loader reserves enough paragraphs for the load module plus the additional space dictated by the minimum (offsets 0Ah-0Bh) and maximum (offsets 0Ch-0Dh) allocation fields in the header; it attempts to allocate the load size plus the maximum if that much is available, and otherwise takes the largest contiguous free block that meets or exceeds the minimum requirement plus the load size.[20] If the available memory falls short of the minimum allocation plus load size, the loader fails the request with error code 8 (insufficient memory).[20] The executable's load image is placed at a base segment address that is 16 paragraphs (256 bytes, or 0x100 bytes) after the PSP segment (PSP segment + 0x10), keeping it separate from the PSP and environment variables, which precede the executable segments in memory.[20]
With memory allocated, the loader reads the file pages sequentially into the target memory region beginning at the base segment, skipping the header and loading only the resident portion of the program (overlays, indicated by the overlay number at offsets 1Ah-1Bh, are deferred for on-demand loading by the program itself during runtime).[20] Relocation follows: the loader reads the relocation table (at the offset specified in offsets 18h-19h, with the entry count at offsets 06h-07h) and, for each 4-byte entry consisting of an offset word followed by a segment word, adds the base segment to the 16-bit word located at (base segment + entry segment):offset within the loaded image, patching stored segment addresses to reflect the actual load location.[20] This ensures the relocatable code and data segments are correctly adjusted without fixed assumptions about memory placement.
Finally, the loader initializes the CPU registers for program entry: DS and ES are set to the PSP segment for access to command-line parameters and file control blocks; SS is set to the stack segment value from header offsets 0Eh-0Fh plus the base segment, with SP loaded from offsets 10h-11h; CS is set to the code segment from offsets 16h-17h plus the base segment, and IP to the initial offset from 14h-15h (offsets 12h-13h contain the checksum, often ignored).[20] Control then transfers to the entry point via a far jump, and the caller remains suspended until the program terminates, for example via interrupt 21h with AH=4Ch.[20] The header fields for initial CS:IP thus directly determine the starting execution point after loading.[20]
Segment Handling
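The register values that the loader hands to a freshly loaded program follow mechanically from the header fields and the PSP segment. The short Python sketch below computes them; `InitialRegisters` and `initial_registers` are hypothetical names used only for this illustration, and the function models the arithmetic alone, not memory allocation or file reading.

```python
from typing import NamedTuple

class InitialRegisters(NamedTuple):
    """Segment and pointer registers at program entry."""
    cs: int
    ip: int
    ss: int
    sp: int
    ds: int
    es: int

def initial_registers(psp_segment: int, initial_cs: int, initial_ip: int,
                      initial_ss: int, initial_sp: int) -> InitialRegisters:
    """Derive the entry register state from header values.

    The load image starts 0x10 paragraphs (256 bytes) after the PSP,
    and the header's CS and SS are relative to that base segment.
    """
    base_segment = (psp_segment + 0x10) & 0xFFFF
    return InitialRegisters(
        cs=(initial_cs + base_segment) & 0xFFFF,
        ip=initial_ip,
        ss=(initial_ss + base_segment) & 0xFFFF,
        sp=initial_sp,
        ds=psp_segment,  # DS and ES point at the PSP, not at program data
        es=psp_segment,
    )
```

For instance, with a PSP at segment 0x1000 and header values CS=0x0002, IP=0x0100, SS=0x0010, SP=0x0200, the program would start at 1012:0100 with its stack at 1020:0200 and DS = ES = 0x1000.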
In DOS MZ executables, the code segment register (CS) is automatically initialized by the loader to the relocated value from the EXE header at offset 0x16, with the instruction pointer (IP) set to the entry point offset from offset 0x14, so that execution begins at the header's entry point (often offset 0 relative to CS).[4] The stack segment register (SS) is similarly set to the relocated value from header offset 0x0E, and the stack pointer (SP) to the initial value from offset 0x10; for example, an SP of 0x0200 gives a 512-byte stack for basic program operations.[4] In contrast, the data segment register (DS) and extra segment register (ES) are initialized by the loader to point to the Program Segment Prefix (PSP), a 256-byte structure used by DOS for program management, rather than the program's own data area. Programmers must therefore initialize DS (and often ES) at runtime to access the program's data segment correctly; otherwise, data references resolve relative to the PSP, so the program reads the wrong memory or overwrites the PSP.
Access between segments requires far calls or jumps (e.g., CALL FAR PTR segment:offset), which load the target segment into CS, while intra-segment operations use near calls or jumps that leave CS unchanged.[4] For data structures exceeding 64 KB (the maximum size of a single segment), manual segment arithmetic is necessary: the segment value is advanced by the appropriate number of 16-byte paragraphs before offsetting into the extended area. Because arithmetic cannot be performed directly on a segment register, the value passes through a general-purpose register (e.g., MOV AX, DS followed by ADD AX, 1000h and MOV DS, AX to advance by 64 KB).
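The segment arithmetic mentioned above rests on the real-mode rule that a linear address equals segment × 16 + offset. The following Python sketch models that rule and shows how a far pointer can be advanced across a 64 KB boundary; the function names are hypothetical, and the model deliberately ignores the 20-bit address wraparound of the original 8086.

```python
def linear_address(segment: int, offset: int) -> int:
    """Real-mode linear address: each segment unit is one 16-byte paragraph."""
    return (segment << 4) + offset

def normalize(linear: int):
    """Split a linear address into (segment, offset) with offset below 16."""
    return (linear >> 4) & 0xFFFF, linear & 0xF

def advance_far_pointer(segment: int, offset: int, byte_count: int):
    """Move a far pointer forward by byte_count bytes, renormalizing it.

    Adding to the offset alone would wrap at 64 KB; carrying the
    overflow into the segment keeps the pointer moving linearly.
    """
    return normalize(linear_address(segment, offset) + byte_count)
```

Advancing a segment value by 0x1000 paragraphs moves the window by exactly 0x10000 bytes (64 KB), which is why the text uses that constant to step through a large array one segment at a time.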
Stack operations occur automatically within the SS segment via the SP register, with push instructions decrementing SP by 2 bytes (for 16-bit words) before storing data and pop instructions reversing the process; the initial allocation from the header provides sufficient space for function calls and local variables in small programs, though larger applications may adjust SS or SP dynamically if more stack is needed.[4] Overlay handling, supported by the MZ format's overlay number field in the header (offset 0x1A), demands custom code to manage segment swaps, such as loading overlay data from the file into a temporary segment and updating segment registers to switch between the main program and overlays on demand.[4]
A typical initialization sequence in assembly code at the program's entry point appears as follows, ensuring DS points to the data segment defined during linking (e.g., via the @data symbol):
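.MODEL SMALL
.STACK 200h                ; Reserve a 512-byte stack
.DATA
.CODE
START:
    MOV AX, @DATA          ; Load data segment address into AX
    MOV DS, AX             ; Set DS to data segment
    MOV ES, AX             ; Optionally set ES to data segment for string operations
    ; Program logic follows
    MOV AX, 4C00h          ; Terminate with return code 0 (DOS 2.x and later)
    INT 21h
END START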
This setup, common in MS-DOS assembly and high-level language runtimes, ensures that data references resolve to the program's own data segment rather than the PSP throughout execution.