ZIP (file format)
from Wikipedia

ZIP file format
Filename extension: .zip, .zipx, .z01, .zx01
Internet media type: application/zip, application/x-zip-compressed[1]
Uniform Type Identifier (UTI): com.pkware.zip-archive
Magic number:
  • none
  • PK\x03\x04
  • PK\x05\x06 (empty)
  • PK\x07\x08 (spanned)
Developed by: PKWARE, Inc.
Initial release: 14 February 1989
Latest release: 6.3.10 (1 November 2022)
Type of format: Data compression
Extended to
Standard
Open format: Yes

ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed. The ZIP file format permits a number of compression algorithms, though DEFLATE is the most common. This format was originally created in 1989 and was first implemented in PKWARE, Inc.'s PKZIP utility,[2] as a replacement for the previous ARC compression format by Thom Henderson. The ZIP format was then quickly supported by many software utilities other than PKZIP. Microsoft has included built-in ZIP support (under the name "compressed folders") in versions of Microsoft Windows since 1998 via the "Plus! 98" add-on for Windows 98. Native support was added as of the year 2000 in Windows ME.[citation needed] Apple has included built-in ZIP support in Mac OS X 10.3 (via BOMArchiveHelper, now Archive Utility) and later. Most free operating systems have built-in support for ZIP in a manner similar to Windows and macOS.

ZIP files generally use the file extensions .zip or .ZIP and the MIME media type application/zip.[1] ZIP is used as a base file format by many programs, usually under a different name. When navigating a file system via a user interface, graphical icons representing ZIP files often appear as a document or other object prominently featuring a zipper.

History


The .ZIP file format was designed by Phil Katz of PKWARE and Gary Conway of Infinity Design Concepts. The format was created after Systems Enhancement Associates (SEA) filed a lawsuit against PKWARE claiming that the latter's archiving products, named PKARC, were derivatives of SEA's ARC archiving system.[3] The name "zip" (meaning "move at high speed") was suggested by Katz's friend, Robert Mahoney.[4] They wanted to imply that their product would be faster than ARC and other compression formats of the time.[4] The earliest known version of the .ZIP File Format Specification was published as part of the PKZIP 0.9 package under the file name APPNOTE.TXT in 1989.[citation needed] Because the ZIP file format was distributed within APPNOTE.TXT, compatibility with the format proliferated widely on the public Internet during the 1990s.[5]

PKWARE and Infinity Design Concepts made a joint press release on February 14, 1989, releasing the .ZIP file format into the public domain.[6][7][8][9][10]

Version history


The .ZIP File Format Specification has its own version number, which does not necessarily correspond to the version numbers for the PKZIP tool, especially with PKZIP 6 or later. At various times, PKWARE has added preliminary features that allow PKZIP products to extract archives using advanced features, but PKZIP products that create such archives are not made available until the next major release. Other companies or organizations support the PKWARE specifications at their own pace.

The .ZIP file format specification is formally named "APPNOTE - .ZIP File Format Specification" and has been published on the PKWARE.com website since the late 1990s.[11] Several versions of the specification were not published. Specifications of some features, such as BZIP2 compression and strong encryption, were published by PKWARE a few years after their creation. The URL of the online specification has changed several times on the PKWARE website.

A summary of key advances in various versions of the PKWARE software and/or specification:

  • 2.0: (1993)[1] File entries can be compressed with DEFLATE and use traditional PKWARE encryption (ZipCrypto).
  • 2.1: (1996) Deflate64 compression support (claimed in APPNOTE 6.1.0 published much later).[12] APPNOTE may not have been published for 2.1.
  • 2.5: PKWARE DCL Implode compression.[12] APPNOTE may not have been published for 2.5.
  • 2.5: Deflate64 compression support (claimed in later user manuals, e.g. in 2004.)[13]
  • 4.0: (2000) Deflate64 compression support (according to information provided by Jim Peterson, Chief Scientist, PKWARE, to the Library of Congress; and APPNOTE 4.0).[14][15]
  • 4.5: (2001)[16] Documented 64-bit zip format.
  • 4.6: (2001) BZIP2 compression (not published online until the publication of APPNOTE 5.2)
  • 5.0: (2002) SES: DES, Triple DES, RC2, RC4 supported for encryption (not published online until the publication of APPNOTE 5.2)
  • 5.2: (2003)[17][18] AES encryption support for SES (defined in APPNOTE 5.1 that was not published online) and AES from WinZip ("AE-x"); corrected version of RC2-64 supported for SES encryption.
  • 6.1: (2004)[12] Documented certificate storage.
  • 6.2.0: (2004)[19] Documented Central Directory Encryption.
  • 6.3.0: (2006)[20] Documented Unicode (UTF-8) filename storage. Expanded list of supported compression algorithms (LZMA, PPMd+), encryption algorithms (Blowfish, Twofish), and hashes.
  • 6.3.1: (2007)[21] Corrected standard hash values for SHA-256/384/512.
  • 6.3.2: (2007)[22] Documented compression method 97 (WavPack).
  • 6.3.3: (2012)[23] Document formatting changes to facilitate referencing the PKWARE Application Note from other standards using methods such as the JTC 1 Referencing Explanatory Report (RER) as directed by JTC 1/SC 34 N 1621.
  • 6.3.4: (2014)[24] Updates the PKWARE, Inc. office address.
  • 6.3.5: (2018)[25] Documented compression methods 16, 96 and 99, DOS timestamp epoch and precision, added extra fields for keys and decryption, as well as typos and clarifications.
  • 6.3.6: (2019)[26] Corrected typographical error.
  • 6.3.7: (2020)[27] Added Zstandard compression method ID 20.
  • 6.3.8: (2020)[28] Moved Zstandard compression method ID from 20 to 93, deprecating the former. Documented method IDs 94 and 95 (MP3 and XZ respectively).
  • 6.3.9: (2020)[29] Corrected a typo in Data Stream Alignment description.
  • 6.3.10: (2022)[30] Added several z/OS attribute values for APPENDIX B. Added several additional 3rd party Extra Field mappings.

WinZip, starting with version 12.1, uses the extension .zipx for ZIP files that use compression methods newer than DEFLATE; specifically, the methods bzip2, LZMA, PPMd, JPEG and WavPack. The last two are applied to appropriate file types when "Best method" compression is selected.[31][32]

Standardization


In April 2010, ISO/IEC JTC 1 initiated a ballot to determine whether a project should be initiated to create an ISO/IEC International Standard format compatible with ZIP.[33] The proposed project, entitled Document Packaging, envisaged a ZIP-compatible 'minimal compressed archive format' suitable for use with a number of existing standards including OpenDocument, Office Open XML and EPUB. It would solve problems such as the need for a formal standard, the variety of extensions of ZIP, the undesirability of a technology used for Open Standards potentially having proprietary extensions or "submarine" patents (i.e. which could surface unexpectedly), the need for better internationalization, and a desire not to actually fragment the technology further by purporting to provide an alternative specification to the PKWARE APPNOTE document.

In 2015, ISO/IEC 21320-1 "Document Container File — Part 1: Core" was published. It states that "Document container files are conforming Zip files" and normatively references the PKWARE APPNOTE document. It imposes the following main restrictions on the ZIP file format:[34]

  • Files in ZIP archives may only be stored uncompressed, or using the "deflate" compression (i.e. compression method may contain the value "0" - stored or "8" - deflated). The patent on the core "deflate" compression method expired in late 2010.[35]
  • The encryption features are prohibited.
  • The digital signature features (from SES) are prohibited.
  • The "patched data" features (from PKPatchMaker) are prohibited.
  • Archives may not span multiple volumes or be segmented.

Design


.ZIP files are archives that store multiple files. ZIP allows contained files to be compressed using many different methods, as well as simply storing a file without compressing it. Each file is stored separately, allowing different files in the same archive to be compressed using different methods. Because the files in a ZIP archive are compressed individually, it is possible to extract them, or add new ones, without applying compression or decompression to the entire archive. This contrasts with the format of compressed tar files, for which such random-access processing is not easily possible.
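The per-entry compression described above can be observed with Python's standard zipfile module. The following is a minimal sketch rather than a reference implementation; the helper names (build_mixed_archive, entry_methods, read_one) and the entry names are hypothetical and chosen only for illustration.

```python
import io
import zipfile


def build_mixed_archive() -> bytes:
    """Create an in-memory archive whose two entries use different methods."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Entry stored without compression (method 0).
        zf.writestr("stored.txt", b"a" * 1000, compress_type=zipfile.ZIP_STORED)
        # Entry compressed with DEFLATE (method 8).
        zf.writestr("deflated.txt", b"a" * 1000, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def entry_methods(archive: bytes) -> dict:
    """Map each entry name to its compression method number."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return {info.filename: info.compress_type for info in zf.infolist()}


def read_one(archive: bytes, name: str) -> bytes:
    """Extract a single entry; the other entries are not decompressed."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)
```

Calling entry_methods(build_mixed_archive()) reports method 0 for the stored entry and method 8 for the deflated one, and read_one retrieves either entry without touching the other.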

A directory is placed at the end of a ZIP file. It identifies which files are in the ZIP and where in the ZIP each file is located. This allows ZIP readers to load the list of files without reading the entire ZIP archive. ZIP archives can also include extra data that is not related to the ZIP archive. This allows for a ZIP archive to be made into a self-extracting archive (an application that decompresses its contained data), by prepending the program code to a ZIP archive and marking the file as executable. Storing the catalog at the end also makes it possible to hide a zipped file by appending it to an innocuous file, such as a GIF image file.

The .ZIP format uses CRC-32 and includes two copies of each entry's metadata to provide greater protection against data loss. The CRC-32 algorithm was contributed by David Schwaderer and can be found in his book "C Programmers Guide to NetBIOS" published by Howard W. Sams & Co. Inc.[36]
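As a rough illustration of the CRC-32 check, the sketch below compares the checksum recorded for an entry with one computed independently using zlib.crc32. It assumes Python's zipfile and zlib modules; the function name crc_matches and the entry name data.bin are hypothetical.

```python
import io
import zipfile
import zlib


def crc_matches(payload: bytes) -> bool:
    """Write payload into a one-entry archive and verify its stored CRC-32."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("data.bin", payload)

    with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
        info = zf.getinfo("data.bin")
        # The CRC-32 covers the uncompressed data, not the compressed stream.
        return info.CRC == (zlib.crc32(payload) & 0xFFFFFFFF)
```

Because the checksum is taken over the uncompressed data, the comparison holds regardless of which compression method the entry uses.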

Structure

ZIP-64 Internal Layout

A ZIP file is correctly identified by the presence of an end of central directory record which is located at the end of the archive structure in order to allow the easy appending of new files. If the end of central directory record indicates a non-empty archive, the name of each file or directory within the archive should be specified in a central directory entry, along with other metadata about the entry, and an offset into the ZIP file, pointing to the actual entry data. This allows a file listing of the archive to be performed relatively quickly, as the entire archive does not have to be read to see the list of files. The entries within the ZIP file also include this information, for redundancy, in a local file header. Because ZIP files may be appended to, only files specified in the central directory at the end of the file are valid. Scanning a ZIP file for local file headers is invalid (except in the case of corrupted archives), as the central directory may declare that some files have been deleted and other files have been updated.

For example, we may start with a ZIP file that contains files A, B and C. File B is then deleted and C updated. This may be achieved by just appending a new file C to the end of the original ZIP file and adding a new central directory that only lists file A and the new file C. When ZIP was first designed, transferring files by floppy disk was common, yet writing to disks was very time-consuming. If you had a large zip file, possibly spanning multiple disks, and only needed to update a few files, rather than reading and re-writing all the files, it would be substantially faster to just read the old central directory, append the new files then append an updated central directory.

The order of the file entries in the central directory need not coincide with the order of file entries in the archive.

Each entry stored in a ZIP archive is introduced by a local file header with information about the file such as the compression method, file size and file name, followed by optional "extra" data fields, and then the possibly compressed, possibly encrypted file data. The "Extra" data fields are the key to the extensibility of the ZIP format. "Extra" fields are exploited to support the ZIP64 format, WinZip-compatible AES encryption, file attributes, and higher-resolution NTFS or Unix file timestamps. Other extensions are possible via the "Extra" field. ZIP tools are required by the specification to ignore Extra fields they do not recognize.

The ZIP format uses specific 4-byte "signatures" to denote the various structures in the file. Each file entry is marked by a specific signature. The end of central directory record is indicated with its specific signature, and each entry in the central directory starts with the 4-byte central file header signature.

There is no BOF or EOF marker in the ZIP specification. Conventionally the first thing in a ZIP file is a ZIP entry, which can be identified easily by its local file header signature. However, this is not necessarily the case, as this is not required by the ZIP specification - most notably, a self-extracting archive will begin with an executable file header.

Tools that correctly read ZIP archives must scan for the end of central directory record signature, and then, as appropriate, the other, indicated, central directory records. They must not scan for entries from the top of the ZIP file, because (as previously mentioned in this section) only the central directory specifies where a file chunk starts and that it has not been deleted. Scanning could lead to false positives, as the format does not forbid other data to be between chunks, nor file data streams from containing such signatures. However, tools that attempt to recover data from damaged ZIP archives will most likely scan the archive for local file header signatures; this is made more difficult by the fact that the compressed size of a file chunk may be stored after the file chunk, making sequential processing difficult.
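The backward scan for the end of central directory record might be sketched as follows. This is an illustrative approach, not the method of any particular tool: the function name find_eocd is hypothetical, and the sanity check (the comment must run exactly to the end of the data) is a simplifying assumption that stricter or more lenient readers may not share. The 22-byte record size and 65,535-byte comment limit are those given elsewhere in this article.

```python
EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_MIN_SIZE = 22       # fixed part of the record, without the comment
MAX_COMMENT = 0xFFFF     # the comment length is a 16-bit field


def find_eocd(data: bytes) -> int:
    """Return the offset of the EOCD record in data, or -1 if none is found."""
    # The record cannot start earlier than 22 + 65535 bytes before the end.
    start = max(0, len(data) - EOCD_MIN_SIZE - MAX_COMMENT)
    pos = data.rfind(EOCD_SIGNATURE, start)
    while pos != -1:
        if pos + EOCD_MIN_SIZE <= len(data):
            comment_len = int.from_bytes(data[pos + 20:pos + 22], "little")
            # Assumption: accept a candidate only if its comment ends at EOF.
            if pos + EOCD_MIN_SIZE + comment_len == len(data):
                return pos
        # Otherwise keep searching for an earlier occurrence of the signature.
        pos = data.rfind(EOCD_SIGNATURE, start, pos)
    return -1
```

Searching from the end rather than from the start means that data prepended to the archive, such as a self-extractor stub, does not interfere with locating the record.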

Written as 32-bit integers, most of the signatures end with the short integer 0x4b50; because values are stored in little-endian ordering, its two bytes (50 4B) come first in the file. Viewed as an ASCII string this reads "PK", the initials of the inventor Phil Katz. Thus, when a ZIP file is viewed in a text editor the first two bytes of the file are usually "PK". (DOS, OS/2 and Windows self-extracting ZIPs have an EXE before the ZIP so start with "MZ"; self-extracting ZIPs for other operating systems may similarly be preceded by executable code for extracting the archive's content on that platform.)

The .ZIP specification also supports spreading archives across multiple file-system files. Originally intended for storage of large ZIP files across multiple floppy disks, this feature is now used for sending ZIP archives in parts over email, or over other transports or removable media.

The FAT filesystem of DOS has a timestamp resolution of only two seconds; ZIP file records mimic this. As a result, the built-in timestamp resolution of files in a ZIP archive is only two seconds, though extra fields can be used to store more precise timestamps. The ZIP format has no notion of time zone, so timestamps are only meaningful if it is known what time zone they were created in.
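The two-second resolution follows from how MS-DOS packs a timestamp into two 16-bit values: the seconds are stored divided by two in a 5-bit field. The bit layout used below is the conventional MS-DOS one and is not spelled out in this article, so treat the sketch as an assumption-labeled illustration; the function names are hypothetical.

```python
from datetime import datetime


def decode_dos_datetime(dos_date: int, dos_time: int) -> datetime:
    """Decode MS-DOS packed date and time fields into a naive datetime."""
    second = (dos_time & 0x1F) * 2            # bits 0-4: seconds / 2
    minute = (dos_time >> 5) & 0x3F           # bits 5-10
    hour = (dos_time >> 11) & 0x1F            # bits 11-15
    day = dos_date & 0x1F                     # bits 0-4
    month = (dos_date >> 5) & 0x0F            # bits 5-8
    year = ((dos_date >> 9) & 0x7F) + 1980    # bits 9-15: years since 1980
    return datetime(year, month, day, hour, minute, second)


def encode_dos_datetime(dt: datetime) -> tuple:
    """Pack a datetime into (dos_date, dos_time); odd seconds are rounded down."""
    dos_date = ((dt.year - 1980) << 9) | (dt.month << 5) | dt.day
    dos_time = (dt.hour << 11) | (dt.minute << 5) | (dt.second // 2)
    return dos_date, dos_time
```

Encoding 03:04:05 and decoding it again yields 03:04:04, which is the loss of precision the paragraph above describes. The result carries no time zone, matching the format's lack of one.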

In September 2006, PKWARE released a revision of the ZIP specification providing for the storage of file names using UTF-8, finally adding Unicode compatibility to ZIP.[20]

File headers


All multi-byte values in the header are stored in little-endian byte order. All length fields count the length in bytes.

Local file header

Offset (bytes)  Size (bytes)  Description[37]
0               4             Magic number. Must be 50 4B 03 04.
4               2             Version needed to extract (minimum).
6               2             General purpose bit flag.
8               2             Compression method; e.g. none = 0, DEFLATE = 8 (or "\x08\x00").
10              2             File last modification time.
12              2             File last modification date.
14              4             CRC-32 of uncompressed data.
18              4             Compressed size (or FF FF FF FF for ZIP64).
22              4             Uncompressed size (or FF FF FF FF for ZIP64).
26              2             File name length (n).
28              2             Extra field length (m).
30              n             File name.
30+n            m             Extra field.

The extra field contains a variety of optional data such as OS-specific attributes. It is divided into records, each with at minimum a 16-bit signature and a 16-bit length. A ZIP64 local file extra field record, for example, has the signature 0x0001 and a length of 16 bytes (or more) so that two 64-bit values (the uncompressed and compressed sizes) may follow. Another common local file extension is 0x5455 (or "UT") which contains 32-bit UTC UNIX timestamps.

The local file header, including its file name and extra field, is immediately followed by the (possibly compressed) file data.
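The local file header layout in the table above maps directly onto a fixed 30-byte little-endian structure followed by two variable-length fields. The sketch below parses it with Python's struct module; the function name parse_local_header and the dictionary keys are hypothetical, and ZIP64 or data-descriptor handling is deliberately left out.

```python
import struct

# signature, version, flags, method, time, date, crc, csize, usize, n, m
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # 30 bytes in total


def parse_local_header(data: bytes, offset: int = 0) -> dict:
    """Parse the local file header that starts at the given offset."""
    (sig, version, flags, method, mtime, mdate,
     crc, csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, offset)
    if sig != b"PK\x03\x04":
        raise ValueError("not a local file header")

    name_start = offset + LOCAL_HEADER.size   # offset 30
    extra_start = name_start + name_len       # offset 30 + n
    data_start = extra_start + extra_len      # file data begins here
    return {
        "version_needed": version,
        "flags": flags,
        "method": method,
        "crc32": crc,
        "compressed_size": csize,
        "uncompressed_size": usize,
        "name": data[name_start:extra_start],
        "extra": data[extra_start:data_start],
        "data_offset": data_start,
    }
```

The returned data_offset marks where the entry's data begins, so a stored (method 0) entry can be read by slicing compressed_size bytes from that position.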

Data descriptor


If bit 3 (mask 0x08) of the general-purpose bit flag field is set, the CRC-32 and file sizes are not known when the local header is written. In that case the corresponding fields in the local header (or in the Zip64 extended information extra field, for archives in Zip64 format) are filled with zero, and the CRC-32 and sizes are appended in a data descriptor immediately after the compressed data. The descriptor is a 12-byte structure, optionally preceded by a 4-byte signature; in Zip64 format the compressed and uncompressed size fields are 8 bytes long instead of 4 (see section 4.3.9.2[38]), which makes the structure 20 bytes long:

Offset (bytes)  Size (bytes)  Description[37]
0               0 or 4        Optional magic number. If present, must be 50 4B 07 08.
0 or 4          4             CRC-32 of uncompressed data.
4 or 8          4 or 8        Compressed size.
8 or 12 or 16   4 or 8        Uncompressed size.

Central directory file header (CDFH)


The central directory file header entry is an expanded form of the local header:

Offset (bytes)  Size (bytes)  Description[37]
0               4             Magic number. Must be 50 4B 01 02.
4               2             Version made by.
6               2             Version needed to extract (minimum).
8               2             General purpose bit flag.
10              2             Compression method.
12              2             File last modification time.
14              2             File last modification date.
16              4             CRC-32 of uncompressed data.
20              4             Compressed size (or FF FF FF FF for ZIP64).
24              4             Uncompressed size (or FF FF FF FF for ZIP64).
28              2             File name length (n).
30              2             Extra field length (m).
32              2             File comment length (k).
34              2             Disk number where file starts (or FF FF for ZIP64).
36              2             Internal file attributes.
38              4             External file attributes.
42              4             Relative offset of local file header (or FF FF FF FF for ZIP64). This is the number of bytes between the start of the first disk on which the file occurs, and the start of the local file header. This allows software reading the central directory to locate the position of the file inside the ZIP file.
46              n             File name.
46+n            m             Extra field.
46+n+m          k             File comment.
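Walking the central directory amounts to unpacking the 46-byte fixed part of each header from the table above and then skipping its three variable-length fields. A minimal sketch, assuming a single-disk archive without ZIP64 extensions; the function name iter_central_directory and the dictionary keys are hypothetical.

```python
import struct

# 46-byte fixed part of a central directory file header.
CDFH = struct.Struct("<4sHHHHHHIIIHHHHHII")


def iter_central_directory(data: bytes, cd_offset: int, count: int):
    """Yield one dict per central directory entry, starting at cd_offset."""
    pos = cd_offset
    for _ in range(count):
        (sig, made_by, needed, flags, method, mtime, mdate, crc,
         csize, usize, name_len, extra_len, comment_len,
         disk, internal, external, lfh_offset) = CDFH.unpack_from(data, pos)
        if sig != b"PK\x01\x02":
            raise ValueError("bad central directory signature")

        name_start = pos + CDFH.size
        comment_start = name_start + name_len + extra_len
        yield {
            "name": data[name_start:name_start + name_len],
            "method": method,
            "compressed_size": csize,
            "uncompressed_size": usize,
            "local_header_offset": lfh_offset,
            "comment": data[comment_start:comment_start + comment_len],
        }
        # The next header follows the name, extra field and comment.
        pos = comment_start + comment_len
```

Each yielded local_header_offset tells a reader where to seek for the corresponding local file header, which is how a file listing can be produced without reading any entry data.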

End of central directory record (EOCD)


After all the central directory entries comes the end of central directory (EOCD) record, which marks the end of the ZIP file:

Offset (bytes)  Size (bytes)  Description[37]
0               4             Magic number. Must be 50 4B 05 06.
4               2             Number of this disk (or FF FF for ZIP64).
6               2             Disk where central directory starts (or FF FF for ZIP64).
8               2             Number of central directory records on this disk (or FF FF for ZIP64).
10              2             Total number of central directory records (or FF FF for ZIP64).
12              4             Size of central directory in bytes (or FF FF FF FF for ZIP64).
16              4             Offset of start of central directory, relative to start of archive (or FF FF FF FF for ZIP64).
20              2             Comment length (n).
22              n             Comment.

This ordering allows a ZIP file to be created in one pass, but the central directory is also placed at the end of the file in order to facilitate easy removal of files from multiple-part (e.g. "multiple floppy-disk") archives, as previously discussed.
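Once the offset of the EOCD record is known, its fields can be unpacked according to the table above. The following sketch assumes the caller already has that offset; the function name parse_eocd, the dictionary keys and the zip64_hint flag are hypothetical names used for illustration.

```python
import struct

# 22-byte fixed part of the end of central directory record.
EOCD = struct.Struct("<4sHHHHIIH")


def parse_eocd(data: bytes, offset: int) -> dict:
    """Parse the EOCD record located at the given offset."""
    (sig, disk, cd_disk, entries_here, entries_total,
     cd_size, cd_offset, comment_len) = EOCD.unpack_from(data, offset)
    if sig != b"PK\x05\x06":
        raise ValueError("not an end of central directory record")

    comment_start = offset + EOCD.size
    # Saturated fields signal that the real values live in ZIP64 records.
    zip64_hint = (
        0xFFFF in (disk, cd_disk, entries_here, entries_total)
        or 0xFFFFFFFF in (cd_size, cd_offset)
    )
    return {
        "total_entries": entries_total,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "comment": data[comment_start:comment_start + comment_len],
        "zip64_hint": zip64_hint,
    }
```

The cd_offset and total_entries values are what a reader needs in order to start iterating over the central directory file headers.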

Compression methods


The .ZIP File Format Specification documents the following compression methods: Store (no compression), Shrink (LZW), Reduce (levels 1–4; LZ77 + probabilistic), Implode, Deflate, Deflate64, bzip2, LZMA, Zstandard, WavPack, PPMd, and a LZ77 variant provided by IBM z/OS CMPSC instruction.[39][30] The most commonly used compression method is DEFLATE, which is described in IETF RFC 1951.

Other methods mentioned, but not documented in detail in the specification include: PKWARE DCL Implode (old IBM TERSE), new IBM TERSE, IBM LZ77 z Architecture (PFS), and a JPEG variant. A "Tokenize" method was reserved for a third party, but support was never added.[25]

The word Implode is overused by PKWARE: the DCL/TERSE Implode is distinct from the old PKZIP Implode, a predecessor to Deflate. The DCL Implode is undocumented partially due to its proprietary nature held by IBM, but Mark Adler has nevertheless provided a decompressor called "blast" alongside zlib.[40]

Encryption


ZIP supports a simple password-based symmetric encryption system generally known as ZipCrypto. It is documented in the ZIP specification, and known to be seriously flawed. In particular, it is vulnerable to known-plaintext attacks, which are in some cases made worse by poor implementations of random-number generators.[5] Computers running under native Microsoft Windows without third-party archivers can open, but not create, ZIP files encrypted with ZipCrypto, but cannot extract the contents of files using different encryption.[41]

New features including new compression and encryption (e.g. AES) methods have been documented in the ZIP File Format Specification since version 5.2. A WinZip-developed AES-based open standard ("AE-x" in APPNOTE) is used also by 7-Zip and Xceed, but some vendors use other formats.[42] PKWARE SecureZIP (SES, proprietary) also supports RC2, RC4, DES, Triple DES encryption methods, Digital Certificate-based encryption and authentication (X.509), and archive header encryption. It is, however, patented (see § Strong encryption controversy).[43]

File name encryption was introduced in .ZIP File Format Specification 6.2. It encrypts the metadata stored in the Central Directory portion of an archive, but the Local Header sections remain unencrypted. A compliant archiver can falsify the Local Header data when using Central Directory Encryption. As of version 6.2 of the specification, the Compression Method and Compressed Size fields within the Local Header are not yet masked.

ZIP64


The original .ZIP format had a 4 GiB (2^32 bytes) limit on various things (uncompressed size of a file, compressed size of a file, and total size of the archive), as well as a limit of 65,535 (2^16 − 1) entries in a ZIP archive. In version 4.5 of the specification (which is not the same as v4.5 of any particular tool), PKWARE introduced the "ZIP64" format extensions to get around these limitations, increasing the limits to 16 EiB (2^64 bytes). In essence, it uses a "normal" central directory entry for a file, followed by an optional "zip64" directory entry, which has the larger fields.[44]

The format of the Local file header (LOC) and Central directory file header (CDFH) are the same in ZIP and ZIP64. However, ZIP64 specifies an extra field that may be added to those records at the discretion of the compressor, whose purpose is to store values that do not fit in the classic LOC or CDFH records. To signal that the actual values are stored in ZIP64 extra fields, they are set to 0xFFFF or 0xFFFFFFFF in the corresponding LOC or CDFH record. If one entry does not fit into the classic LOC or CDFH record, only that entry is required to be moved into a ZIP64 extra field. The other entries may stay in the classic record. Therefore, not all entries shown in the following table might be stored in a ZIP64 extra field. However, if they appear, their order must be as shown in the table.

Zip64 extended information extra field
Offset (bytes)  Size (bytes)  Description[37]
0               2             Header ID 0x0001.
2               2             Size of the extra field chunk (8, 16, 24 or 28).
4               8             Original uncompressed file size.
12              8             Size of compressed data.
20              8             Offset of local header record.
28              4             Number of the disk on which this file starts.
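Because only saturated fields are moved into the ZIP64 extra field, and always in the order shown in the table, a reader has to consult the classic record to know which values are present. The sketch below walks the extra field records, skips unrecognized header IDs, and fills in whichever classic values were set to 0xFFFFFFFF or 0xFFFF. The function name resolve_zip64 and its parameter names are hypothetical.

```python
import struct


def resolve_zip64(extra: bytes, usize: int, csize: int,
                  offset: int, disk: int) -> tuple:
    """Replace saturated classic values with those from the ZIP64 extra field."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        body = extra[pos + 4:pos + 4 + size]
        pos += 4 + size
        if header_id != 0x0001:
            continue  # not the ZIP64 record; unknown records are skipped

        cur = 0
        # Values appear in this fixed order, but only if the classic field
        # was saturated; absent values are simply not stored.
        if usize == 0xFFFFFFFF:
            usize = struct.unpack_from("<Q", body, cur)[0]
            cur += 8
        if csize == 0xFFFFFFFF:
            csize = struct.unpack_from("<Q", body, cur)[0]
            cur += 8
        if offset == 0xFFFFFFFF:
            offset = struct.unpack_from("<Q", body, cur)[0]
            cur += 8
        if disk == 0xFFFF:
            disk = struct.unpack_from("<I", body, cur)[0]
        break
    return usize, csize, offset, disk
```

For a central directory entry whose uncompressed size alone is 0xFFFFFFFF, the ZIP64 record therefore holds a single 8-byte value, and the other arguments are returned unchanged.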

On the other hand, the format of EOCD for ZIP64 is slightly different from the normal ZIP version.[37]

Zip64 End of central directory record (EOCD64)
Offset (bytes)  Size (bytes)  Description[37]
0               4             Magic number. Must be 50 4B 06 06.
4               8             Size of the EOCD64 minus 12.
12              2             Version made by.
14              2             Version needed to extract (minimum).
16              4             Number of this disk.
20              4             Disk where central directory starts.
24              8             Number of central directory records on this disk.
32              8             Total number of central directory records.
40              8             Size of central directory in bytes.
48              8             Offset of start of central directory, relative to start of archive.
56              n             Comment (up to the size of EOCD64).

The EOCD64 is not necessarily the last record in the file. It is followed by a 20-byte End of Central Directory Locator, and the classic EOCD record.

Zip64 End of Central Directory Locator
Offset (bytes)  Size (bytes)  Description[37]
0               4             Magic number. Must be 50 4B 06 07.
4               4             Disk where EOCD64 starts.
8               8             Offset to start of EOCD64, relative to start of archive.
16              4             Total number of disks.

The File Explorer in Windows XP does not support ZIP64, but the Explorer in Windows Vista and later does.[citation needed] Likewise, some extension libraries support ZIP64, such as DotNetZip, QuaZIP[45] and IO::Compress::Zip in Perl. Python's built-in zipfile has supported it since 2.5 and has defaulted to it since 3.4.[46] OpenJDK's built-in java.util.zip supports ZIP64 from Java 7.[47] The Android Java API has supported ZIP64 since Android 6.0.[48] macOS Sierra's Archive Utility notably does not support ZIP64, and can create corrupt archives when ZIP64 would be required.[49] However, the ditto command shipped with macOS will unzip ZIP64 files.[50] More recent[when?] versions of macOS ship with Info-ZIP's zip and unzip command-line tools, which do support ZIP64: to verify, run zip -v and look for "ZIP64_SUPPORT".

Combination with other file formats


The .ZIP file format allows for a comment containing up to 65,535 (2^16 − 1) bytes of data to occur at the end of the file after the central directory.[37] Also, because the central directory specifies the offset of each file in the archive with respect to the start, it is possible for the first file entry to start at an offset other than zero, although some tools might not process archive files that do not start with a file entry at offset zero. The program gzip, for example, happens to be able to extract an entry from a .ZIP file if it is at offset zero.

This allows arbitrary data to occur in the file both before and after the ZIP archive data, and for the archive to still be read by a ZIP application. A side-effect of this is that it is possible to author a file that is both a working ZIP archive and another format, provided that the other format tolerates arbitrary data at its end, beginning, or middle. Self-extracting archives (SFX), of the form supported by WinZip, take advantage of this, in that they are executable (.exe) files that conform to the PKZIP AppNote.txt specification, and can be read by compliant zip tools or libraries.
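The tolerance for leading data can be demonstrated with Python's zipfile module, which locates the archive from its end. In this sketch the prefix is simply concatenated in front of an existing archive without rewriting the stored offsets; it relies on zipfile compensating for the shift, which other tools may not do. The function names make_prefixed_archive and read_entry, and the entry name hello.txt, are hypothetical.

```python
import io
import zipfile


def make_prefixed_archive(prefix: bytes) -> bytes:
    """Build a one-entry archive and prepend arbitrary bytes to it."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", b"hello")
    # Offsets inside the archive are left as written (assumption: the reader
    # adjusts for the leading bytes, as Python's zipfile does).
    return prefix + buf.getvalue()


def read_entry(blob: bytes, name: str) -> bytes:
    """Open blob as a ZIP archive and return the named entry's contents."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.read(name)
```

A blob built with a GIF-like prefix, for example make_prefixed_archive(b"GIF89a" + bytes(64)), starts with the bytes of the other format yet still yields the entry through read_entry.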

This property of the .ZIP format, and of the JAR format which is a variant of ZIP, can be exploited to hide rogue content (such as harmful Java classes) inside a seemingly harmless file, such as a GIF image uploaded to the web. This so-called GIFAR exploit has been demonstrated as an effective attack against web applications such as Facebook.[51]

Limits


The minimum size of a .ZIP file is 22 bytes. Such an empty zip file contains only an End of Central Directory Record (EOCD): 50 4B 05 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
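The 22-byte empty archive can be reproduced and inspected in a few lines. This sketch assumes Python's zipfile module, which writes exactly such a record when an archive is closed without any entries; the constant and function names are hypothetical.

```python
import io
import struct
import zipfile

# End of central directory signature followed by 18 zero bytes.
EMPTY_ZIP = b"PK\x05\x06" + bytes(18)


def zipfile_empty_archive() -> bytes:
    """Return the bytes zipfile writes for an archive with no entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w"):
        pass  # add nothing; closing writes only the EOCD record
    return buf.getvalue()


def describe_empty_zip() -> tuple:
    """Return (size, unpacked EOCD fields, entry names) for EMPTY_ZIP."""
    fields = struct.unpack("<4sHHHHIIH", EMPTY_ZIP)
    with zipfile.ZipFile(io.BytesIO(EMPTY_ZIP)) as zf:
        names = zf.namelist()
    return len(EMPTY_ZIP), fields, names
```

Every field after the signature is zero: no disks, no central directory records, and a zero-length comment.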

The maximum size for both the archive file and the individual files inside it is 4,294,967,295 bytes (2^32 − 1 bytes, or 4 GiB minus 1 byte) for standard ZIP. For ZIP64, the maximum size is 18,446,744,073,709,551,615 bytes (2^64 − 1 bytes, or 16 EiB minus 1 byte).[52]

Open extensions


Seek-optimized (SOZip) profile


A Seek-Optimized ZIP file (SOZip) profile[53] has been proposed for the ZIP format. Such a file contains one or several Deflate-compressed files that are organized and annotated such that a SOZip-aware reader can perform very fast random access (seek) within a compressed file. SOZip makes it possible to access large compressed files directly from a .zip file without prior decompression. It combines the use of ZLib block flushes issued at regular intervals with a hidden index file mapping offsets of the uncompressed file to offsets in the compressed stream. ZIP readers that are not aware of that extension can read a SOZip-enabled file normally and ignore the extended features that support efficient seek capability.

Proprietary extensions


Extra field


The .ZIP file format includes an extra field facility within file headers, which can be used to store extra data not defined by existing ZIP specifications, and which allows compliant archivers that do not recognize the fields to safely skip them. Header IDs 0–31 are reserved for use by PKWARE. The remaining IDs can be used by third-party vendors for proprietary usage.

Strong encryption controversy


When WinZip 9.0 public beta was released in 2003, WinZip introduced its own AES-256 encryption, using a different file format, along with the documentation for the new specification.[54] The encryption standards themselves were not proprietary, but PKWARE had not updated APPNOTE.TXT to include Strong Encryption Specification (SES) since 2001, which had been used by PKZIP versions 5.0 and 6.0. WinZip technical consultant Kevin Kearney and StuffIt product manager Mathew Covington accused PKWARE of withholding SES, but PKZIP chief technology officer Jim Peterson claimed that certificate-based encryption was still incomplete.

In another controversial move, PKWARE applied for a patent on 16 July 2003 describing a method for combining ZIP and strong encryption to create a secure file.[55]

In the end, PKWARE and WinZip agreed to support each other's products. On 21 January 2004, PKWARE announced support for the WinZip-based AES encryption format.[56] A later beta version of WinZip was able to support SES-based ZIP files.[57] PKWARE eventually released version 5.2 of the .ZIP File Format Specification to the public, which documented SES. The Free Software project 7-Zip also supports AES, but not SES, in ZIP files (as does its POSIX port p7zip).

When using AES encryption under WinZip, the compression method is always set to 99, with the actual compression method stored in an AES extra data field.[58] In contrast, Strong Encryption Specification stores the compression method in the basic file header segment of Local Header and Central Directory, unless Central Directory Encryption is used to mask/encrypt metadata.

Implementation


There are numerous .ZIP tools available, and numerous .ZIP libraries for various programming environments; licenses used include proprietary and free software. WinZip, WinRAR, Info-ZIP, ZipGenius, 7-Zip, PeaZip and B1 Free Archiver are well-known .ZIP tools, available on various platforms. Some of those tools have library or programmatic interfaces.

Some development libraries licensed under open source agreement are libzip, libarchive, and Info-ZIP. For Java: Java Platform, Standard Edition contains the package "java.util.zip" to handle standard .ZIP files; the Zip64File library specifically supports large files (larger than 4 GiB) and treats .ZIP files using random access; and the Apache Ant tool contains a more complete implementation released under the Apache Software License.

The Info-ZIP implementations of the .ZIP format add support for Unix filesystem features, such as user and group IDs, file permissions, and support for symbolic links. The Apache Ant implementation is aware of these to the extent that it can create files with predefined Unix permissions. The Info-ZIP implementations also know how to use the error correction capabilities built into the .ZIP compression format. Some programs do not, and will fail on a file that has errors.

The Info-ZIP Windows tools also support NTFS filesystem permissions, and will make an attempt to translate from NTFS permissions to Unix permissions or vice versa when extracting files. This can result in potentially unintended combinations, e.g. .exe files being created on NTFS volumes with executable permission denied.

Versions of Microsoft Windows have included support for .ZIP compression in Explorer since the Microsoft Plus! pack was released for Windows 98. Microsoft calls this feature "Compressed Folders". Not all .ZIP features are supported by the Windows Compressed Folders capability. For example, creating encrypted archives is not supported in Windows 10 Home edition,[59] although it can decrypt them. Unicode entry encoding was not supported until Windows 7, split and spanned archives are neither readable nor writable by the Compressed Folders feature, and AES encryption is not supported.[60] Windows .zip support stemmed from an acquisition of "VisualZip", written by Dave Plummer.[61][62][63]

OpenDocument Format (ODF) started using the zip archive format in 2005. ODF is an open format for office documents of all types and is the default file format used in Collabora Online, LibreOffice and others.[64] Microsoft Office started using the zip archive format in 2006 for its Office Open XML .docx, .xlsx, .pptx, etc. files, which became the default file format with Microsoft Office 2007.

Internationalization issues

Versions of the format prior to 6.3.0 did not support storing file names in Unicode.[65] According to the standard,[65] file names should be stored in the CP437 encoding, which is standard for the IBM PC,[65] but in practice, DOS archivers used the system's installed character encoding. The built-in archiver of Windows up to 11 also used the DOS encoding corresponding to the selected system language for backward compatibility when creating archives. Subsequently, the standard was updated to include two options for storing file names in Unicode: 1) when bit 11 in the General purpose bit flag field is set, the file name in the "File name" field of the header should be considered as UTF-8 rather than a single-byte encoding, and 2) the Unicode Path Extra Field was added to store the file name in UTF-8 encoding.[65] Some versions of archivers on the Windows platform have also used ANSI encoding in the past. Thus, to correctly extract files with names containing non-English characters, an unpacker needs to apply the following checks in order:[66]

  1. Check for the presence of the Unicode Path Extra Field, and if it exists, use the filename from it, encoded in UTF-8.
  2. Check for the presence of flag 11 in the General purpose bit flag field, and if it is set, consider the filename encoding in the "File name" field to be UTF-8.
  3. If the "packing OS" field contains the value 11 (NTFS, Windows), and the "version of the packer" field value is greater than or equal to 20, consider the filename encoding in the "File name" field to be the ANSI (Windows) encoding corresponding to the system locale if one can be determined; otherwise, use CP437.
  4. If the "packing OS" field contains the value 0 (FAT, DOS), and the "version of the packer" field value is between 25 and 40 inclusive, consider the filename encoding in the local header's "File name" field to be ANSI (Windows) encoding, and in the central header's "File name" field to be OEM (DOS) encoding, corresponding to the system locale if one can be determined; otherwise, use CP437.
  5. In other cases, if the "packing OS" field contains the value 0 (FAT, DOS), 6 (HPFS, OS/2), or 11 (NTFS, Windows), consider the filename encoding in the "File name" field to be OEM (DOS) encoding, corresponding to the system locale if one can be determined; otherwise, use CP437.
  6. In all other cases, consider the filename encoding in the "File name" field to be the system encoding of the operating system the unpacker is running on.
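
The six checks above can be sketched as a single decision function. This is only an illustration: the function name, its parameters, and the idea of passing the locale's ANSI and OEM codecs in as optional strings are assumptions made for the example, not part of any archiver's API.

```python
from typing import Optional

UTF8_FLAG = 1 << 11  # bit 11 of the general purpose bit flag


def pick_name_encoding(flags: int, packing_os: int, packer_version: int,
                       has_unicode_path: bool, header: str = "central",
                       ansi: Optional[str] = None, oem: Optional[str] = None,
                       system: str = "utf-8") -> str:
    """Return the codec for an entry name, following steps 1-6 in order.

    ansi/oem are the locale's Windows and DOS codecs, or None when the
    locale cannot be determined (hypothetical parameters for this sketch).
    """
    if has_unicode_path:
        # Step 1: the name is taken from the Unicode Path Extra Field itself.
        return "utf-8"
    if flags & UTF8_FLAG:
        # Step 2: the "File name" field is declared to be UTF-8.
        return "utf-8"
    if packing_os == 11 and packer_version >= 20:
        # Step 3: NTFS/Windows packer, ANSI encoding.
        return ansi or "cp437"
    if packing_os == 0 and 25 <= packer_version <= 40:
        # Step 4: ANSI in the local header, OEM in the central header.
        return (ansi if header == "local" else oem) or "cp437"
    if packing_os in (0, 6, 11):
        # Step 5: OEM (DOS) encoding.
        return oem or "cp437"
    # Step 6: fall back to the encoding of the system running the unpacker.
    return system


print(pick_name_encoding(0, 0, 30, False, header="central",
                         ansi="cp1251", oem="cp866"))
```

Returning a codec name rather than a decoded string keeps the choice separate from decoding, so a caller can still fall back to a lossy decode if the bytes turn out not to be valid in the chosen encoding.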

Some implementations of zip unpackers did not implement this algorithm or only partially implemented it. As a result, when viewing the contents of an archive or extracting it, users saw a chaotic set of characters, known as "mojibake", instead of letters of the national alphabet. In 2016, this problem was solved in the far2l file and archive manager for Linux, BSD and Mac.[67] In 2024, a similar solution was added[68] to the version of 7zip used in the Debian distribution and its derivatives, and to the version of unzip used in the Ubuntu distribution and its derivatives.[66]

Legacy

There are numerous other standards and formats using "zip" as part of their name. For example, zip is distinct from gzip, and the latter is defined in IETF RFC 1952. Both zip and gzip primarily use the DEFLATE algorithm for compression. Likewise, the ZLIB format (IETF RFC 1950) also uses the DEFLATE compression algorithm, but specifies different headers for error and consistency checking. Other common, similarly named formats and programs with different native formats include 7-Zip, bzip2, and rzip.

Concerns

The theoretical maximum compression factor for a raw DEFLATE stream is about 1032 to one,[69] but by exploiting the ZIP format in unintended ways, ZIP archives with compression ratios of billions to one can be constructed. These zip bombs unzip to extremely large sizes, overwhelming the capacity of the computer they are decompressed on.[70]
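
The DEFLATE bound can be approached in practice with highly repetitive input. The following is a minimal sketch using Python's standard zlib module; the helper name and the 10 MiB input size are arbitrary choices for the illustration.

```python
import zlib


def raw_deflate_ratio(n_bytes: int) -> float:
    """Compress n_bytes of zeros as a raw DEFLATE stream and return the ratio."""
    # A negative wbits value asks zlib for a raw DEFLATE stream
    # (no zlib header or trailer), which is what ZIP method 8 stores.
    comp = zlib.compressobj(level=9, wbits=-15)
    out = comp.compress(bytes(n_bytes)) + comp.flush()
    return n_bytes / len(out)


ratio = raw_deflate_ratio(10 * 1024 * 1024)
print(f"about {ratio:.0f}:1")
```

Ratios far beyond this limit therefore cannot come from a single DEFLATE stream; they rely on structural tricks in the archive container.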

from Grokipedia
The ZIP file format is an archive file format that supports data compression, error detection, and file encryption for storing multiple files and directories in a single file, enabling efficient cross-platform data exchange and storage. It was developed by Phil Katz and introduced in 1989 as part of the PKZIP shareware utility by PKWARE, Inc., evolving from earlier formats like ARC and providing a public-domain specification for compression. The format's structure consists of local file headers preceding each compressed file's data, followed by an optional data descriptor, a central directory listing all files with metadata, and an end-of-central-directory record for locating the directory; this design allows for streaming, splitting into multi-part archives, and random access to contents. Primarily using the Deflate compression algorithm, ZIP also accommodates methods like BZIP2, LZMA, and others via extensible compression method codes, while supporting encryption options such as traditional PKWARE, WinZip AES, and certificate-based schemes. Maintained by PKWARE since Katz's death in 2000, the specification—detailed in the APPNOTE document—has advanced through versions up to 6.3.10 (2022), incorporating features like ZIP64 for files exceeding 4 GB, Unicode filename support via UTF-8, and digital signatures for integrity. As a de facto industry standard, ZIP is integral to formats like Office Open XML (OOXML) and Open Document Format (ODF), ensuring broad interoperability despite its proprietary origins.

History

Early Development and Versions

The ZIP file format was invented in 1989 by Phil Katz, founder of PKWARE, Inc., as part of the PKZIP archiving utility for MS-DOS systems. This development was prompted by a legal dispute with Systems Enhancement Associates (SEA), creators of the ARC format; Katz had previously released PKARC, a compatible archiver, leading to a 1988 lawsuit for copyright infringement and trademark violation, which resulted in a settlement prohibiting further ARC-compatible products. To address this, Katz designed ZIP as an independent alternative, releasing it into the public domain alongside PKZIP version 0.9 in February 1989, emphasizing efficient compression and file management for personal computers.

The initial version 1.0 of PKZIP, released later in 1989, introduced core ZIP features including support for basic compression methods such as Shrinking (a variant of the Lempel-Ziv-Welch algorithm) and Reduce (a predictive dictionary-based method with levels 1 through 4), alongside an uncompressed store option. It also established the central directory, a key organizational element that lists all files at the archive's end for efficient access, along with local file headers containing metadata like timestamps and compression details.

Subsequent versions evolved the format through incremental updates documented in PKWARE's APPNOTE specification, with major enhancements focusing on improved compression and security. The table below summarizes key version milestones:
Version | Release Year | Key Changes
1.0 | 1989 | Initial release with Shrinking, Reduce, and central directory; basic file archiving.
2.0 | 1993 | Added Deflate compression method (method ID 8), which combines LZ77 and Huffman coding for better efficiency; introduced traditional PKWARE encryption (ZipCrypto).
4.5 | 2001 | Introduced Strong Encryption Specification (SES) with DES, 3DES, RC2, and RC4 for improved security (version needed 45-49).
6.1 | 2003 | Added support for AES encryption (128/192/256-bit, method 99, via WinZip extension); ZIP64 extensions for files >4 GB using extra fields (introduced 2002).
6.3 | 2006 | Enhanced extra field and Unicode (UTF-8) filename support; specification revised to 6.3.10 (2022) with minor updates like additional extra field mappings and no fundamental changes since early 2000s.
PKWARE has maintained the ZIP specification through the APPNOTE document series since 1989, initially as a guide bundled with PKZIP and later updated publicly to detail format evolution, ensuring interoperability across tools. This effort laid the groundwork for broader standardization.

Standardization and Adoption

The ZIP file format lacks a formal standardization body such as ISO, but it has achieved de facto standardization through PKWARE's ongoing maintenance of the APPNOTE.TXT specification, first published in 1989 and periodically updated to ensure interoperability across implementations. This document serves as the primary reference for the format's structure and features, with versions tracked from 5.2 onward, emphasizing backward compatibility where possible. PKWARE has committed to public availability of the specification since its inception, facilitating broad adoption without proprietary restrictions on core elements.

In the early 1990s, the Info-ZIP group formed to address the need for open-source ZIP-compatible tools beyond DOS environments, reverse-engineering the format from binaries to develop portable implementations. Established via a mailing list in March 1990, the group released the Zip and UnZip utilities, which provided high-quality, free compression and extraction capabilities across Unix systems and other platforms. These efforts democratized access to ZIP handling, with Info-ZIP's software becoming a cornerstone for non-proprietary support and influencing extensions, such as filename handling, documented in later APPNOTE versions.

The format's compression core, Deflate, received informal IETF standardization through RFC 1951 in 1996, defining the combination of LZ77 and Huffman coding used in ZIP archives. This RFC, authored by L. Peter Deutsch, ensured consistent implementation across ZIP and related formats like gzip, without encompassing the full ZIP container specification. By the 2000s, ZIP's ubiquitous OS integration, despite the absence of a formal ISO endorsement, solidified its status as a universal archive standard, supported natively or via standard libraries in major systems.

Adoption accelerated in the 1990s through third-party tools like WinZip, released in 1991, which popularized ZIP on Windows by integrating with the Explorer shell and handling large archives. Native support followed in Windows Me (2000) for basic compression and extraction. On macOS, built-in ZIP handling arrived with OS X 10.3 (Panther) in 2003 via the Archive Utility, building on earlier Info-ZIP ports. Linux distributions incorporated Info-ZIP's Zip and UnZip as standard packages from the mid-1990s, and these later became the default tools behind desktop file managers. This cross-platform momentum transformed ZIP from a shareware-era utility into the default archive format for file distribution and backups by the early 2000s.

Format Specification

Overall Structure

The ZIP file format organizes data sequentially to facilitate efficient archiving and extraction of multiple files. At its core, the structure consists of local file headers followed immediately by the corresponding file data (either compressed or stored uncompressed), repeated for each file in the archive. This is followed by a central directory that serves as an index containing metadata for all files, and the archive concludes with an end-of-central-directory (EOCD) record that provides the offset and size of the central directory. This layout ensures that the archive can be appended to without rewriting the entire file, while allowing random access during extraction by first locating the central directory via the EOCD.

During the archiving process, files are packed sequentially: each begins with a local file header providing essential metadata, followed by the file's payload, and optionally a data descriptor for certain compression scenarios. The central directory then aggregates comprehensive metadata for navigation, enabling tools to list contents or extract specific files without scanning the entire archive from the start. Extraction typically begins by seeking to the end of the file to read the EOCD, which points to the central directory; from there, offsets guide access to individual local headers and their file data. This design supports both streaming creation and efficient querying.

Key concepts in the format include the distinction between uncompressed and compressed sizes: the original (uncompressed) size and the reduced size after any applied compression are both recorded in metadata, so that extractors can verify the result and allocate space. File attributes, such as permissions, timestamps, and external attributes for platform-specific details, are embedded to preserve file properties across systems. Directories are supported by treating them as special entries, often indicated by a trailing slash in the filename or by specific attribute flags, which preserves the directory hierarchy without storing any data for empty directories. Compression may be applied to file data sections to reduce storage, but the container structure itself remains independent of the method used. A textual representation of the ZIP file layout is as follows:

[Local File Header 1]
[File Data 1]
[Optional Data Descriptor 1]
[Local File Header 2]
[File Data 2]
[Optional Data Descriptor 2]
...
[Central Directory Header 1] [File Metadata 1]
[Central Directory Header 2] [File Metadata 2]
...
[End of Central Directory (EOCD) Record]

This arrangement positions the central directory and EOCD at the end for append-friendly operations.
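
The end-first navigation described above can be sketched with Python's standard library. The function name is made up for this example, and the backward search is deliberately simplified: it ignores ZIP64 and assumes the signature bytes do not also occur inside the archive comment.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FIXED_SIZE = 22          # bytes before the optional comment
MAX_COMMENT = 0xFFFF          # the comment length is a 2-byte field


def find_central_directory(data: bytes):
    """Scan backward for the EOCD and return (entries, cd_size, cd_offset)."""
    start = max(0, len(data) - EOCD_FIXED_SIZE - MAX_COMMENT)
    pos = data.rfind(EOCD_SIGNATURE, start)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    (_sig, _disk, _cd_disk, _entries_here, entries,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<4sHHHHIIH", data, pos)
    return entries, cd_size, cd_offset


# Build a small archive in memory to exercise the function.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("b.txt", "world")
data = buf.getvalue()

count, size, offset = find_central_directory(data)
print(count, size, offset, data[offset:offset + 4])
```

The bytes at the returned offset should carry the central directory signature, which is a cheap sanity check before parsing any entries.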

File Headers and Records

The ZIP file format organizes its metadata through a series of binary headers and records that provide essential information about the archive's contents, including file attributes, locations, and integrity checks. These structures ensure that archives can be efficiently parsed and reconstructed, with fixed-size signatures for identification and variable-length fields for flexibility. All headers begin with a 4-byte signature, and all multi-byte values are stored in little-endian byte order, as defined in the official ZIP specification.

The local file header precedes each file's compressed data and contains core metadata for that entry. It is a fixed 30-byte structure followed by variable-length components. The header starts with the signature 0x04034b50 (PK\003\004 in ASCII). This is followed by a 2-byte version needed to extract (indicating the minimum ZIP version required, such as 20 for version 2.0), a 2-byte general purpose bit flag (used for options like encryption or data descriptor presence), and a 2-byte compression method identifier. Next are 2-byte fields for the last modification file time and date (in MS-DOS format), a 4-byte CRC-32 checksum for integrity verification, and 4-byte fields for the compressed and uncompressed sizes. The structure concludes with 2-byte lengths for the filename and extra field. The filename follows as a variable-length UTF-8 or code page string (no null terminator), and the extra field provides extensible metadata. This header enables direct access to individual files without parsing the entire archive.

If bit 3 of the general purpose bit flag is set, the CRC-32 and size fields in the local header are set to zero and a data descriptor follows the compressed data instead (useful for streaming scenarios where sizes are unknown upfront). This optional 12-byte structure (or 16 bytes with an optional signature) includes a 4-byte CRC-32, followed by 4-byte compressed and uncompressed sizes. In ZIP64 extensions for archives exceeding 4 GB, these size fields extend to 8 bytes. The optional signature 0x08074b50 precedes the fields when present, aiding parsers in identifying the descriptor. Its purpose is to supply post-compression metadata for verification and size reporting.

The central directory file header (CDFH) forms the archive's index, with one record per file, each with a 46-byte fixed part, collected in the central directory section near the end of the file. It begins with the signature 0x02014b50 (PK\001\002). This is followed by a 2-byte version made by (indicating the ZIP version and host system used to create the entry), the 2-byte version needed to extract, 2-byte general purpose bit flag, and 2-byte compression method, mirroring the local header. It then includes the 2-byte last modification time, 2-byte date, 4-byte CRC-32, 4-byte compressed size, and 4-byte uncompressed size. Additional fields are 2-byte filename length, 2-byte extra field length, and 2-byte file comment length. Further details comprise a 2-byte disk number start (for multi-disk archives), 2-byte internal file attributes (e.g., text/binary flags), 4-byte external file attributes (platform-specific, like Unix permissions), and a 4-byte relative offset to the local header. Variable-length filename, extra field, and comment follow. The CDFH allows quick lookup of all files and their positions.

The end of central directory record (EOCD) marks the archive's conclusion with a 22-byte fixed structure plus a variable comment. It starts with the signature 0x06054b50 (PK\005\006), followed by 2-byte number of this disk, 2-byte disk number containing the central directory start, 2-byte total entries on this disk, 2-byte total central directory entries, 4-byte central directory size, 4-byte offset to central directory start, and 2-byte ZIP file comment length. The comment, if any, follows as a variable-length string. For large archives, ZIP64 supplies 8-byte equivalents of the limited fields through an extra field and additional end-of-archive records. The EOCD enables parsers to locate the central directory by searching backward from the file end.

Extra fields provide a mechanism for vendor-specific or extended metadata in local headers and CDFHs. Each extra field is a variable-length block starting with a 2-byte header ID (0-65535, with 0-31 reserved for PKWARE) and 2-byte data size, followed by the data block. The total extra field length in headers specifies the combined size of all such blocks. This structure supports future extensions without breaking compatibility.

Filenames, file comments, and ZIP comments are handled as variable-length strings, with lengths prefixed by 2-byte fields in their respective headers to indicate exact byte counts. These strings use UTF-8 encoding in modern implementations (signaled via bit 11 of the general purpose bit flag), falling back to CP437 or other OEM encodings otherwise, and are not null-terminated to optimize space. This approach allows efficient parsing while accommodating international characters.
Structure | Fixed Size (bytes) | Signature | Key Variable Components
Local File Header | 30 | 0x04034b50 | Filename, Extra Field
Data Descriptor | 12-16 | 0x08074b50 (optional) | None
Central Directory File Header | 46 | 0x02014b50 | Filename, Extra Field, File Comment
End of Central Directory Record | 22 | 0x06054b50 | ZIP Comment
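
As an illustration of the 30-byte local file header layout, the sketch below unpacks the fixed fields with Python's struct module and then reads the filename that follows. The field names in the named tuple are chosen for readability in this example and are not taken from the specification.

```python
import io
import struct
import zipfile
import zlib
from collections import namedtuple

LOCAL_FMT = "<4sHHHHHIIIHH"   # little-endian, 30 bytes in total
LocalHeader = namedtuple("LocalHeader", [
    "signature", "version_needed", "flags", "method", "mod_time", "mod_date",
    "crc32", "compressed_size", "uncompressed_size", "name_len", "extra_len",
])


def read_local_header(data: bytes, offset: int = 0):
    """Parse the fixed part of a local file header and decode the filename."""
    header = LocalHeader._make(struct.unpack_from(LOCAL_FMT, data, offset))
    if header.signature != b"PK\x03\x04":
        raise ValueError("not a local file header")
    name_start = offset + struct.calcsize(LOCAL_FMT)
    raw_name = data[name_start:name_start + header.name_len]
    # Bit 11 set means UTF-8; otherwise fall back to CP437.
    encoding = "utf-8" if header.flags & 0x0800 else "cp437"
    return header, raw_name.decode(encoding)


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", b"hello world", compress_type=zipfile.ZIP_STORED)

header, name = read_local_header(buf.getvalue())
print(name, header.method, header.uncompressed_size,
      header.crc32 == zlib.crc32(b"hello world"))
```

A complete reader would take sizes and offsets from the central directory rather than trusting the local header, since the local fields may be zero when a data descriptor is in use.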

Compression Methods

The ZIP file format supports multiple lossless compression algorithms, each identified by a two-byte method number stored in the file header, allowing creators to select based on desired trade-offs between compression ratio, speed, and computational resources. Early methods from the format's inception in 1989 include options optimized for the hardware of the time, while later additions leverage more advanced techniques for improved efficiency on modern systems.

The simplest method, number 0 (Store), applies no compression and stores data verbatim, useful for files that are already compressed or where speed is prioritized over size reduction. Obsolete early methods include number 1 (Shrunk), a variant of the Lempel-Ziv-Welch (LZW) algorithm using dynamic code tables starting at 9 bits and expanding to 13 bits with partial clearing to manage dictionary growth. Methods 2 through 5 (Reduced) combine adaptive Lempel-Ziv-style sequence encoding with probabilistic modeling via follower sets of 0 to 32 characters, where higher reduction factors (1-4) yield better ratios at the cost of increased processing time. Method 6 (Imploded) employs a sliding dictionary of 4 KB or 8 KB with Shannon-Fano coding trees (either 2 or 3), offering configurable minimum match lengths of 2 or 3 characters; larger dictionaries improve ratios but slow compression and decompression. These early methods (1-6) are largely deprecated in favor of more efficient alternatives.

The most widely used method since 1993 is number 8 (Deflate), which combines the LZ77 dictionary-based algorithm with Huffman coding to achieve balanced compression. Deflate operates on blocks of data using a 32 KB sliding window for LZ77 references, supporting three block types: uncompressed (for literal data), fixed Huffman codes, or dynamic Huffman codes that adapt to the input for better efficiency. The compression option used is recorded in general purpose bits 1-2, ranging from normal (balanced) to maximum (highest ratio, slower) or faster modes (lower ratio, quicker), with typical ratios of 2.5-3:1 on English text. A related LZ77 variant is method 9 (Deflate64 or Enhanced Deflating), which extends the window to 64 KB for better ratios on large files, though at reduced speed.

For higher compression ratios, method 12 (BZIP2) uses the Burrows-Wheeler block-sorting transform followed by a move-to-front transform and Huffman coding, often outperforming Deflate on text-heavy data but requiring more time and memory. Method 14 (LZMA) applies the Lempel-Ziv-Markov chain algorithm, an advanced LZ77 variant with a large dictionary (up to 4 GB) and adaptive probability modeling, achieving excellent ratios, particularly on structured data, at the expense of slower processing compared to Deflate. Other methods include 98 (PPMd), a prediction by partial matching algorithm that uses context-based statistical modeling for very high ratios on text, though it is computationally intensive. The specification also defines additional compression methods, including 93 (Zstandard, a modern algorithm offering high compression ratios with fast decompression), 95 (XZ, using LZMA2 for block-based compression), 94 (MP3 audio compression), 96 (JPEG image variant), 97 (WavPack for lossless audio), and 19 (IBM LZ77 z Architecture for mainframe environments). These extend the format's versatility for specialized data types.

Uncompressed data is handled via method 0 or Deflate's uncompressed block type, preserving original bytes without alteration. Deflate supports multi-part streams through sequential blocks, where the final block is flagged and LZ77 references span up to 32 KB across block boundaries, enabling continuous processing of large inputs.
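
A quick way to see the method numbers in practice is to write the same payload with several methods and read the codes back from the central directory. The sketch below uses Python's zipfile module, which supports only methods 0, 8, 12, and 14, and assumes the interpreter was built with the bz2 and lzma modules; the payload and entry names are arbitrary.

```python
import io
import zipfile

METHODS = {
    "store": zipfile.ZIP_STORED,      # method 0
    "deflate": zipfile.ZIP_DEFLATED,  # method 8
    "bzip2": zipfile.ZIP_BZIP2,       # method 12
    "lzma": zipfile.ZIP_LZMA,         # method 14
}


def compressed_sizes(payload: bytes) -> dict:
    """Store one copy of payload per method; return {name: (method_id, size)}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for label, method in METHODS.items():
            zf.writestr(label, payload, compress_type=method)
    with zipfile.ZipFile(buf) as zf:
        return {info.filename: (info.compress_type, info.compress_size)
                for info in zf.infolist()}


payload = b"the quick brown fox jumps over the lazy dog\n" * 2000
sizes = compressed_sizes(payload)
for label, (method_id, size) in sizes.items():
    print(f"{label:8s} method={method_id:2d} compressed={size}")
```

Because each entry records its own method, a single archive can mix stored and compressed members, which is how already-compressed files are usually handled.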

Security Features

Encryption Mechanisms

The ZIP file format supports encryption primarily at the file level, where individual files within the archive can be encrypted independently, though archive-level encryption of metadata is also possible through specific flags. Encryption is applied to the compressed data stream after compression, so the confidentiality mechanism operates on the reduced data size.

Traditional PKWARE encryption, introduced with ZIP version 2.0 in 1993, uses a stream cipher known as ZipCrypto with a 96-bit internal state derived from a user-provided password via a CRC-32-based key update function. The state consists of three 32-bit values initialized from the password and updated for each byte using CRC-32 operations, producing a keystream that is XORed with the data. Each encrypted file begins with a 12-byte encryption header of random bytes whose final portion carries a check value taken from the file's CRC-32, which lets an extractor reject most wrong passwords. The scheme is vulnerable to known-plaintext attacks requiring as little as 12-13 bytes of known data. This method is signaled by setting bit 0 of the general purpose bit flag in the local file header; the compression method field keeps its normal value.

In 2003, WinZip introduced an AES-based extension using compression method 99, supporting 128-bit, 192-bit, or 256-bit keys derived from the password via PBKDF2 with 1,000 iterations of HMAC-SHA1 and a random salt of 8, 12, or 16 bytes respectively. This extension employs AES in Counter (CTR) mode with a 16-byte block size and a little-endian counter starting from one, followed by a message authentication code (MAC) for integrity; its AE-2 variant additionally zeroes the CRC-32 field in headers to prevent leakage. The encryption details are stored in an extra field with ID 0x9901, including the version (0x0001 for AE-1, 0x0002 for AE-2), the vendor ID "AE", a one-byte key strength code (1, 2, or 3 for 128-, 192-, and 256-bit keys), and the actual compression method. The salt and a 2-byte password verification value precede the encrypted file data. Unlike traditional encryption, this provides stronger protection but requires WinZip-specific signaling, limiting interoperability with tools that lack support for it.
ZIP also supports certificate-based encryption, introduced in version 5.x of the specification. This method employs X.509v3 digital certificates, stored in extra fields such as ID 0x0014 for the certificate store, ID 0x0015 for file signatures, and ID 0x0017 for the strong encryption header specifying the algorithm (e.g., AES or 3DES) and key wrapping. A master key is generated randomly and wrapped with the recipient's public key, so decryption requires the corresponding private key, enabling secure sharing without passwords. This supports non-OAEP wrapping for compatibility with hardware tokens and is signaled via bit 6 of the general purpose bit flag (set together with bit 0).

ZIP encryption operates at the file level by default, encrypting only the compressed file data and leaving filenames and metadata in the unencrypted central directory, which exposes file names even in protected archives unless central directory encryption (bit 13) is set for strong encryption methods. When central directory encryption is enabled, an archive decryption header precedes the encrypted metadata, and local headers may have masked fields (e.g., sizes and CRC set to zero or random values) to obscure details, with bit 13 indicating such masking. Filenames therefore remain visible in the central directory unless the entire archive metadata is protected, a feature supported only in strong encryption implementations. Due to its cryptographic weaknesses, traditional PKWARE encryption has been deprecated for new archives in modern tools like 7-Zip and WinRAR, which support it only for compatibility with legacy files while recommending AES-based methods for security.
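
For the AES-based method, the password-to-key step can be sketched with Python's hashlib. The split of the derived material into an encryption key, an authentication key of the same length, and a 2-byte verification value follows the WinZip AE documentation as best understood here; the function name is an assumption made for the example, and the actual AES-CTR encryption and MAC computation are left out.

```python
import hashlib
import os

SALT_LEN = {128: 8, 192: 12, 256: 16}   # salt length in bytes per key size


def derive_winzip_aes_keys(password: bytes, salt: bytes, key_bits: int = 256):
    """Return (encryption_key, authentication_key, password_verifier)."""
    if len(salt) != SALT_LEN[key_bits]:
        raise ValueError("salt length does not match the key size")
    key_len = key_bits // 8
    # PBKDF2 with HMAC-SHA1 and 1,000 iterations, as described for the extension.
    material = hashlib.pbkdf2_hmac("sha1", password, salt, 1000,
                                   dklen=2 * key_len + 2)
    return (material[:key_len],
            material[key_len:2 * key_len],
            material[2 * key_len:])


salt = os.urandom(SALT_LEN[256])
enc_key, auth_key, verifier = derive_winzip_aes_keys(b"correct horse", salt)
print(len(enc_key), len(auth_key), verifier.hex())
```

The 2-byte verifier only lets an extractor reject most wrong passwords early; real integrity checking comes from the MAC appended to the encrypted data.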

Integrity and Authentication Features

ZIP supports digital signatures to verify the integrity and authenticity of files and the archive, introduced in later versions of the specification. Signatures are stored in extra fields, such as ID 0x0015 for digital signatures on individual files, using formats like PKCS#7 to sign the file contents and metadata. The central directory can also include a signature (extra field ID 0x0016) over the entire directory for archive-level verification. These features allow recipients to confirm that files have not been tampered with using public-key infrastructure, complementing CRC-32 checks with cryptographic strength. Support requires compatible tools, and signatures are optional to maintain backward compatibility.

Known Vulnerabilities

The ZIP Slip vulnerability, disclosed in 2018, enables path traversal attacks by allowing malicious ZIP files to include filenames with relative paths such as "../" sequences, which can escape intended extraction directories and overwrite arbitrary files on the target system during decompression. This affects implementations that fail to sanitize or normalize file paths before extraction, potentially leading to remote code execution or data tampering in vulnerable applications.

ZIP bombs, also known as decompression bombs, exploit the format's compression capabilities to create archives that appear small but expand dramatically upon extraction, causing denial-of-service conditions through memory exhaustion or disk overflow. These attacks often involve highly compressible patterns, such as repeated sequences, that can balloon a few kilobytes into gigabytes or more, overwhelming system resources in naive decompressors.

The CRC-32 checksum used in ZIP for file integrity verification is susceptible to collisions, allowing attackers to modify file contents while preserving the checksum, thus enabling undetected tampering in scenarios without additional protections. As a non-cryptographic checksum, CRC-32 can be deliberately collided with relatively low computational effort, undermining its reliability against intentional alterations.

Traditional ZIP encryption, predating AES support, relies on a weak stream cipher derived from user passwords and exhibits vulnerabilities including known-plaintext attacks and predictable key streams, making it prone to brute-force cracking or partial recovery of contents. Due to these flaws, such as the use of CRC-32 for password verification, which leaks information after a few attempts, experts recommend exclusive use of the stronger AES-based encryption methods introduced in later ZIP extensions.

These vulnerabilities have manifested in numerous software implementations, including CVEs in libraries like Apache Commons Compress, where crafted ZIP files trigger infinite loops or excessive memory allocation leading to denial-of-service. For instance, CVE-2021-36090 in Apache Commons Compress allows remote attackers to cause out-of-memory errors via specially constructed archives. Mitigations typically involve path normalization for ZIP Slip, resource limits for bombs, and enforcement of cryptographic hashes beyond CRC-32 for integrity.
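
Two of the mitigations named above, path normalization and a resource limit, can be sketched as follows. The function names and the 100 MiB budget are assumptions for the example. Python's own zipfile extraction already strips ".." components and absolute paths, so the explicit check mainly illustrates what an implementation without that safeguard would need, and the declared sizes it sums are attacker-controlled, so a robust extractor would also count bytes while decompressing.

```python
import os
import zipfile


def safe_target(dest_dir: str, member_name: str) -> str:
    """Resolve member_name under dest_dir, refusing paths that escape it."""
    base = os.path.realpath(dest_dir)
    target = os.path.realpath(os.path.join(base, member_name))
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"blocked path traversal: {member_name!r}")
    return target


def extract_checked(zf: zipfile.ZipFile, dest_dir: str,
                    max_total: int = 100 * 1024 * 1024) -> None:
    """Extract only if every path stays inside dest_dir and sizes fit the budget."""
    declared = sum(info.file_size for info in zf.infolist())
    if declared > max_total:
        raise ValueError("declared uncompressed size exceeds the budget")
    for info in zf.infolist():
        safe_target(dest_dir, info.filename)   # raises on traversal attempts
    zf.extractall(dest_dir)
```

Checking every entry before extracting anything means a malicious archive is rejected as a whole rather than partially unpacked.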

Extensions and Variations

ZIP64 Extensions

The ZIP64 extensions were introduced in PKZIP version 4.5 in 2001 to address the 16-bit and 32-bit limitations of the original ZIP format, enabling support for individual files larger than 4 GB, archives exceeding 4 GB in total size, more than 65,535 files per archive, and multi-disk archives spanning more than 65,535 disks. These extensions were detailed in updates to the PKWARE APPNOTE specification, which defines 64-bit (8-byte) fields for critical values such as uncompressed and compressed sizes and relative offsets, and a 32-bit field for disk numbers, wherever the original fields would otherwise overflow.

Central to ZIP64 is the extra field with header ID 0x0001, which extends the local file headers and central directory file headers. This variable-length field holds, in order, 8-byte values for the uncompressed size, compressed size, and relative offset of the local header, followed by a 4-byte starting disk number. Each value is included only when the corresponding field in the standard header is saturated.

For the end of central directory (EOCD), ZIP64 introduces two additional records: the ZIP64 EOCD record (signature PK\x06\x06, fixed size of 56 bytes plus optional extensible data), containing 8-byte fields for the number of entries on the current disk, the total number of entries, the size of the central directory, and the offset of the start of the central directory; and the ZIP64 EOCD locator (signature PK\x06\x07, fixed 20-byte size), specifying the disk number containing the ZIP64 EOCD record, its 8-byte offset, and the total number of disks. These records precede the standard EOCD record, which remains in place, and are used when thresholds are exceeded, such as more than 65,535 files or sizes over 4 GB.

Backward compatibility is maintained by setting overflowed fields in the standard headers to their maximum values (0xFFFFFFFF for 32-bit fields, 0xFFFF for 16-bit fields), prompting compliant parsers to consult the ZIP64 extra fields or records for the actual values. The extensions do not alter any compression methods defined in the ZIP specification, ensuring that data compression and decompression remain unchanged.

ZIP64 is invoked when any component surpasses the original limits (specifically, uncompressed or compressed file sizes over 4,294,967,295 bytes, central directory offsets beyond that threshold, or file counts exceeding 65,535) and is supported by tools implementing ZIP format version 4.5 or later, including modern archivers and compatible libraries.
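The sentinel mechanism can be illustrated with a short parser for the 0x0001 extra field of a central directory header. This is a sketch under stated assumptions: the function name `resolve_zip64` and its argument layout are hypothetical, and the caller is assumed to have already read the standard header fields and the raw extra-field bytes.

```python
import struct

ZIP64_EXTRA_ID = 0x0001
U32_MAX = 0xFFFFFFFF
U16_MAX = 0xFFFF


def resolve_zip64(extra, uncompressed, compressed, offset, disk):
    """Return (uncompressed, compressed, offset, disk) with ZIP64 values applied.

    `extra` is the raw extra-field area of a central directory header; the
    other arguments are the values read from its 32-bit and 16-bit fields.
    """
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        body = extra[pos + 4 : pos + 4 + size]
        if len(body) != size:
            raise ValueError("extra field runs past end of buffer")
        pos += 4 + size
        if header_id != ZIP64_EXTRA_ID:
            continue
        cur = 0
        # Values appear in this fixed order, but only for saturated fields.
        if uncompressed == U32_MAX:
            (uncompressed,) = struct.unpack_from("<Q", body, cur)
            cur += 8
        if compressed == U32_MAX:
            (compressed,) = struct.unpack_from("<Q", body, cur)
            cur += 8
        if offset == U32_MAX:
            (offset,) = struct.unpack_from("<Q", body, cur)
            cur += 8
        if disk == U16_MAX:
            (disk,) = struct.unpack_from("<I", body, cur)  # disk number is 4 bytes
        break
    return uncompressed, compressed, offset, disk
```

Because values are present only for saturated fields, the parser cannot interpret the extra field without first looking at the standard header, which is why both are passed in together.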

Open and Proprietary Extensions

The ZIP file format supports extensibility through extra fields, which allow for the addition of supplementary metadata without altering the core structure. These extra fields follow a standardized format consisting of a 2-byte header ID, a 2-byte data size (indicating the length of the following block), and the variable-length data itself, with the header values stored in little-endian byte order. However, the use of incompatible or unrecognized extra fields can lead to interoperability issues, as ZIP readers may ignore or mishandle unknown IDs, potentially resulting in extraction failures across different implementations.

Open extensions to the ZIP format include the Info-ZIP Unicode Path Extra Field, identified by header ID 0x7075, which stores UTF-8 encoded file names along with a CRC-32 of the standard name field for verification; this extension was introduced to address limitations in the original code page-based naming. Another open extension is the Seek-Optimized ZIP (SOZip) profile, which organizes Deflate-compressed files with annotations enabling random access and selective decompression without full archive extraction, making it suitable for large geospatial datasets; SOZip adheres to ZIP standards while adding an index of offset information. Additionally, the specification defines bit 11 of the general purpose bit flag in the file header to signal UTF-8 encoding for file names and comments, providing a lightweight alternative to extra fields for compatible tools.

Proprietary extensions include WinZip's signaling for AES encryption via compression method 99 in the file header, combined with an extra field (ID 0x9901) containing encryption details such as key length and vendor version, enabling stronger security than traditional ZIP encryption. WinZip also employs proprietary extra fields to embed image thumbnails, allowing preview functionality within the archiver without extracting files. PKWARE's strong-encryption extensions use the Strong Encryption Header extra field (ID 0x0017), which identifies the algorithm and key length and can include certificate data for encrypted archives.

To prevent conflicts among extensions, the APPNOTE specification reserves header IDs 0x0000 through 0x001F for PKWARE use and recommends that third-party developers select unique IDs with a distinctive signature pattern, submitting proposals to the ZIP File Specification Committee for review and inclusion.
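The extra-field layout described in this section (2-byte ID, 2-byte size, then data) lends itself to a small iterator. The sketch below also reads the Info-ZIP Unicode Path field, assuming its documented layout of a 1-byte version, a 4-byte CRC-32 of the standard name, and the UTF-8 name; the function names `iter_extra_fields` and `unicode_path` are hypothetical.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075


def iter_extra_fields(extra):
    """Yield (header_id, data) pairs from a raw extra-field area."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4 : pos + 4 + size]
        if len(data) != size:
            raise ValueError("extra field runs past end of buffer")
        yield header_id, data
        pos += 4 + size


def unicode_path(extra, raw_name):
    """Return the UTF-8 name from a 0x7075 field, or None if absent or stale."""
    for header_id, data in iter_extra_fields(extra):
        if header_id != UNICODE_PATH_ID or len(data) < 5:
            continue  # unknown IDs are skipped, not treated as errors
        version, name_crc = struct.unpack_from("<BI", data, 0)
        # The stored CRC-32 covers the header's standard name bytes; a mismatch
        # suggests the name was changed by a tool unaware of this field.
        if version == 1 and name_crc == zlib.crc32(raw_name):
            return data[5:].decode("utf-8")
    return None
```

Skipping unrecognized IDs by their declared size is what lets readers tolerate extensions they do not understand, provided the size field itself is trustworthy.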

Limitations and Compatibility

Inherent Format Limits

The base ZIP file format, as defined in the original specification, imposes several 16-bit and 32-bit field limits that constrain its utility for large-scale archiving. The maximum size for both compressed and uncompressed individual files is 4,294,967,295 bytes (approximately 4 GB), due to the use of 32-bit unsigned integer fields for these values. Similarly, the total number of files within an archive cannot exceed 65,535, as this is governed by a 16-bit field in the end-of-central-directory record. Filename lengths are also capped at 65,535 bytes, reflecting another 16-bit limit, while the offset of the central directory from the start of the archive must be less than 4 GB to fit within a 32-bit field. These constraints stem directly from the format's design in the late 1980s, when such field widths comfortably exceeded typical personal-computer storage.

Additional inherent caps further limit flexibility. Disk numbers in multi-volume archives are represented with 16 bits, restricting the format to a theoretical maximum of 65,536 disks (though practical implementations rarely approach this). The base format lacks native support for Unicode, relying instead on the 8-bit IBM Code Page 437 character set, which covers ASCII plus a limited set of extended characters. Timestamps are stored in MS-DOS format with two-second resolution: the seconds value occupies five bits and is stored halved, so only even seconds can be represented. For multi-volume archives, the specification supports splitting across disks or files, but each volume is inherently limited to under 4 GB, and common implementations name the segments .z01, .z02, and so on, with the final segment carrying the .zip extension; support for volumes exceeding 2 GB was inconsistent prior to extensions.

These limits became increasingly problematic as storage capacities and file sizes grew in the 1990s and beyond, necessitating extensions like ZIP64 to accommodate modern needs such as terabyte-scale files, millions of entries in archives, and international filenames.
| Limit Category | Base ZIP Constraint | Modern Context/Need |
|---|---|---|
| Individual File Size | ≤ 4 GB - 1 byte | Files often exceed 4 GB (e.g., high-resolution videos, databases); ZIP64 allows up to 16 EB. |
| Number of Files per Archive | ≤ 65,535 | Large projects may include millions of small files (e.g., software builds); ZIP64 supports up to 2^64 - 1. |
| Filename Length | ≤ 65,535 bytes | Modern filesystems support longer paths with Unicode; base ZIP restricts names to an ASCII-like encoding. |
| Archive/Volume Size | Central directory offset < 4 GB; volumes < 4 GB | Archives routinely surpass 4 GB (e.g., backups, distributions); multi-volume splitting is inadequate for volumes over 2 GB without extensions. |
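Two of the limits above are easy to make concrete in code: the fixed field widths and the packed MS-DOS timestamp. The following sketch uses hypothetical helper names; it treats the all-ones values as out of range because ZIP64-aware readers interpret them as sentinels, which is a conservative assumption rather than a rule every writer follows.

```python
from datetime import datetime

MAX_U16 = 0xFFFF       # entry count, name length, disk number
MAX_U32 = 0xFFFFFFFF   # sizes and offsets


def exceeds_base_limits(entry_count, largest_size, central_dir_offset):
    """True if any value cannot be stored safely in the base format's fields."""
    # The all-ones values double as ZIP64 sentinels, so this sketch treats
    # them as out of range as well.
    return (entry_count >= MAX_U16
            or largest_size >= MAX_U32
            or central_dir_offset >= MAX_U32)


def decode_dos_datetime(dos_date, dos_time):
    """Decode the 16-bit MS-DOS date and time fields into a naive datetime."""
    year = 1980 + (dos_date >> 9)          # 7 bits, years since 1980
    month = (dos_date >> 5) & 0x0F         # 4 bits
    day = dos_date & 0x1F                  # 5 bits
    hour = dos_time >> 11                  # 5 bits
    minute = (dos_time >> 5) & 0x3F        # 6 bits
    second = (dos_time & 0x1F) * 2         # 5 bits, stored in two-second units
    return datetime(year, month, day, hour, minute, second)


def encode_dos_datetime(dt):
    """Pack a datetime into (dos_date, dos_time); odd seconds round down."""
    dos_date = ((dt.year - 1980) << 9) | (dt.month << 5) | dt.day
    dos_time = (dt.hour << 11) | (dt.minute << 5) | (dt.second // 2)
    return dos_date, dos_time
```

Round-tripping a time with an odd seconds value through `encode_dos_datetime` and `decode_dos_datetime` loses one second, which is the two-second granularity described above.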

Implementation Challenges

Implementing ZIP file format support presents several practical challenges, particularly in handling internationalization of filenames. Early versions of the ZIP specification limited filenames to ASCII-compatible characters based on the MS-DOS code page 437 (CP437), which often led to mojibake (garbled text) when processing files with non-Latin characters on systems using different encodings. To address this, the specification evolved to include a language encoding flag (bit 11 in the general purpose bit flag of file headers) that indicates UTF-8 encoding for filenames and comments, introduced in revisions around 2006. However, inconsistent adoption across tools means that many older libraries and applications default to CP437 or other local code pages, resulting in unreliable display and extraction of internationalized filenames unless the flag is explicitly checked and supported.

Endianness introduces another layer of complexity, as the ZIP format mandates little-endian byte order for all multi-byte numeric fields, such as file sizes, offsets, and timestamps. On big-endian architectures, such as certain PowerPC systems, implementers must perform byte swaps during reading and writing to ensure correct interpretation, which can introduce performance overhead and risks of errors if not handled uniformly across the codebase. Failure to account for this can lead to misread headers, corrupted extractions, or invalid archives, especially in cross-compilation scenarios.

The ZIP structure's reliance on a central directory at the end of the file creates hurdles for streaming and random access operations. To decompress, parsers must first locate the end-of-central-directory record by scanning backward from the file's tail, then seek to individual local file headers, which requires random access or a two-pass process for full archive handling. This design, optimized for single-pass creation, complicates append-only scenarios, network transmission where files arrive incrementally, and memory-constrained environments, often requiring temporary buffering of the entire archive or custom extensions like streamable ZIP variants.

Library implementations frequently encounter issues related to partial compliance with the official APPNOTE specification, such as ignoring optional extra fields that store extended metadata. Older parsers, including those in Info-ZIP's UnZip, have suffered from buffer overflows when processing malformed or crafted archives, potentially leading to denial-of-service or code execution vulnerabilities (e.g., CVE-2018-1000035). Ensuring full adherence to the spec, including robust bounds checking and handling of variable-length fields, remains an ongoing challenge for developers seeking to avoid such flaws.

Cross-platform compatibility adds further difficulties in mapping file attributes and timestamps. The external file attributes field uses 32 bits, with the lower 16 bits for MS-DOS attributes (e.g., read-only, hidden) and the upper 16 bits for Unix-style permissions (e.g., owner/group/other read/write/execute bits), but not all tools preserve or interpret these consistently across operating systems. Timestamps are stored in MS-DOS format (DOSDATE and DOSTIME fields) without timezone information, assuming local time at creation, which causes discrepancies when extracting on systems in different timezones; files may appear offset by hours unless additional extra fields (such as extended timestamp fields) are used and supported.
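The backward search for the end-of-central-directory record and the bit 11 filename check can be sketched as follows. This is a simplified illustration for a single-disk, non-ZIP64 archive held entirely in memory; the function names `find_eocd` and `decode_name` are hypothetical, and the strict comment-length check is an assumption of this sketch that would reject archives with trailing data.

```python
import struct

EOCD_SIG = b"PK\x05\x06"
EOCD_MIN = 22            # size of the fixed part of the EOCD record
MAX_COMMENT = 0xFFFF     # the trailing comment length is a 16-bit field
UTF8_FLAG = 1 << 11      # general purpose bit 11 (language encoding flag)


def find_eocd(data):
    """Return (eocd_offset, total_entries, cd_size, cd_offset) for `data`."""
    start = max(0, len(data) - EOCD_MIN - MAX_COMMENT)
    pos = data.rfind(EOCD_SIG, start)
    while pos != -1:
        if pos + EOCD_MIN <= len(data):
            # "<" selects little-endian, as the format requires on any host.
            (_sig, _disk, _cd_disk, _disk_entries, total,
             cd_size, cd_offset, comment_len) = struct.unpack_from(
                "<4sHHHHIIH", data, pos)
            # Accept the candidate only if its comment ends exactly at EOF;
            # the signature bytes could also occur inside a comment.
            if pos + EOCD_MIN + comment_len == len(data):
                return pos, total, cd_size, cd_offset
        pos = data.rfind(EOCD_SIG, start, pos)  # keep scanning backward
    raise ValueError("end of central directory record not found")


def decode_name(raw_name, flags):
    """Decode a header's file name according to general purpose bit 11."""
    return raw_name.decode("utf-8" if flags & UTF8_FLAG else "cp437")
```

Using an explicit little-endian format string means the same code reads headers correctly on big-endian hosts without manual byte swapping.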

Legacy and Impact

Historical Significance

The ZIP file format, introduced in 1989 by Phil Katz through his company PKWARE, emerged from a contentious legal dispute with System Enhancement Associates (SEA), the creators of the earlier ARC archiving utility. Katz had initially developed PKARC, a program that improved upon ARC's compression but allegedly incorporated elements of its code and interface, prompting SEA to sue PKWARE in 1988 for copyright infringement, trademark violation, and unfair competition. The settlement required Katz to cease distribution of PKARC, leading him to create the entirely new ZIP format as a replacement. To counter the proprietary restrictions of ARC and promote broader adoption, Katz released the ZIP specification into the public domain later that year, allowing developers worldwide to implement it freely without licensing fees. This move shifted file compression from a controlled, litigious landscape to an open ecosystem that accelerated innovation in data archiving.

ZIP revolutionized file distribution in the pre-broadband era, particularly through bulletin board systems (BBSes) and early protocols like FTP. By the early 1990s, its efficient compression, which reduced file sizes by up to 50% or more for typical data, made it the de facto standard for sharing software, documents, and binaries over limited bandwidth connections, supplanting ARC entirely within the BBS community. FTP archives on academic and public servers proliferated with ZIP files during the decade, enabling the rapid exchange of software that fueled the growth of online communities. This ubiquity extended to file sharing in the late 1990s and early 2000s, where ZIP archives became a staple for bundling music collections and other media, facilitating the rise of digital content distribution before specialized formats like torrent emerged.

The format's influence permeated software development, serving as the foundational container for several high-impact standards. Java's JAR (Java Archive) files, introduced in 1997, are ZIP archives augmented with metadata for applets and applications, leveraging ZIP for cross-platform portability. Microsoft's Office Open XML (OOXML) formats, adopted in 2007 for documents like DOCX, encapsulate XML content and media within a ZIP container under the Open Packaging Conventions, enabling smaller, more interoperable files than proprietary predecessors. Similarly, the EPUB standard for e-books, finalized by the International Digital Publishing Forum in 2007, uses ZIP as its packaging mechanism to bundle content documents, images, and metadata into a single, reflowable file. ZIP's dominance also spurred competitors: RAR (1993) by Eugene Roshal offered superior compression for multimedia, while 7z (1999) from the 7-Zip project emphasized open-source efficiency, both emerging as alternatives to address ZIP's limitations in ratio and speed.

Culturally, ZIP permeated everyday computing lexicon, with "zip" evolving into a verb synonymous with file compression by the mid-1990s, as in "zip up that folder" to prepare files for transfer. Its role in the file-sharing boom of the 1990s and 2000s, from BBS uploads to early web downloads, made it indispensable for distributing zipped playlists and software bundles, embedding it in the digital culture of the internet's formative years. Key milestones underscore this trajectory: 1989 marked PKZIP's debut as shareware; PKWARE products came to be used by more than 90% of Fortune 100 companies; and ZIP became a cornerstone of file compression across operating systems like Windows and macOS.

Modern Usage and Concerns

The ZIP file format remains a cornerstone of modern file management, serving as the default for native compression tools across major operating systems. In Windows, built-in support allows users to compress and extract ZIP files directly through File Explorer without additional software, facilitating everyday tasks like bundling documents for sharing. Similarly, macOS integrates ZIP handling via the Archive Utility, which automatically extracts archives when double-clicked and enables compression by right-clicking files or folders. This ubiquity extends to mobile ecosystems, where Android application packages (APKs) are structured as ZIP-based archives containing app resources, manifests, and code, essential for software distribution via platforms like Google Play.

Beyond operating systems, ZIP plays a vital role in backups and software distribution due to its reliability and broad compatibility. It is commonly used for creating portable archives in backup utilities, reducing storage needs while preserving file integrity across devices. In software delivery, ZIP's structure supports efficient packaging, as seen in Android APKs, which bundle executable code and assets into a single, compressible file for seamless installation. Despite these strengths, alternatives like tar dominate in Unix-like environments for their seamless integration with command-line tools and historical ties to tape archiving, offering robust handling of large datasets without proprietary dependencies. Formats such as 7z provide superior compression ratios (often 30-70% better than ZIP for certain file types), making them preferable for space-constrained scenarios, yet ZIP's cross-platform universality keeps it dominant in mixed ecosystems where interoperability is prioritized over maximal efficiency.

Contemporary concerns highlight ZIP's aging design in evolving digital landscapes. Vulnerabilities like ZIP slip, a path traversal flaw enabling arbitrary file overwrites during extraction, persist in cloud services; for instance, it has affected tools from AWS and HP, potentially allowing remote code execution if unpatched software processes untrusted archives. Legacy support for outdated compression methods and compatibility quirks across operating systems contributes to software bloat, as developers must maintain extensive code to handle variations in ZIP interpretation, increasing application complexity without proportional benefits. Additionally, there is a gradual shift toward specialized container formats like the OpenDocument Format (ODF), which uses ZIP as its underlying structure but adds metadata layers for standardized document interchange in office suites.

Looking ahead, the ZIP specification remains stable, with the APPNOTE document most recently revised in 2022 (version 6.3.10) but no major structural updates since the ZIP64 extensions addressed size limits over two decades ago. Efforts to integrate modern standards, such as AES-256, have bolstered security, allowing ZIP to align with contemporary cryptographic practices while maintaining backward compatibility. In data centers, ZIP's compression capabilities reduce storage footprints and transmission volumes. However, ZIP's role in web downloads, while still prevalent for smaller bundles, is declining relative to streaming protocols for large-scale transfer, reflecting broader trends toward real-time access over archived packaging. Recent advisories, such as CVE-2025-54368 affecting ZIP-handling tools as of August 2025, underscore the ongoing need for vigilant patching of extraction vulnerabilities.
