Tar (computing)
| tar | |
|---|---|
| Original author | Bell Laboratories |
| Developers | Various open-source and commercial developers |
| Initial release | January 1979 |
| Stable release(s) | |
| Written in | pdtar, star, Plan 9, GNU: C |
| Operating system | Unix, Unix-like, Plan 9, Microsoft Windows, IBM i |
| Platform | Cross-platform |
| Type | Command |
| License | BSD tar: BSD-2-Clause GNU tar: GPL-3.0-or-later pdtar: Public domain Plan 9: MIT star: CDDL-1.0 |
| tar | |
|---|---|
| Filename extension | .tar |
| Internet media type | application/x-tar |
| Uniform Type Identifier (UTI) | public.tar-archive |
| Magic number | "ustar\0" followed by "00", at byte offset 257 (for POSIX versions) |
| Latest release | various |
| Type of format | File archiver |
| Standard | POSIX since POSIX.1, presently in the definition of pax[1] |
| Open format? | Yes |
In computing, tar is a shell command for combining multiple computer files into a single archive file. It was originally developed for magnetic tape storage – reading and writing data for a sequential I/O device with no file system, and the name is short for the format description "tape archive". When stored in a file system, a file that tar reads and writes is often called a tarball.
A tarball contains metadata for the contained files including the name, ownership, timestamps, permissions and directory organization. As a file containing other files with associated metadata, a tarball is useful for software distribution and backup.
POSIX abandoned tar in favor of pax, yet tar continues to have widespread use.
History
The command was introduced to Unix in January 1979, replacing the tp program (which in turn replaced "tap").[7] The file structure was standardized in POSIX.1-1988[8] and later POSIX.1-2001,[9] and became a format supported by most modern file archiving utilities. The tar command had been marked for withdrawal in favor of pax since at least 1994, and it was dropped from POSIX.1-2001 in favor of pax, which supports the ustar file format. Nonetheless, many operating systems today include tools for tar files, as well as tools to compress and decompress them, such as xz, gzip, and bzip2.
The tar command was ported to the IBM i operating system.[10]
BSD-tar has been in Windows since 2018,[11][12] and there are other third-party tools available for Windows.
Rationale
Many historic tape drives read and write variable-length data blocks, leaving significant wasted space on the tape between blocks (for the tape to physically start and stop moving). Some tape drives (and raw disks) support only fixed-length data blocks. Also, when writing to any medium such as a file system or network, it takes less time to write one large block than many small blocks. Therefore, the tar command writes data in records of many 512 B blocks. The user can specify a blocking factor, which is the number of blocks per record. The default is 20, producing 10 KiB records.[13]
File format
There are multiple tar file formats, including historical and current ones. Two tar formats are codified in POSIX: ustar and pax. Not codified but still in current use is the GNU tar format.
A tar archive consists of a series of file objects, hence the popular term tarball, referencing how a tarball collects objects of all kinds that stick to its surface. Each file object includes any file data, and is preceded by a 512-byte header record. The file data is written unaltered except that its length is rounded up to a multiple of 512 bytes. The original tar implementation did not care about the contents of the padding bytes, and left the buffer data unaltered, but most modern tar implementations fill the extra space with zeros.[14] The end of an archive is marked by at least two consecutive zero-filled records. (The origin of tar's record size appears to be the 512-byte disk sectors used in the Version 7 Unix file system.) The final block of an archive is padded out to full length with zeros.
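The block and record arithmetic described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every member is a regular file with exactly one 512-byte header (no extended headers), and the finished archive is padded to a whole record. The function names are made up for this example.

```python
BLOCK_SIZE = 512       # bytes per tar block
BLOCKING_FACTOR = 20   # default number of blocks per record


def padded_size(n: int, unit: int = BLOCK_SIZE) -> int:
    """Round n up to the next multiple of unit."""
    return -(-n // unit) * unit


def archive_size(file_sizes, blocking_factor: int = BLOCKING_FACTOR) -> int:
    """Estimate the size of an archive holding regular files of the given sizes."""
    record_size = BLOCK_SIZE * blocking_factor
    total = 0
    for size in file_sizes:
        total += BLOCK_SIZE          # one header block per file (assumption)
        total += padded_size(size)   # file data, rounded up to whole blocks
    total += 2 * BLOCK_SIZE          # two zero-filled blocks mark the end
    return padded_size(total, record_size)


print(BLOCK_SIZE * BLOCKING_FACTOR)  # 10240, the default 10 KiB record
print(archive_size([1]))             # 10240: even a 1-byte file fills one record
print(archive_size([10240]))         # 20480: header + data + end marker spill into a second record
```

The second result shows why very small archives are still several kibibytes long: the end marker and record padding dominate.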
Header
The file header record contains metadata about a file. To ensure portability across different architectures with different byte orderings, the information in the header record is encoded in ASCII. Thus if all the files in an archive are ASCII text files, and have ASCII names, then the archive is essentially an ASCII text file (containing many NUL characters).
The fields defined by the original Unix tar format are listed in the table below. The link indicator/file type table includes some modern extensions. When a field is unused it is filled with NUL bytes. The header uses 257 bytes and is then padded with NUL bytes to fill a 512-byte record. The header contains no "magic number" for file identification.
Pre-POSIX.1-1988 (i.e. v7) tar header:
| Field offset | Field size | Field |
|---|---|---|
| 0 | 100 | File path and name |
| 100 | 8 | File mode (octal) |
| 108 | 8 | Owner's numeric user ID (octal) |
| 116 | 8 | Group's numeric group ID (octal) |
| 124 | 12 | File size in bytes (octal) |
| 136 | 12 | Last modification time in numeric Unix time format (octal) |
| 148 | 8 | Checksum for header record |
| 156 | 1 | Link indicator (file type) |
| 157 | 100 | Name of linked file |
The pre-POSIX.1-1988 Link indicator field can have the following values:
| Value | Meaning |
|---|---|
| '0' or (ASCII NUL) | Normal file |
| '1' | Hard link |
| '2' | Symbolic link |
Some pre-POSIX.1-1988 tar implementations indicated a directory by having a trailing slash (/) in the name.
Numeric values are encoded as octal numbers using ASCII digits, with leading zeroes. Whilst this choice may seem counterintuitive in modern times when the four bit hexadecimal notation is generally preferred because register sizes and address widths in computers are almost always powers of two and bytes have standardized to be octets, Unix was originally developed for the PDP-7, which uses an 18-bit CPU and a six-bit character code, making the three bit octal notation more desirable.
For historical reasons, a final NUL or space character should also be used. Thus although there are 12 bytes reserved for storing the file size, only 11 octal digits can be stored. This gives a maximum file size of 8 gigabytes on archived files. To overcome this limitation, in 2001 star introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field.[citation needed] GNU-tar and BSD-tar followed this idea. Additionally, versions of tar from before the first POSIX standard from 1988 pad the values with spaces instead of zeroes.
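As a rough illustration of the numeric encoding just described, the following Python sketch decodes a header field and encodes a value in the classic octal form. The function names are hypothetical. For the base-256 extension the text only says that the high-order bit of the leftmost byte is set; reading the remaining bits as a big-endian unsigned integer is an assumption of this sketch, and negative values are not handled.

```python
def decode_numeric(field: bytes) -> int:
    """Decode a numeric tar header field (octal text or base-256)."""
    if field and field[0] & 0x80:
        # Base-256 extension, flagged by the high-order bit of the first byte.
        # Assumption: the remaining bits are a big-endian unsigned integer.
        value = field[0] & 0x7F
        for byte in field[1:]:
            value = (value << 8) | byte
        return value
    # Classic form: ASCII octal digits padded with zeros (or spaces in
    # pre-POSIX archives), terminated by a NUL or a space.
    digits = field.split(b"\0", 1)[0].strip(b" ")
    return int(digits, 8) if digits else 0


def encode_octal(value: int, width: int) -> bytes:
    """Encode value as zero-padded octal digits followed by a NUL."""
    digits = width - 1  # one byte is reserved for the terminator
    if value < 0 or value >= 8 ** digits:
        raise ValueError("value does not fit in %d octal digits" % digits)
    return f"{value:0{digits}o}".encode("ascii") + b"\0"


print(encode_octal(0o644, 8))   # b'0000644\x00'
print(8 ** 11 - 1)              # 8589934591, the largest size in a 12-byte field
print(8 ** 11 == 8 * 1024 ** 3) # True: 11 octal digits cover just under 8 GiB
```

The same arithmetic gives the ID limit of 8-byte fields: seven octal digits hold values up to 8^7 − 1 = 2097151.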
The checksum is calculated by taking the sum of the unsigned byte values of the header record with the eight checksum bytes taken to be ASCII spaces (decimal value 32). It is stored as a six digit octal number with leading zeroes followed by a NUL and then a space. Various implementations do not adhere to this format. In addition, some historic tar implementations treated bytes as signed. Implementations typically calculate the checksum both ways, and treat it as good if either the signed or unsigned sum matches the included checksum.
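The checksum rule can be written down directly. The sketch below assumes the field offsets from the header table (checksum at offset 148, 8 bytes long). It computes both the unsigned and the signed sum and accepts either, as the text says implementations typically do. The helper names are invented for this example.

```python
CHECKSUM_OFFSET = 148
CHECKSUM_LENGTH = 8


def header_checksums(header: bytes):
    """Return (unsigned_sum, signed_sum) for a 512-byte header block."""
    if len(header) != 512:
        raise ValueError("a tar header is exactly 512 bytes")
    end = CHECKSUM_OFFSET + CHECKSUM_LENGTH
    # The checksum field itself is counted as eight ASCII spaces.
    blanked = header[:CHECKSUM_OFFSET] + b" " * CHECKSUM_LENGTH + header[end:]
    unsigned_sum = sum(blanked)
    signed_sum = sum(b - 256 if b > 127 else b for b in blanked)
    return unsigned_sum, signed_sum


def checksum_ok(header: bytes) -> bool:
    """Accept the header if the stored value matches either sum."""
    end = CHECKSUM_OFFSET + CHECKSUM_LENGTH
    stored = header[CHECKSUM_OFFSET:end].split(b"\0", 1)[0].strip(b" ")
    try:
        value = int(stored, 8)
    except ValueError:
        return False
    return value in header_checksums(header)


def store_checksum(header: bytes) -> bytes:
    """Write the unsigned sum as six octal digits, a NUL and a space."""
    unsigned_sum, _ = header_checksums(header)
    field = f"{unsigned_sum:06o}".encode("ascii") + b"\0 "
    end = CHECKSUM_OFFSET + CHECKSUM_LENGTH
    return header[:CHECKSUM_OFFSET] + field + header[end:]


# A header containing only a name, with every other field left as NUL bytes.
demo = store_checksum(b"hello.txt".ljust(512, b"\0"))
print(checksum_ok(demo))                 # True
print(checksum_ok(b"X" + demo[1:]))      # False: one byte changed
```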
Unix filesystems support multiple links (names) for the same file. If several such files appear in a tar archive, only the first one is archived as a normal file; the rest are archived as hard links, with the "name of linked file" field set to the first one's name. On extraction, such hard links should be recreated in the file system.
UStar format
Most modern tar programs read and write archives in the UStar (Unix Standard TAR[7][15]) format, introduced by the POSIX IEEE P1003.1 standard from 1988. It introduced additional header fields. Older tar programs will ignore the extra information (possibly extracting partially named files), while newer programs will test for the presence of the "ustar" string to determine if the new format is in use.
The UStar format allows for longer file names and stores additional information about each file. The maximum filename size is 255, but it is split among a preceding path "filename prefix" and the filename itself, so can be much less.[16]
| Field offset | Field size | Field |
|---|---|---|
| 0 | 156 | (Several fields, same as in the old format) |
| 156 | 1 | Type flag |
| 157 | 100 | (Same field as in the old format) |
| 257 | 6 | UStar indicator, "ustar", then NUL |
| 263 | 2 | UStar version, "00" |
| 265 | 32 | Owner user name |
| 297 | 32 | Owner group name |
| 329 | 8 | Device major number |
| 337 | 8 | Device minor number |
| 345 | 155 | Filename prefix |
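The two field tables translate almost mechanically into a fixed-layout parser. The following Python sketch takes the field widths from the tables and adds 12 bytes of padding to reach 512; the function and key names are illustrative only. It joins the filename prefix and the name with a slash when the "ustar" indicator is present, and does not handle GNU or pax extensions.

```python
import struct

# name, mode, uid, gid, size, mtime, checksum, type flag, link name,
# ustar indicator, version, user name, group name, dev major, dev minor,
# filename prefix, then 12 padding bytes.
USTAR_HEADER = struct.Struct("100s8s8s8s12s12s8s1s100s6s2s32s32s8s8s155s12x")


def _text(field: bytes) -> str:
    """Decode a NUL-terminated text field."""
    return field.split(b"\0", 1)[0].decode("utf-8", "replace")


def _octal(field: bytes) -> int:
    """Decode an octal field padded with zeros or spaces."""
    digits = field.split(b"\0", 1)[0].strip(b" ")
    return int(digits, 8) if digits else 0


def parse_ustar_header(block: bytes) -> dict:
    """Split a 512-byte header block into its fields."""
    (name, mode, uid, gid, size, mtime, _checksum, typeflag, linkname,
     magic, _version, uname, gname, _devmajor, _devminor,
     prefix) = USTAR_HEADER.unpack(block)
    is_ustar = magic == b"ustar\0"
    path = _text(name)
    if is_ustar and _text(prefix):
        path = _text(prefix) + "/" + path   # prefix and name form one path
    return {
        "path": path,
        "mode": _octal(mode),
        "uid": _octal(uid),
        "gid": _octal(gid),
        "size": _octal(size),
        "mtime": _octal(mtime),
        "typeflag": "0" if typeflag == b"\0" else typeflag.decode("latin-1"),
        "linkname": _text(linkname),
        "is_ustar": is_ustar,
        "uname": _text(uname) if is_ustar else "",
        "gname": _text(gname) if is_ustar else "",
    }


print(USTAR_HEADER.size)  # 512
```

For headers without the "ustar" indicator the parser falls back to the old fields only, which mirrors how older programs ignore the extra information.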
The type flag field can have the following values:
| Value | Meaning |
|---|---|
| '0' or (ASCII NUL) | Normal file |
| '1' | Hard link |
| '2' | Symbolic link |
| '3' | Character special |
| '4' | Block special |
| '5' | Directory |
| '6' | FIFO |
| '7' | Contiguous file |
| 'g' | Global extended header with meta data (POSIX.1-2001) |
| 'x' | Extended header with metadata for the next file in the archive (POSIX.1-2001) |
| 'A'–'Z' | Vendor specific extensions (POSIX.1-1988) |
| All other values | Reserved for future standardization |
Some of the POSIX.1-1988 vendor-specific extensions using link flag values 'A'–'Z' have different meanings for different vendors; they are therefore regarded as outdated and have been replaced by the POSIX.1-2001 extensions, which also include a vendor tag.
Type '7' (Contiguous file) is formally marked as reserved in the POSIX standard, but was meant to indicate files which ought to be contiguously allocated on disk. Few operating systems support creating such files explicitly, and hence most TAR programs do not support them, and will treat type 7 files as if they were type 0 (regular). An exception is older versions of GNU tar, when running on the MASSCOMP RTU (Real Time Unix) operating system, which supported an O_CTG flag to the open() function to request a contiguous file; however, that support was removed from GNU tar version 1.24 onwards.
POSIX.1-2001/pax
In 1997, Sun proposed a method for adding extensions to the tar format. This method was later accepted for the POSIX.1-2001 standard. This format is known as extended tar format or pax format. The new tar format allows users to add any type of vendor-tagged vendor-specific enhancements. The following tags are defined by the POSIX standard:
- atime, mtime: all timestamps of a file in arbitrary resolution (most implementations use nanosecond granularity)
- path: path names of unlimited length and character set coding
- linkpath: symlink target names of unlimited length and character set coding
- uname, gname: user and group names of unlimited length and character set coding
- size: files with unlimited size (the historic tar format is limited to 8 GB)
- uid, gid: user ID and group ID without size limitation (the historic tar format is limited to a maximum ID of 2097151)
- a character set definition for path names and user/group names (UTF-8)
In 2001, the Star program became the first tar to support the new format.[citation needed] In 2004, GNU tar added support for the new format,[17] though it does not yet write it by default.[18]
The pax format is designed so that all implementations able to read the UStar format will be able to read the pax format as well. The only exceptions are files that make use of extended features, such as longer file names. For compatibility, these are encoded in the tar files as special x or g type files, typically under a PaxHeaders.XXXX directory.[19]: exthdr.name A pax-supporting implementation would make use of the information, while non-supporting ones like 7-Zip would process them as additional files.[20]
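The effect of an extended header can be observed with Python's tarfile module, which the article lists among the implementations. The sketch below writes a member whose path is too long for the UStar name and prefix fields, then checks that the first header block in the archive carries type flag 'x' and that the full path survives a round trip. The member name and payload are made up for the demonstration; other implementations may lay out the extended header entry differently.

```python
import io
import tarfile

# A path well beyond the 255 characters that UStar can store.
long_name = "/".join(["directory-with-a-long-name"] * 12) + "/file.txt"
payload = b"hello"

buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w", format=tarfile.PAX_FORMAT) as archive:
    info = tarfile.TarInfo(long_name)
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

raw = buffer.getvalue()
first_typeflag = raw[156:157]   # type flag of the very first header block

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as archive:
    member = archive.getmembers()[0]
    restored_name = member.name
    pax_keys = sorted(member.pax_headers)

print(len(long_name))               # 332
print(first_typeflag)               # b'x': an extended header precedes the file
print(restored_name == long_name)   # True
```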
Features of the archival utilities
Besides creating and extracting archives, the functionality of the various archival utilities varies. For example, implementations might automatically detect the format of compressed TAR archives for extraction so the user does not have to specify it, and let the user limit adding files to those modified after a specified date.[21][22]
Uses
Command syntax
tar [-options] <name of the tar archive> [files or directories to add to the archive]
Basic options:
- -c, --create: create a new archive;
- -a, --auto-compress: additionally compress the archive with a compressor chosen automatically from the archive's file name extension (gzip for *.tar.gz, xz for *.tar.xz, Zstandard for *.tar.zst, and so on);
- -r, --append: append files to the end of an archive;
- -x, --extract, --get: extract files from an archive;
- -f, --file: specify the archive's name;
- -t, --list: show a list of files and folders in the archive;
- -v, --verbose: show a list of processed files.
Basic usage
Create an archive file archive.tar from the file README.txt and directory src:
$ tar -cvf archive.tar README.txt src
Extract the contents of archive.tar into the current directory:
$ tar -xvf archive.tar
Create an archive file archive.tar.gz from the file README.txt and directory src and compress it with gzip:
$ tar -cavf archive.tar.gz README.txt src
Extract the contents of archive.tar.gz into the current directory:
$ tar -xvf archive.tar.gz
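For scripts that cannot rely on a tar binary, roughly the same steps can be carried out with Python's tarfile module. The following is a sketch rather than an exact equivalent of the commands above: it works in a temporary directory, the sample file contents are invented, and the "data" extraction filter is only requested where the running Python version provides it.

```python
import pathlib
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    work = pathlib.Path(tmp)
    (work / "README.txt").write_text("read me\n")
    (work / "src").mkdir()
    (work / "src" / "main.c").write_text("int main(void) { return 0; }\n")

    archive_path = work / "archive.tar.gz"

    # Roughly: tar -czf archive.tar.gz README.txt src
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(work / "README.txt", arcname="README.txt")
        archive.add(work / "src", arcname="src")

    out = work / "out"
    out.mkdir()
    with tarfile.open(archive_path, "r:*") as archive:  # compression is detected
        # Roughly: tar -tf archive.tar.gz
        listed = archive.getnames()
        # Roughly: tar -xf archive.tar.gz -C out
        options = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(out, **options)

    extracted = (out / "src" / "main.c").read_text()

print(listed)  # ['README.txt', 'src', 'src/main.c']
```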
Tarpipe
A tarpipe is the method of writing an archive to standard output and piping it to another tar process on its standard input, working in another directory, where it is unpacked. This process copies an entire source directory tree including all special files, for example:
$ tar cf - srcdir | tar x -C destdir
Software distribution
The tar format continues to be used extensively for open-source software distribution. *NIX-distributions use it in various source- and binary-package distribution mechanisms, with most software source code made available in compressed tar archives.[citation needed]
Limitations
The original tar format was created in the early days of Unix, and despite current widespread use, many of its design features are considered dated.[23]
Other formats have been created to address the shortcomings of tar.
File names
Due to the field size, the original TAR format was unable to store file paths and names in excess of 100 characters.
To overcome this problem while maintaining readability by existing TAR utilities, GNU tar stores file paths and names longer than 100 characters in @LongLink entries that would be seen as ordinary files by TAR utilities unaware of this feature.[24] Similarly, the PAX format uses PaxHeaders entries.[25]
Attributes
Many older tar implementations neither record nor restore extended attributes (xattrs) or access-control lists (ACLs). In 2001, Star introduced support for ACLs and extended attributes, through its own tags for POSIX.1-2001 pax. bsdtar uses the star extensions to support ACLs.[26] More recent versions of GNU tar support Linux extended attributes, reimplementing star extensions.[27] A number of extensions are reviewed in the filetype manual for BSD tar, tar(5).[26]
Tarbomb
A tarbomb, in hacker slang, is a tarball containing a large number of items whose contents are written to the current directory or some other existing directory when untarred instead of the directory created by the tarball specifically for the extracted outputs.[28] It is at best an inconvenience to the user, who is obliged to identify and delete a number of files interspersed with the directory's other contents. Such behavior is considered bad etiquette on the part of the archive's creator.
A related problem is the use of absolute paths or parent directory references when creating tar files. Files extracted from such archives will often be created in unusual locations outside the working directory and, like a tarbomb, have the potential to overwrite existing files. However, modern versions of FreeBSD and GNU tar do not create or extract absolute paths and parent-directory references by default, unless it is explicitly allowed with the flag -P or the option --absolute-names. The bsdtar program, which is also available on many operating systems and is the default tar implementation on Mac OS X v10.6, also does not follow parent-directory references or symbolic links.[29][failed verification]
If a user has only a very old tar available, which does not feature those security measures, these problems can be mitigated by first examining a tar file using the command tar tf archive.tar, which lists the contents so that problematic files can be excluded afterwards. This command does not extract any files, but displays the names of all files in the archive. If any are problematic, the user can create a new empty directory and extract the archive into it, or avoid the tar file entirely. Most graphical tools can display the contents of the archive before extracting them. Vim can open tar archives and display their contents. GNU Emacs is also able to open a tar archive and display its contents in a dired buffer.
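The same inspect-before-extracting idea can be automated. The sketch below uses Python's tarfile module to list member names without extracting anything, flags absolute paths and parent-directory references, and reports the distinct top-level entries (many top-level entries suggest a tarbomb). The function names and the sample member names are invented for this example, and the check is deliberately simple: it does not examine symbolic link targets.

```python
import io
import tarfile


def inspect_names(names):
    """Return (unsafe_paths, top_level_entries) for a list of member names."""
    unsafe = []
    top_level = set()
    for name in names:
        parts = [part for part in name.split("/") if part not in ("", ".")]
        if name.startswith("/") or ".." in parts:
            unsafe.append(name)       # absolute path or parent reference
        if parts:
            top_level.add(parts[0])
    return unsafe, sorted(top_level)


def inspect_archive(fileobj):
    """List an archive's members (like tar tf) and inspect their names."""
    with tarfile.open(fileobj=fileobj, mode="r") as archive:
        return inspect_names(archive.getnames())


def demo_archive(names):
    """Build a small in-memory archive of empty files for the demonstration."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in names:
            archive.addfile(tarfile.TarInfo(name))
    buffer.seek(0)
    return buffer


print(inspect_archive(demo_archive(["proj/a.txt", "proj/b.txt"])))
print(inspect_archive(demo_archive(["a.txt", "b.txt", "../evil"])))
```

A well-behaved archive yields no unsafe paths and a single top-level entry; the second archive reports "../evil" as unsafe and three top-level entries.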
Random access
Because the tar format was designed for streaming to tape backup devices, it has no centralized index or table of contents for files and their properties. Instead, the metadata for each file (such as name, size, and time stamps) is stored in a header before that file's data. The archive must be read sequentially to list or extract files. For large tar archives, this causes a performance penalty, making tar archives unsuitable for situations that often require random access to individual files.
In turn, this design makes TAR archives resilient against damage from missing portions, in both the form of digital files and physical tape.[citation needed] A truncated TAR file with missing parts on either end still allows recovering the parts that are not missing, including the file paths, file names and metadata, by starting from the first TAR header that is not missing.[30]
With a well-formed tar file stored on a seekable medium (i.e. one that allows efficient random reads), the tar program can still look for a file relatively quickly (in linear time relative to file count) by skipping over file data according to the "size" field in the file headers. This is the basis for option -n in GNU tar. When a tar file is compressed whole, the compression format, being usually non-seekable, prevents this optimization.[31] To maintain seekability, tar files must also be concatenated properly, by removing the trailing zero block at the end of each file.[32]
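Skipping by the size field can be sketched as follows. The Python generator below reads only the 512-byte headers of an uncompressed archive on a seekable file object and seeks past each member's data. It is a simplified illustration with a made-up function name: it uses only the 100-byte name field and the octal size field, stops at the first zero-filled block, and ignores filename prefixes, extended headers and base-256 sizes.

```python
import io

BLOCK = 512


def list_members(fileobj):
    """Yield (name, size) pairs by hopping from header to header."""
    while True:
        header = fileobj.read(BLOCK)
        if len(header) < BLOCK or header == bytes(BLOCK):
            break  # truncated archive or end-of-archive marker
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        digits = header[124:136].split(b"\0", 1)[0].strip(b" ")
        size = int(digits, 8) if digits else 0
        yield name, size
        # Skip the data blocks (size rounded up to whole blocks) unread.
        fileobj.seek(-(-size // BLOCK) * BLOCK, io.SEEK_CUR)
```

Usage might look like `with open("archive.tar", "rb") as f: print(list(list_members(f)))`, where archive.tar stands for any uncompressed archive. The loop still visits every header, so the cost grows with the number of members, but the file contents themselves are never read.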
Duplicates
Another issue with the tar format is that it allows several (possibly different) files in an archive to have identical paths and filenames. When such an archive is extracted, the later version of a file usually overwrites the earlier one.
This can create a non-explicit (unobvious) tarbomb, which technically does not contain files with absolute paths or references to parent directories, but still causes files outside the current directory to be overwritten. For example, an archive may contain two files with the same path and filename, the first of which is a symlink to some location outside the current directory and the second of which is a regular file; extracting such an archive with some tar implementations may cause writing to the location pointed to by the symlink.
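Duplicate member names are easy to detect before extraction. A minimal sketch using Python's tarfile module follows; the function name and sample names are hypothetical.

```python
import io
import tarfile
from collections import Counter


def duplicate_names(archive: tarfile.TarFile):
    """Return the member names that occur more than once in the archive."""
    counts = Counter(member.name for member in archive.getmembers())
    return sorted(name for name, count in counts.items() if count > 1)


# Demonstration: an in-memory archive in which "config" appears twice.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    for name in ("config", "data", "config"):
        archive.addfile(tarfile.TarInfo(name))
buffer.seek(0)

with tarfile.open(fileobj=buffer, mode="r") as archive:
    duplicates = duplicate_names(archive)

print(duplicates)  # ['config']
```

An archive that reports duplicates, especially one where an earlier entry is a symbolic link, deserves a closer look before it is unpacked.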
Key implementations
Historically, many systems have implemented tar, and many general file archivers have at least partial support for tar (often using one of the implementations below). The history of tar is a story of incompatibilities, known as the "tar wars". Most tar implementations can also read and create cpio and pax (the latter actually is a tar-format with POSIX-2001-extensions).
Key implementations in order of origin:
- Solaris tar, based on the original Unix V7 tar, comes as the default on the Solaris operating system
- GNU tar is the default on most Linux distributions. It is based on the public domain implementation pdtar which started in 1987. Recent versions can use various formats, including ustar, pax, GNU and v7 formats.
- FreeBSD tar (also BSD tar) has become the default tar on most Berkeley Software Distribution-based operating systems including Mac OS X. The core functionality is available as libarchive for inclusion in other applications. This implementation automatically detects the format of the file and can extract from tar, pax, cpio, zip, rar, ar, xar, rpm and ISO 9660 cdrom images. It also comes with a functionally equivalent cpio command-line interface.
- Schily tar, better known as star (/ˈɛsˌtɑːr/, ESS-tar),[33] is historically significant as some of its extensions were quite popular. First published in April 1997,[34] its developer has stated that he began development in 1982.[35]
- Python tarfile module supports multiple tar formats, including ustar, pax and gnu; it can read but not create V7 format and the SunOS tar extended format; pax is the default format for creation of archives.[36] Available since 2003.[37]
Additionally, most pax and cpio implementations can read and create multiple types of tar files.
Suffixes for compressed files
tar archive files usually have the file suffix .tar (e.g. somefile.tar).
A tar archive file contains uncompressed byte streams of the files which it contains. To achieve archive compression, a variety of compression programs are available, such as gzip, bzip2, xz, lzip, lzma, zstd, or compress, which compress the entire tar archive. Typically, the compressed form of the archive receives a filename by appending the format-specific compressor suffix to the archive file name. For example, a tar archive archive.tar is named archive.tar.gz when it is compressed by gzip.
Popular tar programs like the BSD and GNU versions of tar support the command-line options Z (compress), z (gzip), and j (bzip2) to compress or decompress the archive file upon creation or unpacking. Relatively recent additions include --lzma (LZMA), --lzop (lzop), --xz or J (xz), --lzip (lzip), and --zstd.[38] The decompression of these formats is handled automatically if supported filename extensions are used, and compression is handled automatically using the same filename extensions if the option --auto-compress (short form -a) is passed to an applicable version of GNU tar.[16] BSD tar detects an even wider range of compressors (lrzip, lz4), using not the filename but the data within.[39] Unrecognized formats are to be manually compressed or decompressed by piping.
MS-DOS's 8.3 filename limitations resulted in additional conventions for naming compressed tar archives. However, this practice has declined with FAT now offering long filenames.

| Compressor | Long | Short |
|---|---|---|
| bzip2 | .tar.bz2 | .tb2, .tbz, .tbz2, .tz2 |
| gzip | .tar.gz | .taz, .tgz |
| lzip | .tar.lz | |
| lzma | .tar.lzma | .tlz |
| lzop | .tar.lzo | |
| xz | .tar.xz | .txz |
| compress | .tar.Z | .tZ, .taZ |
| zstd | .tar.zst | .tzst |
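The suffix table lends itself to a simple lookup, similar in spirit to choosing a compressor from the file name. The Python sketch below copies the suffixes from the table; the function name is made up, and the matching is case-sensitive on purpose, because .taz (gzip) and .taZ (compress) differ only in case.

```python
SUFFIXES = {
    "bzip2": (".tar.bz2", ".tb2", ".tbz", ".tbz2", ".tz2"),
    "gzip": (".tar.gz", ".taz", ".tgz"),
    "lzip": (".tar.lz",),
    "lzma": (".tar.lzma", ".tlz"),
    "lzop": (".tar.lzo",),
    "xz": (".tar.xz", ".txz"),
    "compress": (".tar.Z", ".tZ", ".taZ"),
    "zstd": (".tar.zst", ".tzst"),
}


def compressor_for(filename: str):
    """Return the compressor implied by the file name, or None."""
    for compressor, suffixes in SUFFIXES.items():
        if filename.endswith(suffixes):
            return compressor
    return None


print(compressor_for("backup.tar.gz"))  # gzip
print(compressor_for("backup.taZ"))     # compress
print(compressor_for("backup.tar"))     # None: plain, uncompressed archive
```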
References
- "libarchive - C library and command-line tools for reading and writing tar, cpio, zip, ISO, and other archive formats @ GitHub". www.libarchive.org.
- Sergey Poznyakoff (18 July 2023). "tar-1.35 released [stable]". Retrieved 26 July 2023.
- John Gilmore (1986-12-10). "v07i088: Public-domain TAR program". Newsgroup: mod.sources. Archived from the original on 2022-02-07. Retrieved 2022-02-07.
- "posixtar". Archived from the original on 2022-06-27. Retrieved 2022-02-07.
- "star". Archived from the original on 2023-11-12. Retrieved 2023-11-12.
- Gilmore, John; Fenlason, Jay (4 February 2019). "Basic Tar Format". gnu.org. and others. Free Software Foundation. Retrieved 17 April 2019.
- "tar(5) manual page". FreeBSD.org. FreeBSD. 20 May 2004. Retrieved 2 May 2017.
- IEEE Std 1003.1-1988, IEEE Standard for Information Technology - Portable Operating System Interface (POSIX)
- IEEE Std 1003.1-2001, IEEE Standard for Information Technology - Portable Operating System Interface (POSIX)
- IBM. "IBM System i Version 7.2 Programming Qshell" (PDF). IBM. Retrieved 2020-09-05.
- "Announcing Windows 10 Insider Preview Build 17063 for PC". Windows Experience Blog. 2017-12-19. Retrieved 2 July 2018.
- "Tar and Curl Come to Windows!". 2019-03-22.
- "Blocking". ftp.gnu.org. Retrieved 2020-08-26.
- Hoo, James. "Open/Extract TAR File with Freeware on Windows/Mac/Linux". e7z Org. Archived from the original on 6 February 2015. Retrieved 3 September 2019.
- Kientzle, Tim (1995). Internet File Formats. Coriolis Groups Books. p. 196. ISBN 978-1-883577-56-8. Retrieved 2022-11-10.
- "GNU tar 1.32: 8.1 Using Less Space through Compression". GNU. 2019-02-23. Retrieved 2019-08-11.
- NEWS, git.savannah.gnu.org - search for "Added support for POSIX.1-2001 and ustar archive formats."
- "GNU tar 1.34: 8. Controlling the Archive Format". GNU. Retrieved 2022-07-11.
- – Shell and Utilities Reference, The Single UNIX Specification, Version 5 from The Open Group
- "#2116 Tars with pax headers not parsed". 7-Zip / Bugs | SourceForge.
- GNU tar 1.35: 6.8 Operating Only on New Files
- Differences Between BSD tar and GNU tar and star | Baeldung on Linux
- "duplicity: New file format". duplicity.nongnu.org.
- gnu_tar/src/create.c at master · gitGNU/gnu_tar · GitHub, line 546
- src/bin/pax/tar.c at 8df76133309eacd4092b091ee0504adb842322a5 · openbsd/src · GitHub, line 1066
- – FreeBSD File Formats Manual
- "Extended attributes: the good, the not so good, the bad". Les bons comptes. 15 July 2014. Archived from the original on 14 December 2014. Retrieved 3 September 2019. Quote: "The extended attributes can be very valuable for storing file metadata (e.g. author="John Smith", subject="country landscape"), in the many cases where you do not want or can't store this data in the file internal properties."
- "Tarbomb Definition". The Linux Info Project. Retrieved 2024-12-12.
- "bsdtar(1)". man.freebsd.org.
- Creating a TAR with 100 KB missing at the beginning: tail --bytes=+100000 "intact archive.tar" >>"missing beginning.tar". The next header can be found using a hex editor. Recover using dd if="missing beginning.tar" of=recovered.tar ibs=[bytes until next header which starts with file path and name] skip=1. Quotation marks are not needed for file names without spaces.
- BillThor (July 28, 2017). "What makes a tar archive seekable?". Super User. Retrieved 15 December 2023.
- "GNU tar 1.35: 4.2.4 Combining Archives with --concatenate". www.gnu.org.
- Schilling, Jörg. "Star a very fast and Posix 1003.1 compliant tar archiver for UNIX". Archived from the original on 2023-07-09. Retrieved 2023-09-02.
- Thomas E. Dickey (January 4, 2015). "TAR versus Portability: Schily tar". Retrieved October 23, 2021.
- Jörg Schilling (September 4, 2021). "star - unique standard tape archiver". Archived from the original on October 23, 2021. Retrieved October 23, 2021.
- tarfile module, python.org
- tarfile.py, github.com
- Poznyakoff, Sergey (2019-01-02). "tar-1.31 released [stable]". GNU mailing lists. Retrieved 2019-08-06.
- – FreeBSD General Commands Manual
External links
- X/Open CAE Specification Commands and Utilities Issue 4, Version 2 (pdf), 1994, opengroup.org – indicates tar as to be withdrawn
- tar in The Single UNIX Specification, Version 2, 1997, opengroup.org – indicates applications should migrate to pax
- C.4 Utilities in The Open Group Base Specifications Issue 6, 2004 Edition, opengroup.org – indicates tar as removed
- – Shell and Utilities Reference, The Single UNIX Specification, Version 5 from The Open Group – specifies the ustar and pax file formats
- – Version 7 Unix Programmer's Manual
- – manual from GNU
- – Plan 9 Programmer's Manual, Volume 1
- – Solaris 11.4 User Commands Reference Manual
- – FreeBSD General Commands Manual
- – OpenBSD General Commands Manual
- – Linux User Manual – User Commands from Manned.org
- – FreeBSD File Formats Manual
- TAR - Windows CMD - SS64.com
Tar (computing)
c option), extracting files from archives (x option), listing archive contents (t option), appending files (r option), and updating archives (u option), all controlled via command-line options and supporting blocking factors for tape devices.[4] The basic POSIX tar format limits individual file sizes to 8 gigabytes and pathnames to 256 characters, though modern implementations like GNU tar extend this with formats such as GNU-specific headers or the POSIX.1-2001 pax interchange format for larger files and extended attributes.[4][7]
Despite its origins in tape archiving, tar remains a foundational tool in Unix-like operating systems for software distribution, system backups, and data packaging, often combined with compression utilities like gzip (resulting in .tar.gz files) or bzip2 (.tar.bz2) to reduce archive size.[1] The POSIX standard recommends migrating to the pax utility for enhanced portability, but tar continues to be widely used due to its simplicity and backward compatibility.[4]
Background
History
The tar utility originated in the Unix Seventh Edition (V7), released by AT&T Bell Laboratories in January 1979, where it was introduced as a simple tool for creating tape archives as part of the system's backup capabilities, replacing the earlier tp program.[8] It was designed to bundle multiple files into a single archive stream suitable for magnetic tape storage, complementing earlier backup tools like dump, which focused on incremental filesystem dumps.[8] This initial implementation emphasized portability and ease of use for archiving directories and files onto tape devices, marking a shift toward standardized archiving in early Unix environments.[9]
In the 1980s, as Unix variants proliferated beyond AT&T's proprietary systems, the need for freely redistributable software grew. John Gilmore developed a public-domain implementation of tar in late 1987, initially as pdtar, which was posted to Usenet and became highly influential for its clean code and compatibility with emerging standards drafts.[10] This version addressed limitations in proprietary implementations and facilitated adoption in academic and open-source communities, laying the groundwork for further enhancements. Gilmore's pdtar then became the basis for GNU tar, first released in 1988 as part of the GNU Project, which introduced features like multi-volume support to handle archives spanning multiple tapes or disks during the era of limited storage capacities.[11] BSD variants, such as those in 4.3BSD (1986), also incorporated and extended tar with improvements for network file systems and longer pathnames, contributing to its divergence across Unix lineages.[10]
Standardization efforts began in the late 1980s to unify tar's behavior across Unix systems. The utility was included in POSIX.1-1988 (IEEE Std 1003.1-1988), which defined the basic tar format and command interface, including the USTAR (Unix Standard Tape ARchive) format for better support of long filenames, permissions, symbolic links, and device files, ensuring interoperability for core operations like archiving and extraction.[8] Subsequent POSIX revisions expanded on this foundation: POSIX.1-2001 introduced pax extensions for enhanced portability, including global extended headers for attributes beyond the original limits.[8] These standards, maintained through The Open Group, have continued to evolve; POSIX.1-2024 still specifies the ustar and pax archive formats, now as part of the pax utility rather than a tar command.[12]
Rationale
The tar command emerged in early Unix systems to address the need for a straightforward, portable mechanism to bundle multiple files and directories into a single archive, facilitating backups and transfers across limited hardware environments like those at Bell Labs in the late 1970s. This design responded to the practical demands of Unix developers who required a tool capable of handling file collections without relying on complex proprietary formats, ensuring compatibility across different Unix implementations and even non-Unix systems through simple binary streams. By focusing on core archiving functionality, tar enabled efficient storage on sequential media and easy distribution, aligning with the era's resource constraints and emphasis on interoperability.
A key motivation in tar's design was the preservation of essential file metadata, such as permissions, timestamps, ownership details, and directory structures, to allow faithful restoration of files in their original configuration. This capability was vital for system administration tasks, where altering metadata could compromise security or functionality, and it distinguished tar from simpler concatenation tools by providing a reliable way to capture the full context of Unix file system objects. Without such preservation, backups would lose critical attributes, rendering restores incomplete or insecure.
Originally tailored for tape archiving (hence the name "tape archiver"), tar's format was optimized for sequential, appendable operations on magnetic tapes, the predominant backup medium in early computing labs. This choice prioritized streamability over random access, allowing archives to be written and read in a continuous flow suitable for tape drives, while later adaptations extended its use to disk files and network pipes without fundamental changes. The design reflected a deliberate trade-off: simplicity and modularity over integrated features like compression or indexing, encouraging composition with separate tools (e.g., compress or gzip) to maintain a lean core while supporting extensible workflows.
Tar's development was influenced by the shortcomings of predecessor tools, aiming to create more robust, streamable archives that could handle growing file system complexities in evolving Unix versions. By emphasizing portability and appendability, it overcame earlier limitations in backup utilities, establishing a format that remains foundational for Unix-like systems despite shifts in storage technology.
File Format
Basic Header Structure
The basic header structure in a tar archive uses a fixed 512-byte block for each file or directory entry, providing essential metadata in a contiguous, ASCII-encoded format to ensure portability across systems. This block precedes the file's data blocks (if any) and is designed for sequential reading from tape archives, with all fields occupying exact byte positions without variable-length encoding. The structure supports core attributes like permissions, ownership, size, and timestamps, while the remaining bytes after the defined fields are filled with null bytes (0x00) for padding to reach precisely 512 bytes.[6]
Key fields in the header include the filename (bytes 0-99, up to 100 characters, null-terminated if shorter), file mode (bytes 100-107, 8 bytes representing octal permissions), user ID (bytes 108-115, 8 bytes in octal), group ID (bytes 116-123, 8 bytes in octal), file size (bytes 124-135, 12 bytes in octal for the byte length of the file data), modification time (bytes 136-147, 12 bytes in octal as seconds since the Unix epoch), and link name (bytes 157-256, 100 bytes for the target path in case of links). The link indicator (byte 156, 1 byte) is NUL (ASCII 0) or '0' for regular files, '1' for hard links, and '2' for symbolic links. Numeric fields like size, mode, UID, GID, and mtime are encoded as octal strings of printable ASCII digits, padded on the left with zeros (older implementations used spaces) to fill the field, and terminated by a space or null byte. The legacy 100-byte limit on filenames and link names restricts paths to relatively short lengths, often requiring workarounds for longer names in modern use.[6]
The checksum field (bytes 148-155, 8 bytes in octal) protects the header itself. It is computed as the sum of all 512 header bytes treated as unsigned values, with the eight bytes of the checksum field counted as space characters (0x20) rather than their stored contents. The resulting sum is stored in the field as an octal string, conventionally six digits with leading zeros followed by a null byte and a space. A reader repeats the calculation and compares the result with the stored value to detect a corrupted header.[6]
To denote the end of the archive, tar formats require two consecutive 512-byte blocks filled entirely with binary zeros (0x00), serving as an explicit terminator regardless of the number of entries or padding. This marker allows readers to detect the archive's conclusion even if the underlying storage (like tape) ends abruptly.[8]
| Field | Bytes | Length (bytes) | Format | Description |
|---|---|---|---|---|
| name | 0-99 | 100 | ASCII string, null-padded | Filename or directory path |
| mode | 100-107 | 8 | Octal ASCII, zero- or space-padded | File permissions (e.g., 0644) |
| uid | 108-115 | 8 | Octal ASCII, zero- or space-padded | User ID (owner) |
| gid | 116-123 | 8 | Octal ASCII, zero- or space-padded | Group ID |
| size | 124-135 | 12 | Octal ASCII, zero- or space-padded | File size in bytes (0 for directories) |
| mtime | 136-147 | 12 | Octal ASCII, zero- or space-padded | Modification time (Unix timestamp) |
| chksum | 148-155 | 8 | Octal ASCII, zero- or space-padded | Header checksum |
| typeflag | 156 | 1 | ASCII character | Link indicator (NUL or '0' for regular files, '1' for hard links, '2' for symbolic links) |
| linkname | 157-256 | 100 | ASCII string, null-padded | Target path for links (unused for regular files) |
| padding | 257-511 | 255 | Null bytes (0x00) | Unused space |
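The field layout and checksum rule described in this section can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `parse_octal`, `header_checksum`, and `parse_header` are hypothetical, and the sketch assumes plain octal fields (it does not handle the binary size encodings some implementations use for very large values).

```python
def parse_octal(field: bytes) -> int:
    """Decode an octal ASCII field, ignoring NUL and space padding."""
    digits = field.strip(b" \0").decode("ascii")
    return int(digits, 8) if digits else 0


def header_checksum(block: bytes) -> int:
    """Sum all 512 header bytes, counting the chksum field (148-155) as spaces."""
    return sum(block[:148]) + 8 * 0x20 + sum(block[156:512])


def parse_header(block: bytes) -> dict:
    """Split one 512-byte header block into the fields listed in the table."""
    if len(block) != 512:
        raise ValueError("a tar header block is exactly 512 bytes")
    if parse_octal(block[148:156]) != header_checksum(block):
        raise ValueError("header checksum mismatch")

    def text(raw: bytes) -> str:
        # String fields are NUL-padded; keep only the part before the first NUL.
        return raw.split(b"\0", 1)[0].decode("ascii")

    return {
        "name": text(block[0:100]),
        "mode": parse_octal(block[100:108]),
        "uid": parse_octal(block[108:116]),
        "gid": parse_octal(block[116:124]),
        "size": parse_octal(block[124:136]),
        "mtime": parse_octal(block[136:148]),
        "typeflag": block[156:157],
        "linkname": text(block[157:257]),
    }
```

Passing the first 512 bytes of an archive to `parse_header` returns the metadata of its first member, and a single altered header byte makes the checksum comparison fail.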
UStar Format
The UStar format, short for Unix Standard Tape ARchive, was introduced in the POSIX.1-1988 standard to enhance portability of tar archives across Unix systems, addressing limitations in the original format such as short filename lengths and lack of support for user and group names.[8][13] This extension builds on the basic 512-byte header block structure while adding fields to support longer paths and additional metadata, enabling filenames up to 256 characters through a combination of a 100-byte name field and a new 155-byte prefix field that precedes the filename with a slash separator.[14][13]
Key additions in the UStar header include a 6-byte magic field set to "ustar" followed by a null byte, and a 2-byte version field set to "00", which identify the format and ensure recognition by compliant tools.[14][13] It also introduces 32-byte fields for uname (user name) and gname (group name), allowing archival of ownership information beyond numeric IDs, as well as 8-byte octal fields for devmajor and devminor to represent major and minor device numbers for special files like character and block devices.[14][13]
The format supports POSIX device types through type flags in a single-byte field, including '3' for character special files, '4' for block special files, and '7' for contiguous files that can be treated as regular files for improved performance on certain media.[14] Extended attributes, such as access control lists (ACLs), have no dedicated fields; the format only leaves room for future extensions, and ACL support arrived with later extended-header mechanisms.[14]
The checksum is computed as in the original format: an octal sum over all 512 bytes of the header, with the checksum field itself counted as spaces to avoid a circular dependency. Because the sum covers the whole block, the new fields, including the prefix, are protected as well.[14][13]
For backward compatibility with pre-UStar tar readers, the format maintains the core structure and positions the new fields in unused space of the original header, allowing older tools to ignore unknown bytes while still extracting basic file data.[14] This design ensures UStar archives remain readable on legacy systems without requiring format conversion.[8]
| Field Name | Offset (bytes) | Length (bytes) | Description |
|---|---|---|---|
| magic | 257 | 6 | "ustar" followed by null |
| version | 263 | 2 | "00" |
| uname | 265 | 32 | User name (null-terminated) |
| gname | 297 | 32 | Group name (null-terminated) |
| devmajor | 329 | 8 | Device major number (octal) |
| devminor | 337 | 8 | Device minor number (octal) |
| prefix | 345 | 155 | Path prefix for long filenames (null-padded) |
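The interplay of the name and prefix fields can be illustrated with a short Python sketch. The helper names `split_ustar_path` and `ustar_path` are hypothetical, and the splitting strategy (take the first slash that makes both halves fit) is only one of several valid choices; real implementations may pick a different split point.

```python
USTAR_MAGIC = b"ustar\0"  # 6-byte magic field at offset 257


def split_ustar_path(path: str) -> tuple:
    """Split a path into (prefix, name) so that it fits the ustar header fields."""
    raw = path.encode("utf-8")
    if len(raw) <= 100:
        return "", path  # short paths need no prefix
    for i, byte in enumerate(raw):
        if byte != ord("/"):
            continue
        prefix, name = raw[:i], raw[i + 1:]
        # The separating slash itself is not stored in either field.
        if 0 < len(prefix) <= 155 and 0 < len(name) <= 100:
            return prefix.decode("utf-8"), name.decode("utf-8")
    raise ValueError("path does not fit the ustar name and prefix fields")


def ustar_path(block: bytes) -> str:
    """Rebuild the full member path from a 512-byte header block."""
    name = block[0:100].split(b"\0", 1)[0]
    if block[257:263] != USTAR_MAGIC:
        return name.decode("utf-8")  # pre-ustar header: no prefix field
    prefix = block[345:500].split(b"\0", 1)[0]
    full = prefix + b"/" + name if prefix else name
    return full.decode("utf-8")
```

A path with a single component longer than 100 bytes cannot be split at all, which is one reason the later pax extensions store long paths outside the fixed header.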
POSIX.1-2001 and pax Extensions
The POSIX.1-2001 standard introduced significant extensions to the tar archive format through the pax interchange format, enabling support for modern file systems and attributes beyond the limitations of prior formats.[15] This format maintains backward compatibility with ustar archives while adding flexibility via extended header records, which precede the regular file data and allow for the storage of additional metadata.[8]
Extended headers in the pax format use specific type flags: type flag 'x' denotes a per-file extended header, containing metadata applicable only to the immediately following file, while type flag 'g' indicates a global extended header that applies to all subsequent files in the archive until overridden.[15] These headers consist of key-value records, each written as a decimal record length, a space, the keyword, an equals sign, the value, and a newline. Keys are standardized keywords (such as path for filenames, size for file sizes, mtime for modification times, uid and gid for ownership, and linkpath for links) or vendor-specific extensions prefixed with vendor identifiers, and values are decimal numbers or UTF-8 encoded strings.[15] The key-value structure permits arbitrary attributes, overcoming ustar field length restrictions: for instance, the path keyword supports filenames and paths exceeding 256 characters, and the size keyword enables representation of files larger than 8 GiB using arbitrary-length decimal strings rather than fixed octal fields.[15]
The pax format supports sparse files through implementation-defined keywords in extended headers, such as GNU.sparse.map in GNU tar implementations, which describe maps of allocated blocks and holes (unallocated regions) so that zero-filled holes can be omitted from the archive.[15] Global extended headers, via type flag 'g', carry archive-wide metadata, such as user ID mappings (uname and gname keywords linking numeric IDs to symbolic names) or default attributes applied across multiple files, enhancing portability in heterogeneous environments.[15] For incremental archiving, implementations may use directory modification times (mtime) to identify files changed since the last backup, though specific mechanisms vary by tool.[15]
Subsequent revisions refined these extensions for better internationalization and robustness. POSIX.1-2008 mandated UTF-8 encoding for all textual fields in extended headers, including paths and names, to ensure consistent handling of international characters across locales. POSIX.1-2017 further emphasized security, recommending that implementations validate paths against traversal attempts (e.g., rejecting entries with leading slashes or excessive parent directory references) to mitigate risks like tarbomb extractions.
Core Functionality
Key Features
The tar utility is designed to preserve the hierarchical directory structure of filesystems, maintaining the full tree organization including subdirectories and their relative paths during archiving and extraction. This capability ensures that the archived files can be restored to their original layout without loss of organizational integrity, distinguishing tar from simpler concatenation tools.[16]
A core strength of tar lies in its retention of essential file metadata, including permissions (stored as the mode field), ownership information (user ID and group ID via uid and gid fields), modification timestamps (mtime), and support for both hard and symbolic links (indicated by type flags such as '1' for hard links and '2' for symlinks). These elements are encoded in the archive header for each member, allowing faithful reproduction of file attributes upon extraction, which is critical for system backups and software distribution.[7]
Tar supports multi-volume archives, enabling the creation of large archives split across multiple media or files, such as tapes, by automatically prompting for volume changes and continuing the operation seamlessly. This feature, facilitated through options like --multi-volume, accommodates storage limitations on older or constrained devices while maintaining archive integrity across volumes.[17]
The append mode allows users to add new files to an existing tar archive without needing to extract and recreate it, using mechanisms that update the archive incrementally while preserving the original contents. This efficiency is particularly useful for ongoing backup scenarios where only changes need to be incorporated.[18]
For incremental backups, tar provides support through the --listed-incremental option, which uses snapshot files to compare and archive only modified or new files since the last backup, based on timestamp and inode comparisons. This method optimizes storage and time by avoiding full re-archiving of unchanged data.[19]
Due to its adherence to standardized formats like POSIX.1-1988 and subsequent extensions, tar exhibits high portability across Unix-like operating systems, ensuring archives created on one platform can be reliably read and extracted on another without format incompatibilities.[16]
Command Syntax
The tar command employs the general syntax tar [options...] [archive-file] [files...], where options define the operation and modifiers, the optional archive-file specifies the target archive (defaulting to standard output or an environment-defined device), and files... lists the paths to process or patterns for selection.[17]
Options fall into key categories, including action modes that determine the primary operation: -c or --create to form a new archive from specified files; -x or --extract (or --get) to unpack files from an existing archive; -t or --list to display archive contents without extraction; -r or --append to add files to the end of the archive; and -u or --update to append only files newer than their counterparts in the archive. File selection options refine which paths are included or excluded, such as --exclude=PATTERN to skip files matching a given pattern and, in bsdtar, --include=PATTERN to limit processing to matching files only.[17] Output control options manage archive handling and working directories, notably -f, --file=NAME to designate the archive file or device and -C, --directory=DIR to switch to directory DIR before processing the file names that follow it.[17]
GNU tar supports both short and long option forms, with short options prefixed by a single hyphen (e.g., -f) and long options by two (e.g., --file=NAME). Short options can be bundled after a single hyphen (e.g., -cf combines --create and --file); an option that takes an argument, such as -f, must come last in the bundle, with its argument either attached (e.g., -farchive.tar) or given as the next word.[20] Option order matters little, except that the primary action mode must appear before the operands; tar treats the remaining non-option arguments as file names.[17]
Environment variables influence default behaviors, such as TAPE, which sets the archive name or device when -f is omitted, allowing invocation without explicit file specification.[17] For error handling, the --warning=KEYWORD option enables or suppresses warnings for non-fatal conditions; for example, the keyword file-changed controls the warning issued when a file changes while it is being read, and prefixing a keyword with no- (as in no-file-changed) suppresses it without halting execution.[17] Options like --multi-volume further support features such as spanning archives across multiple media.[20]
Basic Operations
The basic operation for creating a tar archive involves using the --create (or -c) option combined with --file (or -f) to specify the archive name, followed by the paths of the files or directories to include.[21] For example, the command tar -cf archive.tar file1.txt file2.txt bundles the specified files into a single archive file named archive.tar.[21] When including directories, such as tar -cf archive.tar directory/, the command recursively adds all contents within that directory while preserving the internal structure.[21]
Wildcards can be used in the file list to select multiple items efficiently, as the shell expands them before tar processes the arguments; for instance, tar -cf archive.tar *.txt archives all files ending in .txt in the current directory.[22] Regarding paths, GNU tar by default stores relative paths in the archive and strips any leading slash from absolute paths to avoid embedding the full filesystem hierarchy, ensuring portability; however, the -P or --absolute-names option can be used if absolute paths must be preserved. This behavior helps prevent issues when extracting archives on different systems.
To extract files from a tar archive, the --extract (or -x) option is employed alongside -f to specify the archive, as in tar -xf archive.tar, which restores the contents to the current directory while recreating the original directory structure. For path adjustment during extraction, the --strip-components=N option removes the first N leading components from member names; for example, tar -xf archive.tar --strip-components=1 discards the top-level directory, placing files directly in the current directory instead.
Listing the contents of an archive without extracting uses the --list (or -t) option with -f, such as tar -tf archive.tar, which outputs the names of all members in the archive. Adding the -v or --verbose flag provides detailed metadata, including permissions, ownership, sizes, and modification times, as in tar -tvf archive.tar, allowing inspection of archive details without modifying the filesystem.
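The create, list, and extract operations above have rough counterparts in Python's standard `tarfile` module, which can be useful for scripting. The sketch below is an approximation under stated assumptions: the wrapper names `create_archive`, `list_archive`, and `extract_archive` are hypothetical, and the module does not reproduce every behavior of the command-line tool.

```python
import tarfile


def create_archive(archive_path, *paths):
    """Rough equivalent of `tar -cf archive_path paths...`."""
    with tarfile.open(archive_path, "w") as tf:
        for path in paths:
            tf.add(path)  # directories are added recursively


def list_archive(archive_path):
    """Rough equivalent of `tar -tf archive_path`: return the member names."""
    with tarfile.open(archive_path, "r") as tf:
        return tf.getnames()


def extract_archive(archive_path, dest="."):
    """Rough equivalent of `tar -xf archive_path -C dest`."""
    # Newer Python versions accept an extraction filter; use it when available.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r") as tf:
        tf.extractall(dest, **kwargs)
```

For example, `create_archive("archive.tar", "directory")` followed by `list_archive("archive.tar")` returns the directory and the files below it, in the order they were written.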
A common pitfall during extraction is the potential overwriting of existing files with the same names, which tar performs by default without prompting.[23] This can be mitigated using the --keep-old-files (or -k) option, which refuses to replace existing files and reports each conflict as an error; for instance, tar -xkf archive.tar leaves existing files untouched and reports an error for each one instead of overwriting it.[23] Alternatively, --skip-old-files silently ignores existing files without erroring, suitable for non-interactive scripts.[23]
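The skip-on-conflict behavior can also be approximated in a script. The following Python sketch is an illustration only: `extract_skip_existing` is a hypothetical helper, it handles just regular files and directories, and as an extra precaution (an assumption of this sketch, not a description of any tar option) it refuses member names that would resolve outside the destination directory.

```python
import os
import shutil
import tarfile


def extract_skip_existing(archive_path, dest):
    """Extract files and directories without ever replacing an existing file.

    Returns the names of the members that were skipped.
    """
    dest_root = os.path.realpath(dest)
    skipped = []
    with tarfile.open(archive_path, "r") as tf:
        for member in tf:
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # Refuse names that would land outside dest (absolute or ../ paths).
            if os.path.commonpath([dest_root, target]) != dest_root:
                skipped.append(member.name)
                continue
            if member.isdir():
                os.makedirs(target, exist_ok=True)
            elif member.isfile():
                if os.path.exists(target):
                    skipped.append(member.name)  # similar to --skip-old-files
                    continue
                os.makedirs(os.path.dirname(target), exist_ok=True)
                # Mode "xb" fails rather than overwrite if the file appears meanwhile.
                with tf.extractfile(member) as src, open(target, "xb") as out:
                    shutil.copyfileobj(src, out)
            else:
                skipped.append(member.name)  # links and devices: not handled here
    return skipped
```

Returning the skipped names lets a caller decide whether a conflict should be treated as an error, as with --keep-old-files, or ignored.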
Practical Applications
Piping and Compression Integration
One of the key strengths of tar lies in its ability to integrate seamlessly with compression utilities through Unix pipes, enabling the creation of compressed archives without generating intermediate uncompressed files. This process, often referred to as tar piping, involves directing tar's output stream to a compressor in real time. For instance, the command tar cf - directory/ | gzip > archive.tar.gz creates an uncompressed tar stream from the specified directory and pipes it directly to gzip for compression, producing a .tar.gz file efficiently. This streaming approach uses the Unix pipe mechanism to process data on the fly, minimizing temporary storage needs and reducing overall disk I/O, which is particularly beneficial for handling large datasets or when working in resource-constrained environments.[24]
GNU tar enhances this integration by providing built-in command-line flags that automate the piping to common compression tools, eliminating the need for explicit pipe syntax in many cases. The -z flag invokes gzip for both creation (tar czf archive.tar.gz directory/) and extraction (tar xzf archive.tar.gz), while -j pairs with bzip2 (tar cjf archive.tar.bz2 directory/) and -J with xz (tar cJf archive.tar.xz directory/) for higher compression ratios at the cost of increased CPU usage. When reading an archive file, GNU tar recognizes the compression format from the file's signature and invokes the appropriate decompressor transparently, so tar xf archive.tar.gz works without -z; when creating one, the -a (--auto-compress) option selects the compressor from the archive's file name suffix.[24] Bidirectional piping extends this flexibility to extraction workflows, such as gunzip -c archive.tar.gz | tar xf -, which decompresses the input stream and feeds it to tar for unpacking without writing the decompressed tar to disk.
Historically, early tar implementations required manual piping to external compressors like compress or gzip as separate steps, but GNU tar introduced integrated flags starting with the -z option for gzip in versions around 1992, marking a shift toward streamlined, user-friendly compressed archiving.[25] This evolution improved workflow efficiency and popularized compressed tar formats in Unix-like systems, as the combined operations reduce processing time and storage overhead for backups and distributions.[24]
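The same streaming idea, writing a compressed archive without an intermediate uncompressed file, can be sketched with the stream modes of Python's `tarfile` module. The function names `write_tar_gz_stream` and `read_tar_stream` are hypothetical; the `"w|gz"` and `"r|*"` mode strings are those documented for `tarfile.open`.

```python
import os
import tarfile


def write_tar_gz_stream(src_dir, out_stream):
    """Rough equivalent of `tar cf - src_dir | gzip`, written to out_stream."""
    top = os.path.basename(os.path.normpath(src_dir))
    # "w|gz" emits a gzip-compressed tar stream and never seeks backwards,
    # so out_stream may be a pipe or socket as well as a file.
    with tarfile.open(fileobj=out_stream, mode="w|gz") as tf:
        tf.add(src_dir, arcname=top)


def read_tar_stream(in_stream):
    """Rough equivalent of `gunzip -c | tar tf -`: yield (name, size) pairs."""
    # "r|*" reads forward only and detects the compression format itself.
    with tarfile.open(fileobj=in_stream, mode="r|*") as tf:
        for member in tf:
            yield member.name, member.size
```

Because both functions touch their stream strictly in order, they can be connected to `sys.stdout.buffer` or `sys.stdin.buffer` in the same way the shell commands are connected by a pipe.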
Software Distribution and Packaging
Tar archives, commonly known as tarballs, play a central role in source code distribution for open-source software projects, bundling source files, build scripts, documentation, and configuration files into a single, portable file while preserving file permissions, ownership, and directory structures essential for automated builds. In systems using GNU Autotools, such as autoconf and automake, the make dist target generates a compressed tar archive (typically .tar.gz) that includes all necessary components for compilation and installation on various Unix-like systems, ensuring reproducibility without requiring version control metadata. This format has been a standard for distributing GNU software since the late 1980s, when the GNU Project began releasing tools and utilities in tarball form to facilitate free software sharing and modification.
Historically, tarballs have been integral to major software releases; for instance, GNU programs like the original tar utility itself were distributed via tar archives starting from its early versions in the 1980s, aligning with the project's goal of creating a free Unix-like operating system. Similarly, the Linux kernel sources have been provided as tarballs on kernel.org since the kernel's inception in 1991, allowing developers worldwide to download, compile, and contribute to the codebase with consistent file integrity and structure.
Tar serves as a foundational component in several binary package formats used for software distribution. In Debian-based systems, .deb packages encapsulate their payload in a data.tar archive, typically compressed with gzip, xz, or zstd (e.g., data.tar.gz, data.tar.xz, data.tar.zst), which contains the installed files and is extracted during package installation to place binaries, libraries, and resources in the appropriate system directories.[26] For RPM-based distributions, source RPMs (.src.rpm) incorporate upstream tarballs as the primary source archive, which rpmbuild unpacks during the preparation phase to apply patches and build binaries. AppImages, a portable application format, are often constructed from extracted tar.gz bundles using tools like pkg2appimage, enabling self-contained executables that run without system-wide installation.
Best practices for creating distribution tarballs emphasize cleanliness and security to avoid including unnecessary or sensitive data. Developers routinely exclude version control directories like .git using the --exclude-vcs option in GNU tar, preventing the inclusion of repository history that could bloat the archive or expose private information. Additionally, signing tarballs with tools like GPG or minisign is recommended to verify integrity and authenticity, as outlined in GNU guidelines for source packages, where detached signatures accompany the archive for user validation.
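A build script can apply the same exclusion when it assembles a tarball itself. The Python sketch below is a hedged illustration: `skip_vcs` and `make_dist_tarball` are hypothetical names, and the `VCS_DIRS` set is an illustrative subset rather than the full list that GNU tar's --exclude-vcs covers.

```python
import os
import tarfile

# Illustrative subset only; not the complete list used by --exclude-vcs.
VCS_DIRS = {".git", ".hg", ".svn", "CVS"}


def skip_vcs(tarinfo):
    """Filter for TarFile.add(): drop version-control directories."""
    if os.path.basename(tarinfo.name) in VCS_DIRS:
        return None  # returning None excludes the member and everything below it
    return tarinfo


def make_dist_tarball(src_dir, archive_path):
    """Write src_dir to a gzip-compressed tarball, leaving out VCS metadata."""
    top = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(archive_path, "w:gz") as tf:
        tf.add(src_dir, arcname=top, filter=skip_vcs)
```

Keeping a single top-level directory in the archive, as `arcname=top` does here, also means that extracting the tarball creates one directory instead of scattering files into the current one.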
In modern contexts, tar extends to containerization and cloud environments; Docker container images are layered using tar archives for efficient storage and transfer, with each layer representing an immutable filesystem snapshot that can be imported or exported via docker save and docker load. Cloud platforms like OpenShift leverage tar for packaging application artifacts during builds and deployments, streaming archives to build nodes for rapid assembly into container images.[27] Tarballs are often compressed with gzip or xz to reduce download sizes in these workflows.
Limitations
Path and Filename Handling
The original tar format, derived from Version 7 Unix, limits filenames to 100 bytes, including the null terminator, which restricts paths to relatively short names without support for longer hierarchies or prefixes. This constraint often leads to truncation or errors when archiving files with extended paths, as the header block allocates exactly 100 bytes for the name field. The UStar format, standardized in POSIX.1-1988, extends this capability by introducing a 155-byte prefix field for directory paths, allowing a total pathname length of up to 256 bytes when combined with the 100-byte name field and a separating slash. In this structure, the prefix holds the leading directory components, while the name field stores the basename, enabling better support for deeper directory trees without altering the core header size. The POSIX.1-2001 pax format further removes these limits by using extended header records to store arbitrary-length pathnames and filenames as key-value pairs before the file data, supporting paths of effectively unlimited size in compliant implementations.[15]
Tar archives can pose path traversal risks during extraction if they contain absolute paths (starting with '/') or sequences like '../' that navigate outside the intended directory. By default, GNU tar strips the leading '/' from absolute paths to prevent writing to the filesystem root, but enabling the --absolute-names option restores them, potentially allowing overwrites in sensitive locations. Similarly, '../' sequences enable upward traversal, which may overwrite files in parent directories if extraction occurs without isolation, such as in a non-empty working directory.[28]
A tarbomb refers to a maliciously crafted tar archive designed to scatter files across the filesystem upon extraction, often using relative paths, multiple directory levels, or symlink tricks to clutter or overwrite unintended areas.[29] These can overwhelm storage or compromise system integrity, particularly if extracted by privileged users. Mitigations include GNU tar's --no-overwrite-dir option, which preserves metadata of existing nonempty directories without overwriting their contents, and --keep-old-files, which refuses extraction of conflicting files entirely.[23] Additional safeguards involve extracting to an empty temporary directory or using tools like bsdtar with strict path normalization to block traversal attempts.[23]
Legacy tar implementations assume ASCII encoding for filenames, limiting support to 7-bit characters and causing issues with international or extended sets on modern systems. Contemporary tools, such as GNU tar in POSIX.1-2001 mode, accommodate UTF-8 by storing filename bytes directly from the filesystem and using extended pax headers to declare encoding metadata, ensuring compatibility with Unicode paths.[15]
Handling special characters in tar filenames involves no inherent escaping within the archive itself, as tar preserves the raw byte sequence from the source filesystem, but command-line invocation requires shell escaping (e.g., backslashes) to pass names containing spaces, quotes, or glob characters correctly. Portability challenges arise across systems with varying character restrictions; for instance, Windows-derived tools may reject certain Unicode or control characters that Unix tolerates, while older Unix variants limit names to portable sets like alphanumerics, underscore, and period to avoid decoding errors. To enhance cross-platform reliability, filenames should avoid non-ASCII or control bytes, in line with POSIX recommendations for the portable filename character set.[30]
Attribute and Permission Preservation
The tar archive format stores file attributes in the header block preceding each file's data. The mode field, occupying bytes 100 through 107, is an 8-byte octal string representing the file permissions (nine bits for read, write, and execute access for owner, group, and others) along with three special bits for setuid, setgid, and sticky modes.[7] The user ID (uid) and group ID (gid) fields, at bytes 108-115 and 116-123 respectively, are also 8-byte octal strings storing numeric identifiers.[7] The modification time (mtime) is recorded in bytes 136-147 as a 12-byte octal string denoting seconds since the Unix epoch (January 1, 1970).[7] In the POSIX ustar format, the 8-byte fields hold seven octal digits, so mode, uid, and gid can reach at most 07777777 (octal), or 2097151 (decimal), while the eleven octal digits of mtime reach 077777777777 (octal), or 8589934591 (decimal). These widths limit timestamps to whole seconds and can overflow for very large IDs or far-future dates.[6]
During extraction, preserving these attributes presents challenges, particularly for ownership. Setting the original uid and gid requires root privileges on Unix-like systems, as only the superuser can assign arbitrary user and group IDs; without them, tar implementations like GNU tar default to the extracting user's uid and gid. For permissions, the mode is applied where possible, but special bits (setuid, setgid, sticky) are typically ignored or cleared unless extracted by root. Fallback mechanisms include mapping usernames (uname) and group names (gname) from the header to local equivalents if they exist, prioritizing names over numeric IDs for compatibility across systems.[7] In GNU tar, --same-owner requests that the recorded ownership be restored (the default when running as root) and --numeric-owner makes tar use the numeric IDs instead of the names; for non-root users the attempt usually fails, leaving the files owned by the extracting user.
Modern tar implementations extend attribute preservation through the POSIX.1-2001 pax format and vendor-specific features. GNU tar and pax-compatible tools support extended attributes (xattrs), which store additional metadata such as access control lists (ACLs) and SELinux security labels, using dedicated options like --xattrs, --acls, and --selinux during both creation and extraction.[31] These are archived as supplementary headers in pax format, allowing preservation of filesystem-specific attributes beyond basic POSIX modes.[31] For timestamps, the ustar mtime field of eleven octal digits provides only second precision, but pax extended headers can record times with a decimal fraction (up to nine digits for nanoseconds), enabling sub-second accuracy on supporting filesystems like ext4.[6]
Cross-platform extraction introduces further complications, especially between Unix-like systems and Windows. Unix permissions do not directly map to Windows NTFS ACLs, leading to mismatches where executable bits may be lost or directories become read-only; GNU tar on Windows (via Cygwin or MSYS2) approximates Unix modes but cannot fully replicate them without additional tools. To mitigate this, the --mode option in GNU tar can force a specific mode (e.g., --mode=0755) onto files as they are added to an archive, so that archives carry consistent permissions regardless of the host OS on which they were created, though ownership and extended attributes remain Unix-centric and often unsupported on Windows.
Security Risks
Tar archives pose several security risks, particularly when extracting untrusted files, as the format lacks inherent mechanisms to prevent malicious content from causing harm. One significant concern involves the potential for automatic execution of scripts contained within the archive. In certain tools and environments, such as specific package management systems or automated installers that process tar files, embedded scripts may be triggered during or immediately after extraction, enabling command injection attacks if the archive is sourced from unverified providers.[32]
Additionally, tar does not include built-in support for digital signatures or integrity verification, making it susceptible to tampering or corruption during transmission. Users must therefore depend on external utilities, such as GnuPG (gpg), to validate the archive's authenticity and integrity before extraction; for instance, signatures are typically provided as separate files and checked against the archive with gpg --verify.[33][34]
Extracting tar archives with elevated privileges exacerbates these vulnerabilities, as malicious files within the archive, such as setuid binaries or configuration-altering scripts, can gain system-level access, leading to privilege escalation. Performing extractions as the root user, a common practice in system administration, can thus transform a seemingly benign archive into a vector for widespread compromise, including unauthorized modifications to critical system components.[34][35] Historical exploits highlight the long-standing nature of these issues; for example, GNU tar versions prior to 1.13.25 were vulnerable to symlink attacks that enabled arbitrary file overwrites (CVE-2002-1216), allowing attackers to replace sensitive files through crafted symbolic links during extraction.[36]
To mitigate these risks, administrators should extract untrusted tar files in isolated environments, such as sandboxes built with chroot or unshare namespaces, to contain potential damage. Implementations like bsdtar (from libarchive) offer secure flags, including --no-same-owner to disregard ownership metadata and --no-same-permissions to ignore file permissions from the archive, thereby preventing the inheritance of potentially malicious attributes. Furthermore, scanning archives with antivirus software prior to extraction and avoiding root privileges during the process are essential best practices; the GNU tar documentation explicitly advises against allowing untrusted users to access extracted files without prior inspection for issues like setuid programs.[37][34]
Access and Duplication Issues
Tar archives are inherently sequential in structure, consisting of a continuous stream of file headers and data without an index or central directory for quick navigation. This design necessitates scanning the archive from the beginning to list contents, extract specific files, or determine file positions, which can be inefficient for large archives or frequent random access operations. Unlike formats such as ZIP, which include a central directory enabling direct seeking to individual files, tar's stream-oriented approach prioritizes simplicity and compatibility with tape drives but limits performance in scenarios requiring non-linear access.[38][39][40]
When extracting files, tar by default overwrites duplicates (files whose names are already present in the destination) without warning, potentially leading to unintended data loss. To mitigate this, GNU tar provides options like --skip-old-files, which silently skips extraction of existing files, and --keep-old-files, which refuses to replace existing files and reports each one as an error. Additionally, the --warning=existing-file option makes tar report the files it skips, which helps in monitoring potential overwrites during extraction. These options give control over how duplicates are handled, but they must be requested explicitly; otherwise existing files are replaced silently.[23][41]
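The sequential layout and the handling of repeated names can be illustrated with a minimal Python sketch. It assumes a plain ustar archive built in memory, with 512-byte blocks, the member name in the first 100 header bytes, and the size as an octal string at bytes 124 to 135; it ignores pax extended headers, GNU long names, and the ustar prefix field. The helper names build_demo_archive and scan_headers are hypothetical and exist only for this illustration.

```python
import io
import tarfile

BLOCK = 512  # tar stores headers and file data in 512-byte blocks


def build_demo_archive() -> bytes:
    """Build a small ustar archive in memory; 'a.txt' is stored twice."""
    buf = io.BytesIO()
    members = [("a.txt", b"first"), ("b.txt", b"x" * 600), ("a.txt", b"second")]
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        for name, payload in members:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def scan_headers(data: bytes) -> list[tuple[str, int, int]]:
    """Walk the archive from the start, returning (name, header_offset, size).

    There is no index: the position of each header is only known after
    reading the size recorded in the header before it.
    """
    entries = []
    offset = 0
    while offset + BLOCK <= len(data):
        header = data[offset:offset + BLOCK]
        if header == bytes(BLOCK):
            break  # an all-zero block marks the end of the archive
        name = header[:100].split(b"\0", 1)[0].decode("utf-8")
        size = int(header[124:136].rstrip(b"\0 ").decode("ascii") or "0", 8)
        entries.append((name, offset, size))
        data_blocks = (size + BLOCK - 1) // BLOCK  # data is padded to a full block
        offset += BLOCK * (1 + data_blocks)
    return entries


archive = build_demo_archive()
entries = scan_headers(archive)

# With repeated names, Python's tarfile treats the last occurrence as current.
with tarfile.open(fileobj=io.BytesIO(archive)) as tf:
    latest = tf.extractfile(tf.getmember("a.txt")).read()
```

In this sketch the second copy of a.txt can only be found by stepping over every earlier member, which is the cost the text describes for archives without a central directory.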
Multi-volume tar archives, used for spanning large datasets across multiple media like tapes or disks, impose limitations due to their sequential nature and lack of built-in compression support. Creation or extraction often requires manual intervention, such as prompting the user to insert the next volume when the current one fills, which can disrupt automated workflows. Some implementations, including GNU tar, offer automation via the --new-volume-script option, allowing scripted handling of volume changes, though this still demands careful setup to manage spans effectively. These constraints make multi-volume tar suitable for backup scenarios but less ideal for seamless large-scale operations.[42]
Scalability challenges arise when archiving large directories, particularly those with millions of small files, as tar may consume significant memory to build internal file lists or buffers during creation or extraction. For instance, processing directories with over a million files can lead to system crashes due to excessive RAM usage in default configurations. The --one-file-system option addresses this by restricting archiving to the source filesystem, preventing recursive traversal across mount points that could greatly increase the workload and memory demands. This mitigation enhances performance in multi-filesystem environments but underscores tar's limitations in handling vast, nested structures without additional tuning.[43][44]
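The idea behind restricting traversal to one filesystem can be sketched in Python by comparing device numbers while walking a tree. This is an illustration of the technique under stated assumptions, not how any tar implementation is written: the function name walk_one_filesystem is hypothetical, and the demo tree is a temporary directory in which everything lives on one device.

```python
import os
import tempfile
from typing import Iterator


def walk_one_filesystem(root: str) -> Iterator[str]:
    """Yield file paths under root, never descending into other devices."""
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk skips directories that are mount points
        # of a different filesystem (their st_dev differs from the root's).
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
        ]
        for name in filenames:
            yield os.path.join(dirpath, name)


# Small demonstration on a temporary tree (all on one filesystem).
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, "sub"))
    for rel in ("top.txt", os.path.join("sub", "inner.txt")):
        with open(os.path.join(demo_root, rel), "w") as fh:
            fh.write("x")
    found = sorted(os.path.relpath(p, demo_root) for p in walk_one_filesystem(demo_root))
```

Because the function is a generator, paths are produced one at a time rather than collected into a list first, which keeps the walker's own memory use independent of the number of files.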
Implementations and Conventions
Major Implementations
GNU tar is the most widely used implementation on Linux systems, providing full support for the POSIX.1-2001 archive format through options like --format=posix and extensive extensions for modern features.[45] It integrates compression directly via command-line flags such as -z for gzip and -j for bzip2, allowing seamless creation of compressed archives without external tools. Additionally, GNU tar supports incremental backups using snapshot files with --listed-incremental, enabling efficient updates to archives by tracking changes since the last backup.
BSD tar, often implemented via the libarchive library with bsdtar as its command-line frontend, emphasizes strict adherence to POSIX standards, including IEEE Std 1003.1-2001 for ustar and pax interchange formats.[46] This implementation excels in cross-platform portability, supporting extraction and creation across Unix-like systems, Windows, and macOS through libarchive's broad format compatibility, which includes tar, cpio, zip, and more.[47] Unlike GNU tar's more permissive extensions, BSD tar prioritizes standards compliance to ensure reliable interoperability, though it may reject non-standard GNU-specific headers.[47]
Star, part of the Schily tools suite developed by Joerg Schilling, extends the UStar format with enhanced support for access control lists (ACLs) via the exustar format and Rock Ridge extensions for ISO 9660 CD-ROM archives, improving data integrity and filesystem attribute preservation.[48] It focuses on high performance and robustness, particularly for media archiving, by handling extended attributes like SELinux labels and ensuring backward compatibility with POSIX while adding proprietary keys like SCHILY.acl for ACL storage.[49]
Platform-specific variants include BusyBox tar, a lightweight implementation designed for embedded systems, which provides basic tar functionality in a minimal footprint under 1 MB to support resource-constrained environments like IoT devices. Python's tarfile module offers a programmatic interface for scripting tar operations, supporting reading and writing of POSIX.1-2001 compliant archives with built-in handling for gzip, bzip2, and lzma compression, making it ideal for automated tasks in cross-platform applications.[13]
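As a brief illustration of the programmatic interface, the following sketch uses Python's tarfile module to write a gzip-compressed archive in the POSIX.1-2001 (pax) format and read it back. The archive lives in memory, and the member name docs/readme.txt and its contents are made up for the example.

```python
import io
import tarfile

payload = b"hello, tar\n"
buf = io.BytesIO()

# "w:gz" writes a gzip-compressed archive; PAX_FORMAT selects POSIX.1-2001 headers.
with tarfile.open(fileobj=buf, mode="w:gz", format=tarfile.PAX_FORMAT) as tf:
    info = tarfile.TarInfo("docs/readme.txt")
    info.size = len(payload)
    info.mode = 0o644
    tf.addfile(info, io.BytesIO(payload))

buf.seek(0)

# "r:*" lets tarfile detect the compression transparently when reading.
with tarfile.open(fileobj=buf, mode="r:*") as tf:
    names = tf.getnames()
    restored = tf.extractfile("docs/readme.txt").read()
```

Switching the mode string to "w:bz2" or "w:xz" selects the other compressors the module supports, provided the corresponding Python modules are available in the installation.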
Key differences among these implementations lie in their approach to standards and extensions: GNU tar favors liberal enhancements for usability on Linux, potentially reducing portability, while BSD tar maintains conservatism for broad compatibility, and star prioritizes integrity with specialized media features. The following table summarizes format compatibility based on portability tests:
| Feature/Format | GNU tar | BSD tar (libarchive) | Star (Schily) |
|---|---|---|---|
| POSIX UStar | Full | Full | Full |
| POSIX Pax | Full | Full | Full |
| GNU Extensions | Native | Partial (reads, rejects some) | Partial |
| Star/SCHILY Keys | Partial | Reads most | Native |
| ACL Support | Via xattr | Via pax extensions | Native (exustar) |
| Rock Ridge | No | Partial (ISO read) | Full |
Compressed Archive Suffixes
Compressed tar archives employ standardized filename extensions to denote the compression algorithm applied to the underlying .tar file, enabling easy identification and automated processing in tools like GNU tar. These conventions arose from common practices in Unix-like systems, where the base .tar extension signifies an uncompressed archive, while appended suffixes indicate compression for reduced storage and transmission efficiency.[45][8] The following table summarizes the primary standard extensions:
| Extension | Compression Method | Notes |
|---|---|---|
| .tar | None (uncompressed) | Basic tarball format.[45] |
| .tar.gz, .tgz | gzip | Widely used for its balance of speed and compression ratio; .tgz is a common shorthand.[13] |
| .tar.bz2, .tbz | bzip2 | Offers better compression than gzip at the cost of slower processing.[13] |
| .tar.xz | xz (LZMA2) | Provides high compression ratios, suitable for large archives.[13] |
| .tar.zst | zstd (Zstandard) | High compression ratios with good speed; supported natively in GNU tar and other modern tools.[52] |
| .tar.Z | compress (legacy) | Older Unix compression method, less efficient and rarely used today.[8] |
| .tar.lz | lzip | Employs LZMA-like compression with emphasis on data integrity and error recovery. |
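GNU tar provides the --auto-compress (-a) option, which examines the archive's filename suffix to select the compression program when creating an archive. This feature streamlines workflows by eliminating the need to specify compression flags explicitly, provided the suffix matches one of the recognized patterns such as those listed above. When reading an archive, GNU tar detects the compression format automatically, so no compression flag is needed for extraction.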
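Suffix-based selection of a compressor can be imitated in a few lines of Python. The mapping below is a hypothetical sketch that covers only the suffixes for which Python's tarfile module has long-standing write modes; it is not GNU tar's actual suffix table, and the names SUFFIX_TO_MODE and mode_for are invented for this illustration.

```python
# Hypothetical mapping from filename suffix to a tarfile write mode.
SUFFIX_TO_MODE = {
    ".tar": "w:",        # uncompressed
    ".tar.gz": "w:gz",
    ".tgz": "w:gz",
    ".tar.bz2": "w:bz2",
    ".tbz": "w:bz2",
    ".tar.xz": "w:xz",
}


def mode_for(filename: str) -> str:
    """Return the tarfile write mode implied by the filename suffix."""
    lowered = filename.lower()
    # Try longer suffixes first so the most specific match wins.
    for suffix in sorted(SUFFIX_TO_MODE, key=len, reverse=True):
        if lowered.endswith(suffix):
            return SUFFIX_TO_MODE[suffix]
    raise ValueError(f"no known compression suffix in {filename!r}")
```

A caller could then pass the result straight to tarfile.open(filename, mode_for(filename)); names with unrecognized suffixes raise ValueError rather than silently producing an uncompressed archive.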
Variations exist beyond these standards, including .tar.lzma for direct LZMA compression, which is supported in some tools but not as universally as .tar.xz.[13] Non-standard combinations like .tar.7z, which apply 7-Zip compression to a tar archive, are occasionally used but lack broad tool support and are not recommended for interoperability.[53] Platform-specific extensions, such as .taz on Amiga systems for archives compressed with the legacy compress utility, further illustrate historical adaptations.[54] These compressed tar files can be generated via piping, for example, by streaming tar output directly to a compressor like gzip.
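The piping approach, which on the command line is roughly tar -cf - dir | gzip > dir.tar.gz, can be sketched in Python with the tarfile module's stream modes. The "w|" mode writes an uncompressed tar stream without seeking, and the gzip wrapper plays the role of the downstream compressor; the member name note.txt and the in-memory buffer are assumptions made for the example.

```python
import gzip
import io
import tarfile

compressed = io.BytesIO()

# The GzipFile acts as the compressor at the end of the pipe.
with gzip.GzipFile(fileobj=compressed, mode="wb") as gz:
    # "w|" produces a plain tar stream of blocks and never seeks backwards.
    with tarfile.open(fileobj=gz, mode="w|") as tf:
        data = b"streamed\n"
        info = tarfile.TarInfo("note.txt")
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

compressed.seek(0)

# "r|gz" reads the result back as a forward-only gzip-compressed stream.
with tarfile.open(fileobj=compressed, mode="r|gz") as tf:
    names = [member.name for member in tf]
```

Keeping the archiver and the compressor as separate stages mirrors the Unix pipeline: the tar layer knows nothing about compression, and any compressor that accepts a byte stream could be substituted.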