Tar (computing)
from Wikipedia
tar
Original author: Bell Laboratories
Developers: Various open-source and commercial developers
Initial release: January 1979; 46 years ago (1979-01)
Stable release(s):
  BSD tar: 3.7.2[1] / 2023-09-12
  GNU tar: 1.35[2] / 2023-07-18
  pdtar: 1986-10-29[3][4] / 1986-10-29
  Plan 9 tar: ? / ?
  star: 2023-09-28[5] / 2023-09-28
Written in: pdtar, star, Plan 9, GNU: C
Operating system: Unix, Unix-like, Plan 9, Microsoft Windows, IBM i
Platform: Cross-platform
Type: Command
License:
  BSD tar: BSD-2-Clause
  GNU tar: GPL-3.0-or-later
  pdtar: Public domain
  Plan 9: MIT
  star: CDDL-1.0

tar
Filename extension: .tar
Internet media type: application/x-tar
Uniform Type Identifier (UTI): public.tar-archive
Magic number:
  "ustar", NUL, "00" at byte offset 257 (for POSIX versions)
  "ustar", two spaces, NUL (for old GNU tar format)[6]
  absent in pre-POSIX versions
Latest release: various
Type of format: File archiver
Standard: POSIX since POSIX.1, presently in the definition of pax[1]
Open format?: Yes

In computing, tar is a shell command for combining multiple computer files into a single archive file. It was originally developed for magnetic tape storage – reading and writing data for a sequential I/O device with no file system, and the name is short for the format description "tape archive". When stored in a file system, a file that tar reads and writes is often called a tarball.

A tarball contains metadata for the contained files including the name, ownership, timestamps, permissions and directory organization. As a file containing other files with associated metadata, a tarball is useful for software distribution and backup.

POSIX abandoned tar in favor of pax, yet tar continues to have widespread use.

History

The command was introduced to Unix in January 1979, replacing the tp program (which in turn replaced "tap").[7] The file structure was standardized in POSIX.1-1988[8] and later POSIX.1-2001,[9] and became a format supported by most modern file archiving utilities. The tar command was abandoned in POSIX.1-2001 in favor of pax, which was to support the ustar file format, and tar had been marked for withdrawal in favor of pax since at least 1994. Nonetheless, many operating systems today include tools for tar files, as well as tools to compress and decompress them, such as xz, gzip, and bzip2.

The tar command was ported to the IBM i operating system.[10]

BSD tar has been included in Windows since 2018,[11][12] and other third-party tools are available for Windows.

Rationale

Many historic tape drives read and write variable-length data blocks, leaving significant wasted space on the tape between blocks (for the tape to physically start and stop moving). Some tape drives (and raw disks) support only fixed-length data blocks. Also, when writing to any medium such as a file system or network, it takes less time to write one large block than many small blocks. Therefore, the tar command writes data in records of many 512 B blocks. The user can specify a blocking factor, which is the number of blocks per record. The default is 20, producing 10 KiB records.[13]
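The record arithmetic described above can be sketched in a few lines of Python. The function names are illustrative only and are not part of any tar implementation; the sketch assumes that a writer pads its output to a whole number of records.

```python
BLOCK_SIZE = 512  # bytes per tar block


def record_size(blocking_factor: int = 20) -> int:
    """Bytes per record for a given blocking factor (blocks per record)."""
    return blocking_factor * BLOCK_SIZE


def padded_archive_size(payload_bytes: int, blocking_factor: int = 20) -> int:
    """Round a byte count up to a whole number of records."""
    rec = record_size(blocking_factor)
    return -(-payload_bytes // rec) * rec  # ceiling division, then scale back


print(record_size())              # 10240 bytes, i.e. 10 KiB
print(padded_archive_size(1536))  # 10240: three blocks still occupy one record
```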

File format

There are multiple tar file formats, including historical and current ones. Two tar formats are codified in POSIX: ustar and pax. Not codified but still in current use is the GNU tar format.

A tar archive consists of a series of file objects, hence the popular term tarball, referencing how a tarball collects objects of all kinds that stick to its surface. Each file object includes any file data, and is preceded by a 512-byte header record. The file data is written unaltered except that its length is rounded up to a multiple of 512 bytes. The original tar implementation did not care about the contents of the padding bytes, and left the buffer data unaltered, but most modern tar implementations fill the extra space with zeros.[14] The end of an archive is marked by at least two consecutive zero-filled records. (The origin of tar's record size appears to be the 512-byte disk sectors used in the Version 7 Unix file system.) The final block of an archive is padded out to full length with zeros.
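As a rough illustration of this layout, the following Python sketch computes how many bytes each file object occupies and the smallest possible size of an archive before any record padding. The helper names are hypothetical, and the sketch assumes exactly one header block per file, which does not hold for archives that use extended headers.

```python
BLOCK_SIZE = 512


def padded_length(size: int) -> int:
    """File data length rounded up to a multiple of 512 bytes."""
    return (size + BLOCK_SIZE - 1) // BLOCK_SIZE * BLOCK_SIZE


def member_footprint(size: int) -> int:
    """One 512-byte header plus the padded file data."""
    return BLOCK_SIZE + padded_length(size)


def minimal_archive_size(sizes) -> int:
    """All members plus the two zero-filled blocks that end the archive."""
    return sum(member_footprint(s) for s in sizes) + 2 * BLOCK_SIZE


print(padded_length(513))              # 1024
print(minimal_archive_size([1, 600]))  # 3584
```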

The file header record contains metadata about a file. To ensure portability across different architectures with different byte orderings, the information in the header record is encoded in ASCII. Thus if all the files in an archive are ASCII text files, and have ASCII names, then the archive is essentially an ASCII text file (containing many NUL characters).

The fields defined by the original Unix tar format are listed in the table below. The link indicator/file type table includes some modern extensions. When a field is unused it is filled with NUL bytes. The header uses 257 bytes and is then padded with NUL bytes to fill a 512-byte record. There is no "magic number" in the header for file identification.

Pre-POSIX.1-1988 (i.e. v7) tar header:

Field offset Field size Field
0 100 File path and name
100 8 File mode (octal)
108 8 Owner's numeric user ID (octal)
116 8 Group's numeric group ID (octal)
124 12 File size in bytes (octal)
136 12 Last modification time in numeric Unix time format (octal)
148 8 Checksum for header record
156 1 Link indicator (file type)
157 100 Name of linked file

The pre-POSIX.1-1988 Link indicator field can have the following values:

Link indicator field
Value Meaning
'0' or (ASCII NUL) Normal file
'1' Hard link
'2' Symbolic link

Some pre-POSIX.1-1988 tar implementations indicated a directory by having a trailing slash (/) in the name.
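The field layout in the table above maps directly onto a fixed-width binary structure. The following Python sketch unpacks one header block with the standard struct module. The function and key names are illustrative, and the sketch assumes numeric fields hold plain octal digits padded with spaces or NULs.

```python
import struct

# Field widths from the table: name, mode, uid, gid, size, mtime, checksum,
# link indicator, linked name. "255x" skips the NUL padding up to 512 bytes.
V7_HEADER = struct.Struct("100s8s8s8s12s12s8sc100s255x")
assert V7_HEADER.size == 512


def _text(raw: bytes) -> str:
    """Text up to the first NUL byte."""
    return raw.split(b"\0", 1)[0].decode("ascii")


def _octal(raw: bytes) -> int:
    """Octal digits with surrounding spaces and NULs removed; empty means 0."""
    digits = raw.strip(b" \0")
    return int(digits.decode("ascii"), 8) if digits else 0


def parse_v7_header(block: bytes) -> dict:
    """Decode one 512-byte header block into a dictionary."""
    (name, mode, uid, gid, size, mtime,
     chksum, link, linkname) = V7_HEADER.unpack(block)
    return {
        "name": _text(name),
        "mode": _octal(mode),
        "uid": _octal(uid),
        "gid": _octal(gid),
        "size": _octal(size),
        "mtime": _octal(mtime),
        "checksum": _octal(chksum),
        "link_indicator": link.decode("ascii"),
        "linkname": _text(linkname),
    }
```

The sketch does not verify the checksum and does not interpret a trailing slash in the name as a directory marker.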

Numeric values are encoded as octal numbers using ASCII digits, with leading zeroes. Whilst this choice may seem counterintuitive in modern times when the four bit hexadecimal notation is generally preferred because register sizes and address widths in computers are almost always powers of two and bytes have standardized to be octets, Unix was originally developed for the PDP-7, which uses an 18-bit CPU and a six-bit character code, making the three bit octal notation more desirable.

For historical reasons, a final NUL or space character should also be used. Thus although there are 12 bytes reserved for storing the file size, only 11 octal digits can be stored. This gives a maximum file size of just under 8 GiB (8^11 − 1 bytes) for archived files. To overcome this limitation, in 2001 star introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field.[citation needed] GNU tar and BSD tar followed this idea. Additionally, versions of tar from before the first POSIX standard from 1988 pad the values with spaces instead of zeroes.
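The two numeric encodings can be sketched in Python as follows. The function names are hypothetical, only non-negative values are handled, and the base-256 layout assumed here (a big-endian binary number filling the field, with the high-order bit of the first byte set as a flag) follows the description above rather than any particular implementation's source code.

```python
def encode_octal(value: int, width: int) -> bytes:
    """width-1 zero-padded octal digits followed by a NUL terminator."""
    digits = width - 1
    if not 0 <= value < 8 ** digits:
        raise ValueError("value does not fit in an octal field of this width")
    return f"{value:0{digits}o}".encode("ascii") + b"\0"


def encode_base256(value: int, width: int) -> bytes:
    """Big-endian binary value with the high bit of the first byte set."""
    if not 0 <= value < 256 ** (width - 1):
        raise ValueError("value does not fit in a base-256 field of this width")
    raw = value.to_bytes(width, "big")
    return bytes([raw[0] | 0x80]) + raw[1:]


def decode_numeric(field: bytes) -> int:
    """Decode either encoding; space-padded pre-POSIX values also work."""
    if field[0] & 0x80:  # high bit set: base-256
        return int.from_bytes(bytes([field[0] & 0x7F]) + field[1:], "big")
    digits = field.strip(b" \0")
    return int(digits.decode("ascii"), 8) if digits else 0


print(encode_octal(5, 12))              # b'00000000005\x00'
print(8 ** 11 == 8 * 2 ** 30)           # True: 11 octal digits stop just below 8 GiB
print(decode_numeric(encode_base256(8 ** 11, 12)) == 8 ** 11)  # True
```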

The checksum is calculated by taking the sum of the unsigned byte values of the header record with the eight checksum bytes taken to be ASCII spaces (decimal value 32). It is stored as a six-digit octal number with leading zeroes followed by a NUL and then a space. Various implementations do not adhere to this format. In addition, some historic tar implementations treated bytes as signed. Implementations typically calculate the checksum both ways, and treat it as good if either the signed or unsigned sum matches the included checksum.
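A minimal Python sketch of this calculation follows, assuming the checksum field occupies bytes 148 to 155 of the header as in the table above. The function names are illustrative.

```python
def header_checksums(block: bytes) -> tuple[int, int]:
    """Return (unsigned_sum, signed_sum) with the checksum field as spaces."""
    if len(block) != 512:
        raise ValueError("a header block is exactly 512 bytes")
    patched = block[:148] + b" " * 8 + block[156:]
    unsigned = sum(patched)
    # Historic variant: treat each byte as a signed 8-bit value.
    signed = sum(b - 256 if b > 127 else b for b in patched)
    return unsigned, signed


def format_checksum(value: int) -> bytes:
    """Six octal digits, then NUL, then a space (eight bytes in total)."""
    return f"{value:06o}".encode("ascii") + b"\0 "


def checksum_ok(block: bytes) -> bool:
    """Accept the header if either the signed or the unsigned sum matches."""
    stored = block[148:156].strip(b" \0")
    if not stored:
        return False
    return int(stored.decode("ascii"), 8) in header_checksums(block)
```

For headers containing only ASCII bytes the two sums are identical; they differ only when a byte value above 127 appears, for example in a non-ASCII file name.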

Unix filesystems support multiple links (names) for the same file. If several such files appear in a tar archive, only the first one is archived as a normal file; the rest are archived as hard links, with the "name of linked file" field set to the first one's name. On extraction, such hard links should be recreated in the file system.

UStar format

Most modern tar programs read and write archives in the UStar (Unix Standard TAR[7][15]) format, introduced by the POSIX IEEE P1003.1 standard from 1988. It introduced additional header fields. Older tar programs will ignore the extra information (possibly extracting partially named files), while newer programs will test for the presence of the "ustar" string to determine if the new format is in use.

The UStar format allows for longer file names and stores additional information about each file. The maximum filename size is 255, but it is split among a preceding path "filename prefix" and the filename itself, so can be much less.[16]

Field offset Field size Field
0 156 (Several fields, same as in the old format)
156 1 Type flag
157 100 (Same field as in the old format)
257 6 UStar indicator, "ustar", then NUL
263 2 UStar version, "00"
265 32 Owner user name
297 32 Owner group name
329 8 Device major number
337 8 Device minor number
345 155 Filename prefix
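The offsets in this table are enough to detect a UStar header and to rebuild a full path from the prefix and name fields. The Python sketch below uses hypothetical function names. It assumes the prefix and name are joined with a slash and that a long path may only be split at a slash, which is how POSIX describes the two fields.

```python
def is_ustar(block: bytes) -> bool:
    """True if the block carries the "ustar" indicator and version "00"."""
    return block[257:263] == b"ustar\0" and block[263:265] == b"00"


def _field(block: bytes, start: int, length: int) -> str:
    raw = block[start:start + length].split(b"\0", 1)[0]
    return raw.decode("utf-8", "replace")


def full_path(block: bytes) -> str:
    """Join the prefix (offset 345) and name (offset 0) fields."""
    name = _field(block, 0, 100)
    if not is_ustar(block):
        return name  # old format: no prefix field
    prefix = _field(block, 345, 155)
    return f"{prefix}/{name}" if prefix else name


def split_path(path: str) -> tuple[str, str]:
    """Split a path into (prefix, name) that fit in 155 and 100 bytes."""
    raw = path.encode("utf-8")
    if len(raw) <= 100:
        return "", path
    for i, byte in enumerate(raw):
        # The split point must be a slash, which is not stored in either field.
        if byte == ord("/") and 0 < i <= 155 and 0 < len(raw) - i - 1 <= 100:
            return raw[:i].decode("utf-8"), raw[i + 1:].decode("utf-8")
    raise ValueError("path cannot be stored in a UStar header")
```

A path with no slash in a suitable position cannot be stored even when it is shorter than the nominal limit, which is why the usable length "can be much less".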

The type flag field can have the following values:

Type flag field
Value Meaning
'0' or (ASCII NUL) Normal file
'1' Hard link
'2' Symbolic link
'3' Character special
'4' Block special
'5' Directory
'6' FIFO
'7' Contiguous file
'g' Global extended header with meta data (POSIX.1-2001)
'x' Extended header with metadata for the next file in the archive (POSIX.1-2001)
'A'–'Z' Vendor specific extensions (POSIX.1-1988)
All other values Reserved for future standardization

The POSIX.1-1988 vendor-specific extensions using link flag values 'A'–'Z' partly have different meanings with different vendors. They are therefore seen as outdated and have been replaced by the POSIX.1-2001 extensions, which also include a vendor tag.

Type '7' (Contiguous file) is formally marked as reserved in the POSIX standard, but was meant to indicate files which ought to be contiguously allocated on disk. Few operating systems support creating such files explicitly, and hence most TAR programs do not support them, and will treat type 7 files as if they were type 0 (regular). An exception is older versions of GNU tar, when running on the MASSCOMP RTU (Real Time Unix) operating system, which supported an O_CTG flag to the open() function to request a contiguous file; however, that support was removed from GNU tar version 1.24 onwards.

POSIX.1-2001/pax

In 1997, Sun proposed a method for adding extensions to the tar format. This method was later accepted for the POSIX.1-2001 standard. This format is known as extended tar format or pax format. The new tar format allows users to add any type of vendor-tagged vendor-specific enhancements. The following tags are defined by the POSIX standard:

  • atime, mtime: all timestamps of a file in arbitrary resolution (most implementations use nanosecond granularity)
  • path: path names of unlimited length and character set coding
  • linkpath: symlink target names of unlimited length and character set coding
  • uname, gname: user and group names of unlimited length and character set coding
  • size: files with unlimited size (the historic tar format is limited to 8 GB)
  • uid, gid: userid and groupid without size limitation (the historic tar format is limited to a max. id of 2097151)
  • a character set definition for path names and user/group names (UTF-8)

In 2001, the Star program became the first tar to support the new format.[citation needed] In 2004, GNU tar added support for the new format,[17] though it does not yet write it by default.[18]

The pax format is designed so that all implementations able to read the UStar format will be able to read the pax format as well. The only exceptions are files that make use of extended features, such as longer file names. For compatibility, these are encoded in the tar files as special x or g type files, typically under a PaxHeaders.XXXX directory.[19]: exthdr.name  A pax-supporting implementation would make use of the information, while non-supporting ones like 7-Zip would process them as additional files.[20]
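The effect of an extended header can be observed with Python's tarfile module, which is mentioned later in this article. The sketch below writes a single member whose name is too long for the UStar fields and then reads the archive back. The member name and contents are made up for illustration, and the exact naming of the extended header entry is left to the library.

```python
import io
import tarfile

long_name = "d/" * 150 + "file.txt"  # 308 characters, too long for UStar

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    data = b"hello"
    info = tarfile.TarInfo(long_name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# The first header block in the stream is the extended header, type 'x'.
first_typeflag = buf.getvalue()[156:157]

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.getmembers()[0]
    recovered_name = member.name                 # full name, restored from "path"
    pax_path = member.pax_headers.get("path")    # the raw pax record

print(first_typeflag)
print(recovered_name == long_name)
```

A reader without pax support would see the type 'x' entry as an extra file and would take the member's name from the truncated name field of the UStar header that follows it.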

Features of the archival utilities

Besides creating and extracting archives, the functionality of the various archival utilities varies. For example, implementations might automatically detect the format of compressed TAR archives for extraction so the user does not have to specify it, and let the user limit adding files to those modified after a specified date.[21][22]

Uses

Command syntax

tar [-options] <name of the tar archive> [files or directories which to add into archive]

Basic options:

  • -c, --create — create a new archive;
  • -a, --auto-compress — additionally compress the archive with a compressor which will be automatically determined by the file name extension of the archive. If the archive's name ends with *.tar.gz then use gzip, if *.tar.xz then use xz, *.tar.zst for Zstandard etc.;
  • -r, --append — append files to the end of an archive;
  • -x, --extract, --get — extract files from an archive;
  • -f, --file — specify the archive's name;
  • -t, --list — show a list of files and folders in the archive;
  • -v, --verbose — show a list of processed files.

Basic usage

Create an archive file archive.tar from the file README.txt and directory src:

$ tar -cvf archive.tar README.txt src

Extract the contents of archive.tar into the current directory:

$ tar -xvf archive.tar

Create an archive file archive.tar.gz from the file README.txt and directory src, and compress it with gzip:

$ tar -cavf archive.tar.gz README.txt src

Extract the contents of archive.tar.gz into the current directory:

$ tar -xvf archive.tar.gz

Tarpipe

A tarpipe is the method of writing an archive to standard output and piping it to another tar process on its standard input, working in another directory, where it is unpacked. This process copies an entire source directory tree including all special files, for example:

$ tar cf - srcdir | tar x -C destdir

Software distribution

The tar format continues to be used extensively for open-source software distribution. *NIX-distributions use it in various source- and binary-package distribution mechanisms, with most software source code made available in compressed tar archives.[citation needed]

Limitations

The original tar format was created in the early days of Unix, and despite current widespread use, many of its design features are considered dated.[23]

Other formats have been created to address the shortcomings of tar.

File names

Due to the field size, the original TAR format was unable to store file paths and names in excess of 100 characters.

To overcome this problem while maintaining readability by existing TAR utilities, GNU tar stores file paths and names in excess of 100 characters in @LongLink entries, which are seen as ordinary files by TAR utilities unaware of this feature.[24] Similarly, the PAX format uses PaxHeaders entries.[25]

Attributes

Many older tar implementations neither record nor restore extended attributes (xattrs) or access-control lists (ACLs). In 2001, Star introduced support for ACLs and extended attributes, through its own tags for POSIX.1-2001 pax. bsdtar uses the star extensions to support ACLs.[26] More recent versions of GNU tar support Linux extended attributes, reimplementing star extensions.[27] A number of extensions are reviewed in the filetype manual for BSD tar, tar(5).[26]

Tarbomb

A tarbomb, in hacker slang, is a tarball containing a large number of items whose contents are written to the current directory or some other existing directory when untarred instead of the directory created by the tarball specifically for the extracted outputs.[28] It is at best an inconvenience to the user, who is obliged to identify and delete a number of files interspersed with the directory's other contents. Such behavior is considered bad etiquette on the part of the archive's creator.

A related problem is the use of absolute paths or parent directory references when creating tar files. Files extracted from such archives will often be created in unusual locations outside the working directory and, like a tarbomb, have the potential to overwrite existing files. However, modern versions of FreeBSD and GNU tar do not create or extract absolute paths and parent-directory references by default, unless it is explicitly allowed with the flag -P or the option --absolute-names. The bsdtar program, which is also available on many operating systems and is the default tar implementation on Mac OS X v10.6, also does not follow parent-directory references or symbolic links.[29][failed verification]

If a user has only a very old tar available, which does not feature those security measures, these problems can be mitigated by first examining a tar file using the command tar tf archive.tar. This command does not extract any files, but displays the names of all files in the archive, so that problematic files can be excluded afterwards. If any are problematic, the user can create a new empty directory and extract the archive into it, or avoid the tar file entirely. Most graphical tools can display the contents of the archive before extracting them. Vim can open tar archives and display their contents. GNU Emacs is also able to open a tar archive and display its contents in a dired buffer.
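The same inspection can be scripted. The following Python sketch uses the standard tarfile module to list member names without extracting anything, flagging absolute paths and parent-directory references and reporting the set of top-level entries. The function names and the build_archive helper are hypothetical, and the checks are a rough heuristic rather than a complete safety audit; symbolic links, for instance, are not examined.

```python
import io
import posixpath
import tarfile


def suspicious_members(tar: tarfile.TarFile) -> list:
    """Names that are absolute or contain a parent-directory reference."""
    flagged = []
    for member in tar.getmembers():
        name = member.name
        if name.startswith("/") or ".." in name.split("/"):
            flagged.append(name)
    return flagged


def top_level_entries(tar: tarfile.TarFile) -> set:
    """First path component of every member; more than one suggests a tarbomb."""
    return {posixpath.normpath(m.name).split("/")[0] for m in tar.getmembers()}


def build_archive(names) -> io.BytesIO:
    """Hypothetical helper: an in-memory archive of empty files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            tar.addfile(tarfile.TarInfo(name))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_archive(["pkg/a", "pkg/../../evil", "/etc/x"])) as tar:
    print(suspicious_members(tar))   # ['pkg/../../evil', '/etc/x']
```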

Random access

The tar format was designed for streaming to tape backup devices, without a centralized index or table of contents for files and their properties. Instead, the metadata for each file (such as name, size, time stamps) is stored in a header before that file. The archive must be read sequentially to list or extract files. For large tar archives, this causes a performance penalty, making tar archives unsuitable for situations that often require random access to individual files.

In turn, this design makes TAR archives resilient against damage from missing portions, in both the form of digital files and physical tape.[citation needed] A truncated TAR file with missing parts at either end still allows recovering the parts that are not missing, including the file paths and file names and metadata, by starting from the first TAR header that is not missing.[30]

With a well-formed tar file stored on a seekable (i.e. allows efficient random reads) medium, the tar program can still relatively quickly (in linear time relative to file count) look for a file by skipping file reads according to the "size" field in the file headers. This is the basis for option -n in GNU tar. When a tar file is compressed whole, the compression format, being usually non-seekable, prevents this optimization from being done.[31] To maintain seekability, tar files must be also concatenated properly, by removing the trailing zero block at the end of each file.[32]
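The skipping technique can be sketched in Python as follows. This is not how GNU tar implements its option; it is a simplified illustration that reads only the 512-byte headers and seeks past the data. It assumes a seekable, uncompressed archive whose size fields are plain octal, and it ignores the prefix field, base-256 sizes, and extended headers.

```python
import io
import tarfile

BLOCK = 512


def list_by_seeking(f) -> list:
    """Return (name, size) pairs, reading headers only and seeking over data."""
    entries = []
    while True:
        header = f.read(BLOCK)
        if len(header) < BLOCK or header == bytes(BLOCK):
            break  # end of file or first zero-filled block
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        digits = header[124:136].strip(b" \0")
        size = int(digits.decode("ascii"), 8) if digits else 0
        entries.append((name, size))
        # Skip the file data, which is padded to a multiple of 512 bytes.
        f.seek((size + BLOCK - 1) // BLOCK * BLOCK, io.SEEK_CUR)
    return entries


# Build a small in-memory archive to demonstrate the walk.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for member_name, payload in (("a.txt", b"hello"), ("b.bin", bytes(1000))):
        info = tarfile.TarInfo(member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

print(list_by_seeking(buf))  # [('a.txt', 5), ('b.bin', 1000)]
```

The number of reads grows with the number of members rather than with the total archive size, which is the linear-time behaviour described above.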

Duplicates

Another issue with the tar format is that it allows several (possibly different) files in an archive to have identical paths and filenames. When extracting such an archive, usually the later version of a file overwrites the earlier one.

This can create a non-obvious tarbomb, which technically does not contain files with absolute paths or references to parent directories, but still causes files outside the current directory to be overwritten. For example, an archive may contain two entries with the same path and filename, the first of which is a symlink to some location outside the current directory and the second of which is a regular file. Extracting such an archive with some tar implementations may cause writing to the location pointed to by the symlink.
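Both situations can be detected before extraction by scanning the member list. The Python sketch below uses the standard tarfile module. The function names are hypothetical, and the second check covers only the exact pattern described above, in which a symlink entry is followed by a non-symlink entry of the same name.

```python
import io
import tarfile
from collections import Counter


def duplicate_names(tar: tarfile.TarFile) -> list:
    """Member names that occur more than once in the archive."""
    counts = Counter(m.name for m in tar.getmembers())
    return sorted(name for name, n in counts.items() if n > 1)


def symlink_then_file(tar: tarfile.TarFile) -> list:
    """Names that appear first as a symlink and later as another entry type."""
    seen_symlinks = set()
    risky = []
    for member in tar.getmembers():  # members are returned in archive order
        if member.name in seen_symlinks and not member.issym():
            risky.append(member.name)
        if member.issym():
            seen_symlinks.add(member.name)
    return risky


# Demonstration archive: "x" is a symlink, then "x" again as a regular file.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    link = tarfile.TarInfo("x")
    link.type = tarfile.SYMTYPE
    link.linkname = "/tmp/outside"
    tar.addfile(link)
    tar.addfile(tarfile.TarInfo("x"))
    tar.addfile(tarfile.TarInfo("y"))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar:
    print(duplicate_names(tar))     # ['x']
    print(symlink_then_file(tar))   # ['x']
```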

Key implementations

Historically, many systems have implemented tar, and many general file archivers have at least partial support for tar (often using one of the implementations below). The history of tar is a story of incompatibilities, known as the "tar wars". Most tar implementations can also read and create cpio and pax (the latter actually is a tar-format with POSIX-2001-extensions).

Key implementations in order of origin:

  • Solaris tar is based on the original Unix V7 tar and is the default on the Solaris operating system
  • GNU tar is the default on most Linux distributions. It is based on the public domain implementation pdtar which started in 1987. Recent versions can use various formats, including ustar, pax, GNU and v7 formats.
  • FreeBSD tar (also BSD tar) has become the default tar on most Berkeley Software Distribution-based operating systems including Mac OS X. The core functionality is available as libarchive for inclusion in other applications. This implementation automatically detects the format of the file and can extract from tar, pax, cpio, zip, rar, ar, xar, rpm and ISO 9660 cdrom images. It also comes with a functionally equivalent cpio command-line interface.
  • Schily tar, better known as star (/ˈɛsˌtɑːr/, ESS-tar),[33] is historically significant as some of its extensions were quite popular. First published in April 1997,[34] its developer has stated that he began development in 1982.[35]
  • Python tarfile module supports multiple tar formats, including ustar, pax and gnu; it can read but not create V7 format and the SunOS tar extended format; pax is the default format for creation of archives.[36] Available since 2003.[37]

Additionally, most pax and cpio implementations can read and create multiple types of tar files.

Suffixes for compressed files

tar archive files usually have the file suffix .tar (e.g. somefile.tar).

A tar archive file contains uncompressed byte streams of the files which it contains. To achieve archive compression, a variety of compression programs are available, such as gzip, bzip2, xz, lzip, lzma, zstd, or compress, which compress the entire tar archive. Typically, the compressed form of the archive receives a filename by appending the format-specific compressor suffix to the archive file name. For example, a tar archive archive.tar is named archive.tar.gz when it is compressed by gzip.

Popular tar programs like the BSD and GNU versions of tar support the command-line options Z (compress), z (gzip), and j (bzip2) to compress or decompress the archive file upon creation or unpacking. Relatively recent additions include --lzma (LZMA), --lzop (lzop), --xz or J (xz), --lzip (lzip), and --zstd.[38] The decompression of these formats is handled automatically if supported filename extensions are used, and compression is handled automatically using the same filename extensions if the option --auto-compress (short form -a) is passed to an applicable version of GNU tar.[16] BSD tar detects an even wider range of compressors (lrzip, lz4), using not the filename but the data within.[39] Unrecognized formats are to be manually compressed or decompressed by piping.

MS-DOS's 8.3 filename limitations resulted in additional conventions for naming compressed tar archives. However, this practice has declined with FAT now offering long filenames.

Tar archiving is often used together with a compression method, such as gzip, to create a compressed archive. The combination of the files in the archive is compressed as one unit.
File suffix equivalents[16]
Compressor Long Short
bzip2 .tar.bz2 .tb2, .tbz, .tbz2, .tz2
gzip .tar.gz .taz, .tgz
lzip .tar.lz
lzma .tar.lzma .tlz
lzop .tar.lzo
xz .tar.xz .txz
compress .tar.Z .tZ, .taZ
zstd .tar.zst .tzst
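The suffix table can be expressed as a simple lookup, in the spirit of automatic compressor selection by file name. The Python sketch below is illustrative only and does not reproduce the logic of any particular tar implementation. Matching is case-sensitive, since .taz (gzip) and .taZ (compress) differ only in case.

```python
# Suffixes copied from the table above, keyed by compressor name.
SUFFIXES = {
    "bzip2": (".tar.bz2", ".tb2", ".tbz", ".tbz2", ".tz2"),
    "gzip": (".tar.gz", ".taz", ".tgz"),
    "lzip": (".tar.lz",),
    "lzma": (".tar.lzma", ".tlz"),
    "lzop": (".tar.lzo",),
    "xz": (".tar.xz", ".txz"),
    "compress": (".tar.Z", ".tZ", ".taZ"),
    "zstd": (".tar.zst", ".tzst"),
}


def compressor_for(filename: str):
    """Return the compressor implied by the file name, or None."""
    for compressor, suffixes in SUFFIXES.items():
        if filename.endswith(suffixes):  # str.endswith accepts a tuple
            return compressor
    return None


print(compressor_for("archive.tar.gz"))  # gzip
print(compressor_for("archive.tar"))     # None
```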

from Grokipedia
In computing, tar (short for tape archiver) is a file format and command-line utility designed to bundle multiple files and directories into a single archive file, originally intended for writing to magnetic tape drives for backup and storage purposes. The utility first appeared in the Seventh Edition of Unix in January 1979, where it served as a tool to save and restore files on magnetic tape. The tar format consists of a stream of 512-byte blocks containing file headers followed by the file data, enabling the preservation of file permissions, timestamps, and ownership information across systems. In 1988, the POSIX.1 standard formalized the format as "ustar" (Unix Standard Tape ARchive), which extended the original design to support longer filenames, symbolic links, and device files while ensuring portability.

Key operations of the tar utility include creating new archives (c option), extracting files from archives (x option), listing archive contents (t option), appending files (r option), and updating archives (u option), all controlled via command-line options and supporting blocking factors for tape devices. The ustar format limits individual file sizes to 8 gigabytes and pathnames to 256 characters, though modern implementations like GNU tar extend this with formats such as GNU-specific headers or the POSIX.1-2001 pax interchange format for larger files and extended attributes.

Despite its origins in tape archiving, tar remains a foundational tool in Unix-like operating systems for software distribution, system backups, and data packaging, often combined with compression utilities like gzip (resulting in .tar.gz files) or bzip2 (.tar.bz2) to reduce archive size. The POSIX standard recommends migrating to the pax utility for enhanced portability, but tar continues to be widely used due to its simplicity and backward compatibility.

Background

History

The tar utility originated in the Unix Seventh Edition (V7), released by Bell Labs in January 1979, where it was introduced as a simple tool for creating tape archives as part of the system's backup capabilities, replacing the earlier tp program. It was designed to bundle multiple files into a single archive stream suitable for tape storage, complementing earlier backup tools like dump, which focused on incremental filesystem dumps. This initial implementation emphasized portability and ease of use for archiving directories and files onto tape devices, marking a shift toward standardized archiving in early Unix environments.

In the 1980s, as Unix variants proliferated beyond AT&T's proprietary systems, the need for freely redistributable software grew. John Gilmore developed a public-domain implementation of tar in late 1987, initially as pdtar, which was posted to Usenet and became highly influential for its clean code and compatibility with emerging standards drafts. This version addressed limitations in proprietary implementations and facilitated adoption in academic and open-source communities, laying the groundwork for further enhancements. Gilmore's pdtar became the basis for GNU tar, first released in 1988 as part of the GNU Project, introducing features like multi-volume support to handle archives spanning multiple tapes or disks during the era of limited storage capacities. BSD variants, such as those in 4.3BSD (1986), also incorporated and extended tar with improvements for network file systems and longer pathnames, contributing to its divergence across Unix lineages.

Standardization efforts began in the late 1980s to unify tar's behavior across Unix systems. The utility was included in POSIX.1-1988 (IEEE Std 1003.1-1988), which defined the basic tar format and command interface, including the USTAR (Unix Standard Tape ARchive) format for better support of long filenames, permissions, symbolic links, and device files, covering core operations like archiving and extraction. Subsequent POSIX revisions expanded on this foundation: POSIX.1-2001 introduced pax extensions for enhanced portability, including global extended headers for attributes beyond the original limits. These standards, maintained through The Open Group, continued evolving; POSIX.1-2024 reaffirms tar's role while incorporating modern filesystem considerations, ensuring its relevance in contemporary systems.

Rationale

The tar command emerged in early Unix systems to address the need for a straightforward, portable mechanism to bundle multiple files and directories into a single archive, facilitating backups and transfers across limited hardware environments like those at Bell Labs in the late 1970s. This design responded to the practical demands of Unix developers who required a tool capable of handling file collections without relying on complex proprietary formats, ensuring compatibility across different Unix implementations and even non-Unix systems through simple binary streams. By focusing on core archiving functionality, tar enabled efficient storage on sequential media and easy distribution, in keeping with the era's resource constraints.

A key motivation in tar's design was the preservation of essential file metadata, such as permissions, timestamps, ownership details, and directory structures, to allow faithful restoration of files in their original configuration. This capability was vital for system administration tasks, where altering metadata could compromise security or functionality, and it distinguished tar from simpler concatenation tools by providing a reliable way to capture the full context of file system objects. Without such preservation, backups would lose critical attributes, rendering restores incomplete or insecure.

Originally tailored for tape archiving (hence the name "tape archiver"), tar's format was optimized for sequential, appendable operations on magnetic tapes, the predominant medium in early labs. This choice prioritized streamability over random access, allowing archives to be written and read in a continuous flow suitable for tape drives, while later adaptations extended its use to disk files and network pipes without fundamental changes. The design reflected a deliberate trade-off: simplicity and modularity over integrated features like compression or indexing, encouraging composition with separate tools (e.g., compress or gzip) to maintain a lean core while supporting extensible workflows.

Tar's development was influenced by the shortcomings of predecessor tools, aiming to create more robust, streamable archives that could handle growing complexities in evolving Unix versions. By emphasizing portability and appendability, it overcame earlier limitations in backup utilities, establishing a format that remains foundational for Unix-like systems despite shifts in storage technology.

File Format

Basic Header Structure

The basic header structure in a tar archive uses a fixed 512-byte block for each file or directory entry, providing essential metadata in a contiguous, ASCII-encoded layout to ensure portability across systems. This block precedes the file's data blocks (if any) and is designed for sequential reading from tape archives, with all fields occupying exact byte positions without variable-length encoding. The structure supports core attributes like permissions, ownership, size, and timestamps, while the remaining bytes after the defined fields are filled with null bytes (0x00) for padding to reach precisely 512 bytes.

Key fields in the header include the filename (bytes 0-99, up to 100 characters, null-terminated if shorter), file mode (bytes 100-107, 8 bytes representing octal permissions), user ID (bytes 108-115, 8 bytes in octal), group ID (bytes 116-123, 8 bytes in octal), file size (bytes 124-135, 12 bytes in octal for the byte length of the file data), modification time (bytes 136-147, 12 bytes in octal as seconds since the Unix epoch), and link name (bytes 157-256, 100 bytes for the target path in case of links). The link indicator (byte 156, 1 byte) is NUL (ASCII 0) or '0' for regular files, '1' for hard links, and '2' for symbolic links. Numeric fields like size, mode, UID, GID, and mtime are encoded as right-justified octal strings in printable ASCII digits, padded with leading spaces (0x20) to their full width, and typically terminated by a space or null byte. This legacy 100-byte limit on filenames and link names restricts paths to relatively short lengths, often requiring workarounds for longer names in modern use.

The checksum field (bytes 148-155, 8 bytes in octal) ensures integrity by verifying the header itself; it is computed as the sum of all 512 bytes treated as unsigned characters, with the checksum field itself counted as eight space characters (0x20) during calculation instead of its actual contents. The resulting sum is then converted to an 8-byte octal string (right-justified, leading spaces, terminated by space or null) and inserted into the field. Upon reading, the sum is recomputed in the same way and compared with the stored value to detect corruption of the header.

To denote the end of the archive, tar formats require two consecutive 512-byte blocks filled entirely with binary zeros (0x00), serving as an explicit terminator regardless of the number of entries or padding. This marker allows readers to detect the archive's conclusion even if the underlying storage (like tape) ends abruptly.
| Field | Bytes | Length (bytes) | Format | Description |
|---|---|---|---|---|
| name | 0-99 | 100 | ASCII string, null-padded | Filename or directory path |
| mode | 100-107 | 8 | Octal ASCII, space-padded | File permissions (e.g., 0644) |
| uid | 108-115 | 8 | Octal ASCII, space-padded | User ID (owner) |
| gid | 116-123 | 8 | Octal ASCII, space-padded | Group ID |
| size | 124-135 | 12 | Octal ASCII, space-padded | File size in bytes (0 for directories) |
| mtime | 136-147 | 12 | Octal ASCII, space-padded | Modification time (Unix timestamp) |
| chksum | 148-155 | 8 | Octal ASCII, space-padded | Header checksum |
| typeflag | 156 | 1 | ASCII character | Link indicator (NUL or space for files/dirs, '1' for hard links) |
| linkname | 157-256 | 100 | ASCII string, null-padded | Target path for links (unused for regular files) |
| padding | 257-511 | 255 | Null bytes (0x00) | Unused space |
This table outlines the fixed layout of the basic header, totaling 512 bytes, as defined in early Unix implementations and preserved for backward compatibility.
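As an illustration of the layout and checksum rule described above, the following Python sketch decodes a few fields from a header block and recomputes the checksum. The helper names `parse_octal` and `header_checksum` are hypothetical, and the sample archive is generated in memory with Python's standard `tarfile` module purely so that there is a header to inspect.

```python
import io
import tarfile

def parse_octal(field: bytes) -> int:
    """Decode an octal ASCII field padded with spaces, zeros, or NULs."""
    text = field.strip(b" \x00")
    return int(text, 8) if text else 0

def header_checksum(block: bytes) -> int:
    """Sum all 512 header bytes, counting the chksum field (148-155) as spaces."""
    if len(block) != 512:
        raise ValueError("a tar header block is exactly 512 bytes")
    return sum(block[:148]) + 8 * 0x20 + sum(block[156:])

# Build a one-member archive in memory so the example needs no files on disk.
payload = b"hello, tar\n"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    info.mode = 0o644
    tar.addfile(info, io.BytesIO(payload))

archive_bytes = buf.getvalue()
header = archive_bytes[:512]

# Slice the fixed byte ranges listed in the table.
name = header[0:100].rstrip(b"\x00").decode("ascii")
mode = parse_octal(header[100:108])
size = parse_octal(header[124:136])
stored_checksum = parse_octal(header[148:156])

print(name, oct(mode), size, stored_checksum == header_checksum(header))
```

The same byte ranges apply to every header in the archive; a reader would advance by 512 bytes plus the member's data rounded up to a multiple of 512 to reach the next one.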

UStar Format

The UStar format, short for Unix Standard TAR, was introduced in the POSIX.1-1988 standard to enhance portability of tar archives across Unix systems, addressing limitations in the original format such as short filename lengths and lack of support for user and group names. This extension builds on the basic 512-byte header block structure while adding fields to support longer paths and additional metadata, enabling pathnames up to 256 characters through a combination of the 100-byte name field and a new 155-byte prefix field that is joined to the name with a slash separator.

Key additions in the UStar header include a 6-byte magic field set to "ustar" followed by a null byte, and a 2-byte version field set to "00", which identify the format and ensure recognition by compliant tools. It also introduces 32-byte fields for uname (user name) and gname (group name), allowing archival of ownership information beyond numeric IDs, as well as 8-byte fields for devmajor and devminor to represent device numbers for special files like character and block devices. The format supports POSIX device types through type flags in a single-byte field, including '3' for character special files, '4' for block special files, and '7' for contiguous files that can be treated as regular files. UStar itself defines no fields for extended attributes such as access control lists (ACLs); support for these arrived through later extension mechanisms.

The checksum in UStar is calculated in the same way as in the original format: as a sum over all 512 bytes of the header, with the checksum field itself treated as filled with spaces during computation to avoid self-reference. Because the sum spans the whole block, it also covers the newly added fields such as the prefix.

For backward compatibility with pre-UStar tar readers, the format maintains the core structure and positions the new fields in unused space of the original header, allowing older tools to ignore unknown bytes while still extracting basic file data. This ensures UStar archives remain readable on legacy systems without requiring format conversion, although older readers see only the contents of the 100-byte name field.
| Field Name | Offset (bytes) | Length (bytes) | Description |
|---|---|---|---|
| magic | 257 | 6 | "ustar" followed by null |
| version | 263 | 2 | "00" |
| uname | 265 | 32 | User name (null-terminated) |
| gname | 297 | 32 | Group name (null-terminated) |
| devmajor | 329 | 8 | Device major number (octal) |
| devminor | 337 | 8 | Device minor number (octal) |
| prefix | 345 | 155 | Path prefix for long filenames (null-terminated) |
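The way a long pathname is divided between the prefix and name fields can be sketched as follows. This is an illustrative Python helper under the stated field sizes (155 and 100 bytes); the function name `split_ustar_path` is hypothetical, and real implementations may choose a different split point when several are valid.

```python
def split_ustar_path(path: str) -> tuple:
    """Split path into (prefix, name) so that prefix + "/" + name == path.

    Raises ValueError when no slash position satisfies the field limits.
    """
    raw = path.encode("utf-8")
    if len(raw) <= 100:
        return "", path  # fits in the name field alone; prefix stays empty

    # Try each slash from the right: everything before it becomes the
    # prefix, everything after it becomes the name.
    for i in range(len(raw) - 1, 0, -1):
        if raw[i:i + 1] == b"/":
            prefix, name = raw[:i], raw[i + 1:]
            if len(prefix) <= 155 and 0 < len(name) <= 100:
                return prefix.decode("utf-8"), name.decode("utf-8")

    raise ValueError("path does not fit in the UStar prefix and name fields")


prefix, name = split_ustar_path("a" * 150 + "/" + "b" * 90)
print(len(prefix), len(name))
```

A path whose final component alone exceeds 100 bytes, or whose leading directories cannot be cut at a slash within 155 bytes, cannot be represented in UStar at all, which is the gap the later pax extensions close.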

POSIX.1-2001 and pax Extensions

The POSIX.1-2001 standard introduced significant extensions to the tar archive format through the pax interchange format, enabling support for modern file systems and attributes beyond the limitations of prior formats. This format maintains backward compatibility with ustar archives while adding flexibility via extended header records, which precede the regular file data and allow for the storage of additional metadata.

Extended headers in the pax format utilize specific type flags to encode information: type flag 'x' denotes a per-file extended header, containing metadata applicable only to the immediately following file, while type flag 'g' indicates a global extended header that applies to all subsequent files in the archive until overridden. These headers consist of key-value pairs, where keys are standardized keywords (such as path for filenames, size for file sizes, mtime for modification times, uid and gid for ownership, and linkpath for links) or vendor-specific extensions prefixed with vendor identifiers, separated by an equals sign from their decimal or UTF-8 encoded values. The key-value structure permits arbitrary attributes, overcoming ustar field length restrictions: for instance, the path keyword supports filenames and paths exceeding 256 characters, and the size keyword enables representation of files larger than 8 GiB using arbitrary-length decimal strings rather than fixed octal fields.

The pax format supports sparse files through implementation-defined keywords in extended headers, such as GNU.sparse.map in GNU tar, to describe maps of allocated blocks and holes (unallocated regions) and optimize storage by omitting zero-filled holes. Global extended headers, via type flag 'g', facilitate archive-wide metadata, such as ownership names (the uname and gname keywords, which associate symbolic names with numeric IDs) or default attributes applied across multiple files, enhancing portability in heterogeneous environments. For incremental archiving, implementations may use modification times (mtime) to identify files changed since the last backup, though specific mechanisms vary by tool.

Subsequent revisions refined these extensions for greater robustness. POSIX.1-2008 mandated UTF-8 encoding for all textual fields in extended headers, including paths and names, to ensure consistent handling of international characters across locales. POSIX.1-2017 further emphasized security enhancements, such as recommendations for implementations to validate paths against traversal attempts (e.g., rejecting entries with leading slashes or excessive parent directory references) to mitigate risks like tarbomb extractions.
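The key-value records in an extended header can be sketched in Python as follows. The text above does not spell out the record framing; this sketch assumes the pax convention that each record is written as `<length> <keyword>=<value>\n`, where the decimal length counts the entire record including its own digits. The function names `pax_record` and `parse_pax_records` are hypothetical.

```python
def pax_record(keyword: str, value: str) -> bytes:
    """Encode one extended-header record as b"<length> <keyword>=<value>\\n"."""
    body = f" {keyword}={value}\n".encode("utf-8")
    # The length field counts its own digits, so iterate until it is stable.
    length = len(body)
    while True:
        total = len(body) + len(str(length))
        if total == length:
            break
        length = total
    return str(length).encode("ascii") + body


def parse_pax_records(data: bytes) -> dict:
    """Decode a sequence of records back into a keyword -> value mapping."""
    records = {}
    pos = 0
    while pos < len(data):
        space = data.index(b" ", pos)
        length = int(data[pos:space])
        # Skip the separating space and drop the trailing newline.
        record = data[space + 1:pos + length - 1]
        keyword, _, value = record.partition(b"=")
        records[keyword.decode("utf-8")] = value.decode("utf-8")
        pos += length
    return records


long_path = "deep/" * 80 + "file.txt"          # far beyond the ustar limits
blob = pax_record("path", long_path) + pax_record("size", "8589934592")
print(parse_pax_records(blob)["size"])
```

Because values are plain decimal or UTF-8 strings of any length, a size of 8589934592 bytes (one more than the largest value of the 12-byte octal field) poses no difficulty here.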

Core Functionality

Key Features

The tar utility is designed to preserve the hierarchical directory structure of filesystems, maintaining the full tree organization including subdirectories and their relative paths during archiving and extraction. This capability ensures that the archived files can be restored to their original layout, distinguishing tar from simpler concatenation tools.

A core strength of tar lies in its retention of essential file metadata, including permissions (stored as the mode field), ownership information (user ID and group ID via the uid and gid fields), modification timestamps (mtime), and support for both hard and symbolic links (indicated by type flags such as '1' for hard links and '2' for symlinks). These elements are encoded in the archive header for each member, allowing faithful reproduction of file attributes upon extraction, which is critical for system backups and software distribution.

Tar supports multi-volume archives, enabling the creation of large archives split across multiple media or files, such as tapes, by prompting for volume changes and then continuing the operation. This feature, provided in GNU tar through the --multi-volume option, accommodates storage limitations on older or constrained devices while maintaining archive integrity across volumes.

The append mode allows users to add new files to an existing tar archive without needing to extract and recreate it, writing the new members at the end of the archive while preserving the original contents. This is particularly useful for ongoing backup scenarios where only changes need to be incorporated.

For incremental backups, GNU tar provides the --listed-incremental option, which uses snapshot files to compare and archive only modified or new files since the last backup, based on timestamp and inode comparisons. This method saves storage and time by avoiding full re-archiving of unchanged data.

Due to its adherence to standardized formats like POSIX.1-1988 and subsequent extensions, tar exhibits high portability across operating systems, so that archives created on one platform can generally be read and extracted on another.

Command Syntax

The tar command employs the general syntax tar [options...] [archive-file] [files...], where options define the operation and modifiers, the optional archive-file specifies the target archive (defaulting to standard output or an environment-defined device), and files... lists the paths to process or patterns for selection.

Options fall into key categories, including action modes that determine the primary operation: -c or --create to form a new archive from specified files; -x or --extract (or --get) to unpack files from an existing archive; -t or --list to display archive contents without extraction; -r or --append to add files to the end of the archive; and -u or --update to append only files newer than their counterparts in the archive. File selection options refine which paths are included or excluded, such as --exclude=PATTERN to skip files matching a given pattern and, in bsdtar, --include=PATTERN to limit processing to matching files only. Output control options manage archive handling and working directories, notably -f, --file=NAME to designate the archive file or device and -C, --directory=DIR to change to directory DIR before performing the operations that follow it.

GNU tar supports both short and long option forms, with short options prefixed by a single hyphen (e.g., -f) and long options by two hyphens (e.g., --file=NAME). Short options can be bundled after a single hyphen, with an option that takes an argument placed last so that its argument follows (e.g., -cf archive.tar combining --create and --file); the argument may also be attached directly to its option (e.g., -farchive.tar). The order of most options matters little, although position-sensitive options such as -C affect only the arguments that follow them, and tar treats non-option arguments as file or member names.

Environment variables influence default behaviors, such as TAPE, which sets the archive name or device when -f is omitted, allowing invocation without explicit file specification. For error handling, GNU tar's --warning=KEYWORD option enables or suppresses warnings for non-fatal conditions; for example, the file-changed keyword controls the warning issued when a file changes while it is being read, and prefixing a keyword with no- (as in --warning=no-file-changed) suppresses it without halting execution. Options like --multi-volume further support features such as spanning archives across multiple media.

Basic Operations

The basic operation for creating a tar archive involves using the --create (or -c) option combined with --file (or -f) to specify the archive name, followed by the paths of the files or directories to include. For example, the command tar -cf archive.tar file1.txt file2.txt bundles the specified files into a single archive file named archive.tar. When including directories, such as tar -cf archive.tar directory/, the command recursively adds all contents within that directory while preserving the internal structure. Wildcards can be used in the file list to select multiple items efficiently, as the shell expands them before tar processes the arguments; for instance, tar -cf archive.tar *.txt archives all files ending in .txt in the current directory.

Regarding paths, GNU tar by default stores relative paths in the archive and strips any leading slash from absolute paths to avoid embedding the full filesystem hierarchy; the -P or --absolute-names option can be used if absolute paths must be preserved. This behavior helps prevent issues when extracting archives on different systems.

To extract files from a tar archive, the --extract (or -x) option is employed alongside -f to specify the archive, as in tar -xf archive.tar, which restores the contents to the current directory while recreating the original directory structure. For path adjustment during extraction, the --strip-components=N option removes the first N leading components from member names; for example, tar -xf archive.tar --strip-components=1 discards the top-level directory, placing its contents directly in the current directory instead.

Listing the contents of an archive without extracting uses the --list (or -t) option with -f, such as tar -tf archive.tar, which outputs the names of all members in the archive. Adding the -v or --verbose flag provides detailed metadata, including permissions, ownership, sizes, and modification times, as in tar -tvf archive.tar, allowing inspection of details without modifying the filesystem.

A common pitfall during extraction is the potential overwriting of existing files with the same names, which tar performs by default without prompting. This can be mitigated using the --keep-old-files (or -k) option, which treats such conflicts as errors and does not replace the existing files; for instance, tar -xkf archive.tar reports an error for each conflicting file rather than overwriting it. Alternatively, --skip-old-files silently skips existing files without reporting an error, which suits non-interactive scripts.
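For scripted use, the create, list, and extract operations described above have rough counterparts in Python's standard `tarfile` module. The sketch below is an approximation rather than an exact equivalent of the command-line behavior; the directory and file names are placeholders, and the component-stripping loop is a hand-written stand-in for --strip-components=1.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as work:
    # Create a small placeholder tree: project/README.txt, project/docs/guide.txt
    src = os.path.join(work, "project")
    os.makedirs(os.path.join(src, "docs"))
    for rel in ("README.txt", os.path.join("docs", "guide.txt")):
        with open(os.path.join(src, rel), "w") as fh:
            fh.write("example\n")

    archive = os.path.join(work, "archive.tar")

    # Roughly: tar -cf archive.tar project/
    with tarfile.open(archive, "w") as tar:
        tar.add(src, arcname="project")

    # Roughly: tar -tf archive.tar
    with tarfile.open(archive, "r") as tar:
        listed = sorted(tar.getnames())

    # Roughly: tar -xf archive.tar --strip-components=1 -C out
    out = os.path.join(work, "out")
    os.makedirs(out)
    with tarfile.open(archive, "r") as tar:
        for member in tar.getmembers():
            parts = member.name.split("/")[1:]
            if not parts:
                continue  # the top-level directory itself is dropped
            member.name = "/".join(parts)
            tar.extract(member, out)

    extracted = sorted(
        os.path.relpath(os.path.join(root, f), out).replace(os.sep, "/")
        for root, _dirs, files in os.walk(out)
        for f in files
    )

print(listed)
print(extracted)
```

Like tar itself, `extract` overwrites existing files by default, so a script that needs --keep-old-files semantics would have to check for existing paths before extracting each member.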

Practical Applications

Piping and Compression Integration

One of the key strengths of tar lies in its ability to integrate with compression utilities through Unix pipelines, enabling the creation of compressed archives without generating intermediate uncompressed files. This process, often referred to as tar piping, involves directing tar's output stream to a compressor as it is produced. For instance, the command tar cf - directory/ | gzip > archive.tar.gz creates an uncompressed tar stream from the specified directory and pipes it directly to gzip for compression, producing a .tar.gz file. This streaming approach uses the Unix pipe mechanism to process data on the fly, minimizing temporary storage needs and reducing overall disk I/O, which is particularly beneficial for large datasets or resource-constrained environments.

GNU tar enhances this integration by providing built-in command-line flags that invoke common compression tools, eliminating the need for explicit pipe syntax in many cases. The -z flag invokes gzip for both creation (tar czf archive.tar.gz directory/) and extraction (tar xzf archive.tar.gz), while -j pairs with bzip2 (tar cjf archive.tar.bz2 directory/) and -J with xz (tar cJf archive.tar.xz directory/) for higher compression ratios at the cost of increased CPU usage. When reading an archive, GNU tar can detect the compression format automatically and invoke the appropriate decompressor; when creating one, the -a (--auto-compress) option selects the compressor from the archive's file extension.

Piping works in the other direction for extraction workflows, such as gunzip -c archive.tar.gz | tar xf -, which decompresses the input stream and feeds it to tar for unpacking without writing the decompressed tar to disk.

Historically, early tar implementations required manual piping to external compressors like compress or gzip as separate steps, but GNU tar introduced integrated flags, starting with the -z option for gzip in versions around 1992, marking a shift toward simpler compressed archiving. This evolution popularized compressed tar formats on Unix-like systems, as the combined operations reduce processing time and storage overhead for backups and distributions.
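The streaming behavior described above has an analogue in Python's `tarfile` module, whose `|` modes treat the archive as a non-seekable stream. The following sketch uses in-memory buffers as stand-ins for a pipe; the member name and payload are placeholders.

```python
import io
import tarfile

payload = b"streamed data\n"

# Writing: "w|gz" emits a gzip-compressed tar stream sequentially, much like
#   tar cf - directory/ | gzip > archive.tar.gz
sink = io.BytesIO()
with tarfile.open(fileobj=sink, mode="w|gz") as tar:
    info = tarfile.TarInfo(name="data.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
compressed = sink.getvalue()

# Reading: "r|*" consumes a stream front to back and detects the compression
# from the data itself, much like
#   gunzip -c archive.tar.gz | tar xf -
contents = {}
with tarfile.open(fileobj=io.BytesIO(compressed), mode="r|*") as tar:
    for member in tar:
        # In stream mode each member must be read before moving to the next.
        contents[member.name] = tar.extractfile(member).read()

print(contents)
```

In stream mode the reader cannot seek backwards, which mirrors the constraint of a real pipe: members are available only in archive order.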

Software Distribution and Packaging

Tar archives, commonly known as tarballs, play a central role in source code distribution for open-source projects, bundling source files, build scripts, documentation, and configuration files into a single, portable file while preserving file permissions, ownership, and directory structures essential for automated builds. In build systems using the GNU Autotools, such as Autoconf and Automake, the make dist target generates a compressed tar archive (typically .tar.gz) that includes all necessary components for compilation and installation on various systems, so that recipients can build the software from the archive alone. This format has been a standard for distributing software since the late 1980s, when the GNU Project began releasing tools and utilities in tarball form to facilitate sharing and modification.

Historically, tarballs have been integral to major software releases; for instance, GNU programs, including GNU tar itself, were distributed via tar archives from their early versions in the 1980s, aligning with the project's goal of creating a free operating system. Similarly, the Linux kernel sources have been provided as tarballs since the kernel's inception in 1991, allowing developers worldwide to download, compile, and contribute to the codebase with consistent file integrity and structure.

Tar serves as a foundational component in several binary package formats. In Debian-based systems, .deb packages encapsulate their payload in a data.tar member, typically compressed with gzip, xz, or zstd (e.g., data.tar.gz, data.tar.xz, data.tar.zst), which contains the installed files and is extracted during package installation to place binaries, libraries, and resources in the appropriate system directories. For RPM-based distributions, source RPMs (.src.rpm) incorporate upstream tarballs as the primary source archive, which rpmbuild unpacks during the preparation phase to apply patches and build binaries. AppImages, a portable application format, are often constructed from extracted tar.gz bundles using tools like pkg2appimage, enabling self-contained executables that run without system-wide installation.

Best practices for creating distribution tarballs emphasize cleanliness and security to avoid including unnecessary or sensitive data. Developers routinely exclude version control directories like .git using the --exclude-vcs option in GNU tar, preventing the inclusion of repository history that could bloat the archive or expose private information. Additionally, signing tarballs with tools like GPG or minisign is recommended to verify integrity and authenticity, as outlined in guidelines for source packages, where detached signatures accompany the archive for user validation.

In modern contexts, tar extends to containerization and cloud environments; Docker container images are layered using tar archives for efficient storage and transfer, with each layer representing an immutable filesystem snapshot that can be exported or imported via docker save and docker load. Some cloud build platforms likewise use tar for packaging application artifacts during builds and deployments, streaming archives to build nodes for assembly into container images. Tarballs are often compressed with gzip or xz to reduce download sizes in these workflows.
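The practice of leaving version-control directories out of a release tarball can be sketched with the `filter` callback of `tarfile.TarFile.add`. This is an illustration in Python, not a reimplementation of GNU tar's --exclude-vcs; the set `VCS_DIRS` is a small illustrative subset, and the package name `pkg-1.0` is a placeholder.

```python
import os
import tarfile
import tempfile

# Illustrative subset only; GNU tar's --exclude-vcs covers more systems.
VCS_DIRS = {".git", ".hg", ".svn"}

def skip_vcs(info):
    """Filter for TarFile.add: returning None drops the member."""
    if VCS_DIRS.intersection(info.name.split("/")):
        return None
    return info

with tempfile.TemporaryDirectory() as work:
    # Placeholder source tree with a fake repository directory.
    src = os.path.join(work, "pkg-1.0")
    os.makedirs(os.path.join(src, ".git"))
    for rel in ("configure", os.path.join(".git", "config")):
        with open(os.path.join(src, rel), "w") as fh:
            fh.write("placeholder\n")

    dist = os.path.join(work, "pkg-1.0.tar.gz")
    with tarfile.open(dist, "w:gz") as tar:
        tar.add(src, arcname="pkg-1.0", filter=skip_vcs)

    with tarfile.open(dist, "r:gz") as tar:
        dist_names = sorted(tar.getnames())

print(dist_names)
```

When the filter rejects a directory, `add` does not descend into it, so the repository contents are never read at all.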

Limitations

Path and Filename Handling

The original tar format, derived from early Unix implementations, limits filenames to 100 bytes, including the null terminator, which restricts paths to relatively short names without support for longer hierarchies or prefixes. This constraint often leads to truncation or errors when archiving files with extended paths, as the header block allocates exactly 100 bytes for the name field. The UStar format, standardized in POSIX.1-1988, extends this capability by introducing a 155-byte prefix field for directory paths, allowing a total pathname length of up to 256 bytes when combined with the 100-byte name field and a separating slash. In this structure, the prefix holds the leading directory components, while the name field stores the remainder of the path, enabling better support for deeper directory trees without altering the core header size. The POSIX.1-2001 pax format further removes these limits by using extended header records to store arbitrary-length pathnames and filenames as key-value pairs before the file data, supporting paths of effectively unlimited size in compliant implementations.

Tar archives can pose path traversal risks during extraction if they contain absolute paths (starting with '/') or sequences like '../' that navigate outside the intended directory. By default, GNU tar strips the leading '/' from absolute paths to prevent writing relative to the filesystem root, but enabling the --absolute-names option restores them, potentially allowing overwrites in sensitive locations. Similarly, '../' sequences enable upward traversal, which may overwrite files in parent directories if extraction occurs without isolation, such as in a non-empty directory.

A tarbomb refers to a maliciously crafted tar archive designed to scatter files across the filesystem upon extraction, often using relative paths, multiple directory levels, or symlink tricks to clutter or overwrite unintended areas. These can overwhelm storage or compromise system integrity, particularly if extracted by privileged users. Mitigations include GNU tar's --no-overwrite-dir option, which preserves the metadata of existing directories, and --keep-old-files, which refuses to replace existing files. Additional safeguards involve extracting to an empty temporary directory or using tools like bsdtar with strict path normalization to block traversal attempts.

Legacy tar implementations assume ASCII encoding for filenames, limiting support to 7-bit characters and causing issues with international or extended character sets on modern systems. Contemporary tools, such as GNU tar in POSIX.1-2001 mode, accommodate UTF-8 by storing filename bytes directly from the filesystem and using extended pax headers to declare encoding metadata, ensuring compatibility with Unicode paths.

Handling special characters in tar filenames involves no inherent escaping within the archive itself, as tar preserves the raw byte sequence from the source filesystem, but command-line invocation requires shell quoting or escaping (e.g., backslashes) to pass names containing spaces, quotes, or glob characters correctly. Portability challenges arise across systems with varying character restrictions; for instance, Windows-derived tools may reject certain characters, including control characters, that Unix tolerates, while older Unix variants limit names to portable sets like alphanumerics, underscore, hyphen, and period to avoid decoding errors. To enhance cross-platform reliability, filenames should avoid non-ASCII or control bytes, in line with POSIX recommendations for the portable filename character set.
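The two traversal patterns described above (absolute paths and '../' sequences) can be screened for before extraction. The following Python sketch shows one possible check; the function name `is_safe_member_name` is hypothetical, and the check is deliberately narrow: it looks only at the member name and does not cover symbolic links, hard links, or Windows-style separators, all of which a production extractor would also need to consider.

```python
import posixpath

def is_safe_member_name(name: str) -> bool:
    """Return True if the name stays inside the extraction directory.

    Only the member name is examined; link targets are not checked.
    """
    if name.startswith("/"):
        return False  # absolute path
    # normpath collapses "a/../b" to "b" but keeps a leading ".." that
    # would climb above the extraction directory.
    parts = posixpath.normpath(name).split("/")
    return ".." not in parts


for candidate in ("docs/readme.txt", "a/../b", "/etc/passwd", "a/../../outside"):
    print(candidate, is_safe_member_name(candidate))
```

Recent Python releases also ship extraction filters in `tarfile` (for example `filter="data"`), which apply a broader set of checks than this sketch and are generally preferable where available.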

Attribute and Permission Preservation

The tar archive format stores file attributes in the header block preceding each file's data. The mode field, occupying bytes 100 through 107, is an 8-byte octal string representing the file permissions (nine bits for read, write, and execute access for owner, group, and others) along with three special bits for setuid, setgid, and sticky modes. The user ID (uid) and group ID (gid) fields, at bytes 108-115 and 116-123 respectively, are also 8-byte octal strings storing numeric identifiers. The modification time (mtime) is recorded in bytes 136-147 as a 12-byte octal string denoting seconds since the Unix epoch (January 1, 1970). These fields in the POSIX ustar format support values up to 07777777 (octal) or 2097151 (decimal) for mode, uid, and gid, and 077777777777 (octal) or 8589934591 (decimal) for mtime, limiting timestamps to second-level precision and potentially causing overflow for very large IDs or far-future dates.

During extraction, preserving these attributes presents challenges, particularly for ownership. Setting the original uid and gid requires superuser privileges on Unix-like systems, as only the superuser can assign arbitrary user and group IDs; without them, tar implementations like GNU tar default to the extracting user's uid and gid. For permissions, the mode is applied where possible, but special bits (setuid, setgid, sticky) are typically ignored or cleared unless the archive is extracted by root. Fallback mechanisms include mapping user names (uname) and group names (gname) from the header to local equivalents if they exist, prioritizing names over numeric IDs for compatibility across systems. Options like --same-owner in GNU tar attempt to restore ownership even for non-root users, but success depends on system capabilities, and files may end up owned by the extracting user if the attempt fails.

Modern tar implementations extend attribute preservation through the POSIX.1-2001 pax format and vendor-specific features. GNU tar and pax-compatible tools support extended attributes (xattrs), which store additional metadata such as access control lists (ACLs) and SELinux security labels, using dedicated options like --xattrs, --acls, and --selinux during both creation and extraction. These are archived as supplementary headers in pax format, allowing preservation of filesystem-specific attributes beyond basic POSIX modes. For timestamps, the original ustar format's 11-octal-digit mtime field provides only second precision, but GNU tar and the pax format can record fractional seconds (down to nanosecond resolution) in global or per-file extended headers, enabling sub-second accuracy on supporting filesystems like ext4.

Cross-platform extraction introduces further complications, especially between Unix-like systems and Windows. Unix permissions do not directly map to Windows ACLs, leading to mismatches where executable bits may be lost or directories become read-only; tar on Windows (via Cygwin or MSYS2) approximates Unix modes but cannot fully replicate them without additional tools. To mitigate this, the --mode option in GNU tar overrides the permissions recorded for files as they are added to an archive (e.g., --mode=0755), so that archives carry consistent modes regardless of the host OS on which they were created, though ownership and extended attributes remain Unix-centric and are often unsupported on Windows.
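The field limits and the difference in timestamp precision between the ustar and pax formats can be checked with a short Python sketch. It relies on the standard `tarfile` module's `USTAR_FORMAT` and `PAX_FORMAT` constants; the helper `roundtrip_mtime` and the member name `stamp` are hypothetical, and the behavior shown is that of Python's implementation rather than of any particular tar command.

```python
import io
import tarfile

# Largest values that fit the fixed octal fields; one byte of each field
# is taken by the terminator, leaving 7 and 11 octal digits.
MAX_8_BYTE_FIELD = 8 ** 7 - 1      # mode, uid, gid
MAX_12_BYTE_FIELD = 8 ** 11 - 1    # size, mtime

def roundtrip_mtime(mtime, fmt):
    """Write one empty member with the given mtime and read it back."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        info = tarfile.TarInfo(name="stamp")
        info.mtime = mtime
        tar.addfile(info)
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return tar.getmembers()[0].mtime

# ustar has only whole seconds; pax stores the fraction in an extended header.
ustar_mtime = roundtrip_mtime(1.5, tarfile.USTAR_FORMAT)
pax_mtime = roundtrip_mtime(1.5, tarfile.PAX_FORMAT)

print(oct(MAX_8_BYTE_FIELD), MAX_8_BYTE_FIELD)
print(oct(MAX_12_BYTE_FIELD), MAX_12_BYTE_FIELD)
print(ustar_mtime, pax_mtime)
```

The fractional part survives only in the pax archive, where it travels as a decimal `mtime` record ahead of the regular header.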

Security Risks

Tar archives pose several security risks, particularly when extracting untrusted files, as the format lacks inherent mechanisms to prevent malicious content from causing harm. One significant concern involves the potential for automatic execution of scripts contained within the archive. In certain tools and environments, such as specific package management systems or automated installers that process tar files, embedded scripts may be triggered during or immediately after extraction, enabling command injection attacks if the archive is sourced from unverified providers.

Additionally, tar does not include built-in support for digital signatures or integrity verification, making it susceptible to tampering or corruption during transmission. Users must therefore depend on external utilities, such as GnuPG (gpg), to validate the archive's authenticity and integrity before extraction; for instance, signatures are typically provided separately and verified against the archive with gpg.

Extracting tar archives with elevated privileges exacerbates these vulnerabilities, as malicious files within the archive (such as setuid binaries or configuration-altering scripts) can gain system-level access, leading to privilege escalation. Performing extractions as the root user, a common practice in system administration, can thus transform a seemingly benign archive into a vector for widespread compromise, including unauthorized modifications to critical system components. Historical exploits highlight the long-standing nature of these issues; for example, GNU tar versions prior to 1.13.25 were vulnerable to symlink attacks that enabled arbitrary file overwrites (CVE-2002-1216), allowing attackers to replace sensitive files through crafted symbolic links during extraction.

To mitigate these risks, administrators should extract untrusted tar files in isolated environments, such as using sandboxing tools like chroot or unshare namespaces, to contain potential damage. 
Implementations like bsdtar (from libarchive) offer secure flags, including --no-same-owner to disregard ownership metadata and --no-same-permissions to ignore file permissions from the archive, thereby preventing the inheritance of potentially malicious attributes. Furthermore, scanning archives with antivirus software prior to extraction and avoiding root privileges during the process are essential best practices; the GNU tar documentation explicitly advises against allowing untrusted users to access extracted files without prior inspection for issues like setuid programs.
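Inspecting an archive before extraction, as advised above, can be partly automated. The following Python sketch lists members that carry setuid or setgid bits, are links, or are device nodes; the function name `risky_members` and the sample member names are hypothetical, and the checks are examples rather than a complete audit.

```python
import io
import stat
import tarfile

def risky_members(tar):
    """Yield (name, reason) for members that deserve a closer look."""
    for m in tar.getmembers():
        if m.mode & (stat.S_ISUID | stat.S_ISGID):
            yield m.name, "setuid/setgid bit"
        if m.issym() or m.islnk():
            yield m.name, "link to " + m.linkname
        if m.ischr() or m.isblk():
            yield m.name, "device node"

# Build a small demonstration archive in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    tar.addfile(tarfile.TarInfo("notes.txt"))

    suid = tarfile.TarInfo("bin/tool")
    suid.mode = 0o4755                  # setuid bit set
    tar.addfile(suid)

    link = tarfile.TarInfo("config")
    link.type = tarfile.SYMTYPE         # symbolic link pointing outside
    link.linkname = "/etc/passwd"
    tar.addfile(link)

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    findings = list(risky_members(tar))

for name, reason in findings:
    print(name, "->", reason)
```

A listing like this does not make extraction safe by itself; it only flags entries for review before the archive is unpacked in an isolated, unprivileged environment.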

Access and Duplication Issues

Tar archives are inherently sequential in structure, consisting of a continuous stream of file headers and data without an index or central directory for quick navigation. This design necessitates scanning the archive from the beginning to list contents, extract specific files, or determine file positions, which can be inefficient for large archives or frequent operations. Unlike formats such as ZIP, which include a central directory enabling direct seeking to individual files, tar's stream-oriented approach prioritizes simplicity and compatibility with tape drives but limits performance in scenarios requiring non-linear access.

When extracting files, tar handles duplicates (files with names already present in the destination) by default overwriting them without warning, potentially leading to unintended data loss. To mitigate this, GNU tar provides options like --skip-old-files, which silently skips extraction of existing files, and --keep-old-files, which treats existing files as errors. Additionally, the --warning=existing-file option enables warnings about skipped files, aiding in monitoring potential conflicts during extraction. These behaviors allow controlled handling of duplicates but require user configuration to avoid silent alterations.

Multi-volume tar archives, used for spanning large datasets across multiple media like tapes or disks, impose limitations due to their sequential nature and lack of built-in compression support. Creation or extraction often requires manual intervention, such as prompting the user to insert the next volume when the current one fills, which can disrupt automated workflows. Some implementations, including GNU tar, offer automation via the --new-volume-script option, allowing scripted handling of volume changes, though this still demands careful setup. These constraints make multi-volume tar suitable for tape and removable-media scenarios but less ideal for seamless large-scale operations.

Scalability challenges arise when archiving large directories, particularly those with millions of small files, as tar may consume significant memory to build internal file lists or buffers during creation or extraction. For instance, processing directories with over a million files can exhaust available RAM in default configurations. The --one-file-system option helps limit the workload by restricting archiving to the source filesystem, preventing recursive traversal across mount points that could greatly increase the number of files processed and the memory required. This mitigation improves performance in multi-filesystem environments but underscores tar's limitations in handling vast, nested structures without additional tuning.
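Because tar has no central directory, tools that need repeated random access sometimes build their own index in a single sequential pass. The following Python sketch illustrates the idea for an uncompressed archive using the `offset_data` and `size` attributes of `tarfile.TarInfo`; the function names `build_index` and `read_member` are hypothetical.

```python
import io
import tarfile

def build_index(fileobj):
    """Scan the archive once and map member name -> (data offset, size)."""
    index = {}
    fileobj.seek(0)
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:  # uncompressed only
        for member in tar:
            if member.isreg():
                index[member.name] = (member.offset_data, member.size)
    return index

def read_member(fileobj, index, name):
    """Seek straight to a member's data using the prebuilt index."""
    offset, size = index[name]
    fileobj.seek(offset)
    return fileobj.read(size)

# Demonstration archive with two small members.
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for name, data in (("a.txt", b"alpha"), ("b.txt", b"bee")):
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

index = build_index(archive)
second = read_member(archive, index, "b.txt")
print(index, second)
```

The index must be rebuilt whenever the archive changes, and the approach does not carry over to compressed tarballs, where byte offsets in the compressed file do not correspond to offsets in the tar stream.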

Implementations and Conventions

Major Implementations

GNU tar is the most widely used implementation on Linux systems, providing full support for the POSIX.1-2001 archive format through options like --format=posix and extensive extensions for modern features. It integrates compression directly via command-line flags such as -z for gzip and -j for bzip2, allowing creation of compressed archives without explicit pipelines. Additionally, GNU tar supports incremental backups using snapshot files with --listed-incremental, enabling efficient updates to archives by tracking changes since the last backup.

BSD tar, often implemented via the libarchive library with bsdtar as its command-line frontend, emphasizes strict adherence to standards, including IEEE Std 1003.1-2001 for the ustar and pax interchange formats. This implementation excels in cross-platform portability, supporting extraction and creation across Unix-like systems, Windows, and macOS through libarchive's broad format compatibility, which includes tar, cpio, zip, and more. Unlike GNU tar's more permissive extensions, BSD tar prioritizes standards compliance, though it may reject non-standard GNU-specific headers.

Star, part of the Schily tools suite developed by Joerg Schilling, extends the UStar format with enhanced support for access control lists (ACLs) via the exustar format and Rock Ridge extensions for ISO 9660 CD-ROM archives, improving data integrity and filesystem attribute preservation. It focuses on high performance and robustness, particularly for media archiving, by handling extended attributes like SELinux labels and ensuring backward compatibility with POSIX while adding vendor-specific keys like SCHILY.acl for ACL storage.

Platform-specific variants include BusyBox tar, a lightweight implementation designed for embedded systems, which provides basic tar functionality in a minimal footprint under 1 MB to support resource-constrained environments like IoT devices. Python's tarfile module offers a programmatic interface for scripting tar operations, supporting reading and writing of POSIX.1-2001 compliant archives with built-in handling for gzip, bzip2, and lzma compression, making it suitable for automated tasks in cross-platform applications.

Key differences among these implementations lie in their approach to standards and extensions: GNU tar favors liberal enhancements for usability on Linux, potentially reducing portability, while BSD tar maintains conservatism for broad compatibility, and star prioritizes integrity with specialized media features. The following table summarizes format compatibility based on portability tests:
| Feature/Format | GNU tar | BSD tar (libarchive) | Star (Schily) |
|---|---|---|---|
| POSIX UStar | Full | Full | Full |
| POSIX Pax | Full | Full | Full |
| GNU Extensions | Native | Partial (reads, rejects some) | Partial |
| Star/SCHILY Keys | Partial | Reads most | Native |
| ACL Support | Via xattr | Via pax extensions | Native (exustar) |
| Rock Ridge | No | Partial (ISO read) | Full |
This table highlights that while all support core formats, proprietary extensions can cause issues, with BSD tar offering the widest read support across variants.
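
As an illustration of the scripting interface mentioned above, the following sketch uses Python's tarfile module to write a gzip-compressed archive in the pax format and read it back. The helper names make_pax_gzip_archive and read_archive are hypothetical, and the archive is built in memory only to keep the example self-contained.

```python
import io
import tarfile


def make_pax_gzip_archive(files):
    """Return the bytes of a gzip-compressed pax archive built from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz", format=tarfile.PAX_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # the header must carry the payload size
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_archive(blob):
    """Return {name: bytes} for regular files; "r:*" auto-detects compression."""
    contents = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                contents[member.name] = tar.extractfile(member).read()
    return contents
```

Choosing the pax format here means that names longer than the ustar limits are stored in extended headers rather than being rejected, at the cost of requiring a pax-aware reader.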

Compressed Archive Suffixes

Compressed tar archives use conventional filename extensions to denote the compression algorithm applied to the underlying .tar file, enabling easy identification and automated processing in tools like GNU tar. These conventions arose from common practice on Unix-like systems, where the base .tar extension signifies an uncompressed archive, while appended suffixes indicate compression for reduced storage and transmission cost. The following table summarizes the primary extensions:
| Extension | Compression Method | Notes |
|---|---|---|
| .tar | None (uncompressed) | Basic tarball format. |
| .tar.gz, .tgz | gzip | Widely used for its balance of speed and compression ratio; .tgz is a common shorthand. |
| .tar.bz2, .tbz | bzip2 | Offers better compression than gzip at the cost of slower processing. |
| .tar.xz | xz (LZMA) | Provides high compression ratios, suitable for large archives. |
| .tar.zst | zstd (Zstandard) | High compression ratios with good speed; supported natively in GNU tar and other modern tools. |
| .tar.Z | compress (legacy) | Older Unix compression method, less efficient and rarely used today. |
| .tar.lz | lzip | Employs LZMA-like compression with emphasis on data integrity and error recovery. |
GNU tar can select the appropriate compression program from the archive's filename suffix via the --auto-compress (-a) option when creating an archive; when reading or extracting, it recognizes the compression format automatically. This streamlines workflows by eliminating the need to specify compression flags explicitly, provided the suffix matches one of the recognized patterns such as those listed above.

Variations exist beyond these conventions, including .tar.lzma for direct LZMA compression, which is supported in some tools but not as universally as .tar.xz. Non-standard combinations like .tar.7z, which apply 7z compression to a tar archive, are occasionally used but lack broad tool support and are not recommended for interoperability. Platform-specific extensions, such as .taz for archives compressed with the legacy compress utility, further illustrate historical adaptations. Compressed tar files can also be generated via piping, for example by streaming tar output directly to a compressor like gzip.
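
Suffix-based selection of a compressor can be sketched as follows. This is an illustration of the idea, not GNU tar's own lookup table: the mapping SUFFIX_TO_COMPRESSION and the function write_mode_for are hypothetical names, and the mapping is limited to the suffixes from the table above that Python's tarfile module can write directly.

```python
import tarfile

# Suffix -> tarfile compression name (assumed mapping for this sketch).
SUFFIX_TO_COMPRESSION = {
    ".tar": "",
    ".tar.gz": "gz",
    ".tgz": "gz",
    ".tar.bz2": "bz2",
    ".tbz": "bz2",
    ".tar.xz": "xz",
}


def write_mode_for(filename):
    """Pick a tarfile write mode such as "w:gz" from the archive's suffix."""
    # Try longer suffixes first so the most specific match wins.
    for suffix in sorted(SUFFIX_TO_COMPRESSION, key=len, reverse=True):
        if filename.endswith(suffix):
            return "w:" + SUFFIX_TO_COMPRESSION[suffix]
    raise ValueError(f"unrecognized archive suffix: {filename!r}")
```

A caller would pass the result straight to tarfile.open, for example tarfile.open(name, write_mode_for(name)). An unknown suffix raises an error here instead of silently writing an uncompressed archive, which is a design choice of this sketch.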
