Parchive
from Wikipedia
Filename extension: .par, .par2, .p?? (.par3 future)
Type of format: Erasure code, archive file

Parchive (a portmanteau of parity archive, and formally known as Parity Volume Set Specification[1][2]) is an erasure code system that produces par files for checksum verification of data integrity, with the capability to perform data recovery operations that can repair or regenerate corrupted or missing data.

Parchive was originally written to solve the problem of reliable file sharing on Usenet,[3] but it can be used for protecting any kind of data from data corruption, disc rot, bit rot, and accidental or malicious damage. Despite the name, Parchive uses more advanced techniques (specifically error correction codes) than simplistic parity methods of error detection.

As of 2014, PAR1 is obsolete, PAR2 is mature for widespread use, and PAR3 is a discontinued experimental version developed by MultiPar author Yutaka Sawada.[4][5][6][7] The original SourceForge Parchive project has been inactive since April 30, 2015.[8] A new PAR3 specification has been in development since April 28, 2019, by PAR2 specification author Michael Nahas. An alpha version of the PAR3 specification was published on January 29, 2022,[9] while the program itself is still being developed.

History


Parchive was intended to increase the reliability of transferring files via Usenet newsgroups. Usenet was originally designed for informal conversations, and the underlying protocol, NNTP, was not designed to transmit arbitrary binary data. Another limitation, acceptable for conversations but not for files, was that messages were normally fairly short in length and limited to 7-bit ASCII text.[10]

Various techniques were devised to send files over Usenet, such as uuencoding and Base64. Later Usenet software allowed 8-bit extended ASCII, which permitted new techniques like yEnc. Large files were broken up to reduce the effect of a corrupted download, but the unreliable nature of Usenet remained.

With the introduction of Parchive, parity files could be created that were then uploaded along with the original data files. If any of the data files were damaged or lost while being propagated between Usenet servers, users could download parity files and use them to reconstruct the damaged or missing files. Parchive included the construction of small index files (*.par in version 1 and *.par2 in version 2) that do not contain any recovery data. These indexes contain file hashes that can be used to quickly identify the target files and verify their integrity.

Because the index files were so small, they minimized the amount of extra data that had to be downloaded from Usenet to verify that the data files were all present and undamaged, or to determine how many parity volumes were required to repair any damage or reconstruct any missing files. They were most useful in version 1 where the parity volumes were much larger than the short index files. These larger parity volumes contain the actual recovery data along with a duplicate copy of the information in the index files (which allows them to be used on their own to verify the integrity of the data files if there is no small index file available).
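The verification workflow these index files enable can be sketched in a few lines: hash each target file and compare against the stored digest, classifying files as intact, damaged, or missing. This is an illustrative stand-in, not the real PAR index format (which is binary); the `index` dictionary and file names here are hypothetical.

```python
import hashlib
import os

# Hypothetical index: filename -> expected MD5 digest, playing the role
# of the file hashes carried by a PAR index file.
index = {
    "part1.bin": "5d41402abc4b2a76b9719d911017c592",  # md5(b"hello")
    "part2.bin": "7d793037a0760186574b0282f2f435e7",  # md5(b"world")
}

def check_files(index):
    """Return (ok, damaged, missing) lists by hashing each target file."""
    ok, damaged, missing = [], [], []
    for name, expected in index.items():
        if not os.path.exists(name):
            missing.append(name)
            continue
        with open(name, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        (ok if digest == expected else damaged).append(name)
    return ok, damaged, missing
```

The counts of damaged and missing files then tell the user how many parity volumes (PAR1) or recovery blocks (PAR2) must be fetched before a repair can proceed.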

In July 2001, Tobias Rieper and Stefan Wehlus proposed the Parity Volume Set specification, and with the assistance of other project members, version 1.0 of the specification was published in October 2001.[11] Par1 used Reed–Solomon error correction to create new recovery files. Any of the recovery files can be used to rebuild a missing file from an incomplete download.

Version 1 became widely used on Usenet, but it did suffer some limitations:

  • It was restricted to at most 255 files.
  • The recovery files had to be the size of the largest input file, so it did not work well when the input files were of various sizes. (This limited its usefulness when not paired with the proprietary RAR compression tool.)
  • The recovery algorithm had a bug, due to a flaw[12] in the academic paper[13] on which it was based.
  • It was strongly tied to Usenet and it was felt that a more general tool might have a wider audience.

In January 2002, Howard Fukada proposed that a new Par2 specification should be devised with two significant changes: data verification and repair should work on blocks of data rather than whole files, and the algorithm should switch to 16-bit numbers rather than the 8-bit numbers that PAR1 used. Michael Nahas and Peter Clements took up these ideas in July 2002, with additional input from Paul Nettle and Ryan Gallagher (who both wrote Par1 clients). Version 2.0 of the Parchive specification was published by Michael Nahas in September 2002.[14]

Peter Clements then went on to write the first two Par2 implementations, QuickPar and par2cmdline. After par2cmdline was abandoned in 2004, Paul Houle created phpar2 to supersede it, and Yutaka Sawada created MultiPar to supersede QuickPar. MultiPar uses par2j.exe (partially based on par2cmdline's optimization techniques) as its backend engine.

Versions


Versions 1 and 2 of the file format are incompatible. (However, many clients support both.)

Par1


For Par1, given the files f1, f2, ..., fn, the Parchive consists of an index file (f.par), a checksum index with no recovery blocks, and a number of "parity volumes" (f.p01, f.p02, etc.). Given all of the original files except one (say, f2) and any one of the parity volumes, the missing f2 can be recreated. Likewise, any two missing files can be recreated from any two of the parity volumes, and so forth.[15]

Par1 supports up to a total of 256 source and recovery files.

Par2


Par2 files generally use this naming/extension system: filename.vol000+01.PAR2, filename.vol001+02.PAR2, filename.vol003+04.PAR2, filename.vol007+06.PAR2, etc. The number after the "+" in the filename indicates how many recovery blocks it contains, and the number after "vol" indicates the number of the first recovery block within the PAR2 file. If an index file of a download states that 4 blocks are missing, the easiest way to repair the files is to download filename.vol003+04.PAR2. However, due to the redundancy, filename.vol007+06.PAR2 is also acceptable. There is also an index file, filename.PAR2, which is identical in function to the small index file used in PAR1.
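The naming scheme above can be reproduced with a short sketch. The power-of-two volume sizing is a common client convention rather than a requirement of the specification, and the function name here is illustrative:

```python
def par2_volume_names(base, total_blocks):
    """Generate PAR2 recovery-file names using the common power-of-two
    sizing convention: each volume holds twice as many recovery blocks
    as the previous one, capped by the blocks still remaining."""
    names = []
    start, size = 0, 1
    while start < total_blocks:
        count = min(size, total_blocks - start)
        # "vol" gives the index of the first recovery block, "+" the count
        names.append(f"{base}.vol{start:03d}+{count:02d}.PAR2")
        start += count
        size *= 2
    return names

# par2_volume_names("filename", 13)
# -> ['filename.vol000+01.PAR2', 'filename.vol001+02.PAR2',
#     'filename.vol003+04.PAR2', 'filename.vol007+06.PAR2']
```

With 13 recovery blocks in total this yields exactly the sequence used in the example above, ending with a final volume of 6 blocks rather than 8.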

The Par2 specification supports up to 32,768 source blocks and up to 65,535 recovery blocks. Input files are split into multiple equal-sized blocks so that recovery files do not need to be the size of the largest input file.

Although Unicode is mentioned in the PAR2 specification as an option, most PAR2 implementations do not support Unicode.

Directory support is included in the PAR2 specification, but most or all implementations do not support it.

Par3


The Par3 specification was originally planned as an enhancement of the Par2 specification. However, to date,[when?] it has been kept closed source by specification owner Yutaka Sawada.

A discussion about a new format started in the GitHub issue section of the maintained par2cmdline fork on January 29, 2019. The discussion led to a new format, also named Par3. The new Par3 format's specification is published on GitHub but remains an alpha draft as of January 28, 2022. The specification is written by Michael Nahas, the author of the Par2 specification, with help from Yutaka Sawada, animetosho and malaire.

The new format claims to have multiple advantages over the Par2 format, including support for:

  • More than 2^16 files and more than 2^16 blocks.
  • Packing small files into one block, as well as deduplication when a block appears in multiple files.
  • UTF-8 file names.
  • File permissions, hard links, symbolic/soft links, and empty directories.
  • Embedding PAR data inside other formats, like ZIP archives or ISO disk images.
  • "Incremental backups", where a user creates recovery files for some file or folder, changes some data, and creates new recovery files that reuse some of the older ones.
  • More error correction code algorithms (such as LDPC and sparse random matrix).
  • BLAKE3 hashes, dropping support for the MD5 hashes used in PAR2.
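The packing and deduplication ideas in the list above can be sketched as block-level content addressing: hash each fixed-size block and store each unique block only once, so a block shared by several files (or repeated within one) is kept a single time. This sketch uses Python's stdlib blake2b as a stand-in for BLAKE3, and the function and variable names are illustrative, not part of any Par3 tool:

```python
import hashlib

def split_dedup(files, block_size):
    """Split each file's bytes into fixed-size blocks and store each
    unique block once, keyed by its hash. `files` maps name -> bytes."""
    store = {}    # block hash -> block bytes (each unique block kept once)
    layout = {}   # filename -> ordered list of block hashes
    for name, data in files.items():
        hashes = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            h = hashlib.blake2b(block, digest_size=16).hexdigest()
            store.setdefault(h, block)   # no-op if the block is already known
            hashes.append(h)
        layout[name] = hashes
    return store, layout
```

Two files made of the same two 4-byte blocks in opposite order would produce only two stored blocks, with each file's layout listing the shared hashes in its own order.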

Software


Multi-platform


Windows

  • MultiPar (freeware) — Builds upon QuickPar's features and GUI, and uses Yutaka Sawada's par2j.exe as the PAR2 backend. MultiPar supports multiple languages via Unicode; the name derives from "multi-lingual PAR client". MultiPar is also verified to work with Wine under TrueOS and Ubuntu, and may work with other operating systems too.[16][17] Although the Par2 components are (or will be) open source, the MultiPar GUI on top of them is currently not.[18]
  • QuickPar (freeware) — unmaintained since 2004, superseded by MultiPar.
  • phpar2 — advanced par2cmdline derivative with multithreading and highly optimized assembler code (about 66% faster than QuickPar 0.9.1)
  • Mirror — First PAR implementation, unmaintained since 2001.

Mac OS X


POSIX


Software for POSIX-conforming operating systems:

See also

  • Comparison of file archivers – Some file archivers can integrate parity data into their formats for error detection and correction.
  • RAID – RAID levels at and above RAID 5 make use of parity data to detect and repair errors.

References

from Grokipedia
Parchive is a redundant data format and associated set of software tools designed to detect and repair corruption or loss in sets of files, functioning similarly to RAID-style redundancy but for individual or grouped files during transmission or storage. It primarily employs Reed-Solomon error-correcting codes to generate parity volumes that allow recovery of missing or damaged data blocks without needing the original files to be retransmitted. Originally developed to address reliability issues in Usenet binary newsgroups, where posts were often corrupted due to encoding errors or incomplete downloads, Parchive has become a general-purpose tool for protecting large file transfers and backups on optical media or tapes. The format evolved from an initial version (PAR 1.0) released around 2001 by the Parchive Project, a collaborative effort to solve Usenet-specific data integrity problems. A draft of the version 2.0 (PAR 2.0) specification was published by Michael Nahas in 2002, with the final version released on May 11, 2003; it introduced improvements such as support for multiple recovery blocks and MD5 hashing for verification, enabling more efficient repair of larger datasets. This version uses the .par2 file extension and is implemented in command-line tools such as par2cmdline, an open-source tool developed by the Parchive team, with millions of downloads across platforms including Linux, Windows, and macOS. Parchive's open-source nature has led to integrations in newsreaders and archival software, and as of 2025 an alpha draft of the version 3.0 specification exists, promising enhanced capabilities such as stronger hashing algorithms, though no reference implementation has been released yet.

Introduction

Definition and Purpose

Parchive is an open-source system that applies erasure coding techniques to generate redundant parity files, allowing users to verify the integrity of data and recover from corruption or loss without needing the complete original set. Developed as a solution for multi-part archives, it creates additional files containing parity information derived from the source data, enabling reconstruction even if some segments are missing or damaged. The primary purpose of Parchive is to facilitate the reconstruction of original files using Reed-Solomon error-correcting codes, particularly in scenarios where data portions are damaged, deleted, or lost during transmission or long-term storage. This approach addresses common issues in distributed environments by providing a mechanism to repair incomplete or corrupted datasets efficiently. Key benefits of Parchive include enhanced reliability for file-sharing applications, such as posting multi-part archives on Usenet, where data loss is frequent due to network propagation and retention policies. It also supports archival backups by adding redundancy without the overhead of full file duplication, thereby optimizing storage and recovery processes. The system was initially proposed in July 2001 specifically to improve Usenet file reliability through parity-based recovery.

Core Principles

Parchive's core principles revolve around the creation of parity volumes: sets of redundant blocks derived from original input files that enable recovery of lost or damaged portions. These parity blocks are designed to reconstruct up to a user-specified number of missing blocks, providing robust error correction without requiring the original files to be altered. The format leverages Reed-Solomon codes as its foundational mechanism for generating this redundancy, allowing for systematic repair in environments prone to incomplete transfers, such as Usenet or archival storage. The operational process begins with dividing the input files into equal-sized data blocks, ensuring uniformity for computational efficiency. Recovery blocks are then calculated from these data blocks using algebraic encoding, producing additional files that encapsulate the necessary redundancy. A critical index file accompanies the parity volumes, storing metadata such as block counts, file hashes for verification, and repair instructions, which facilitates both integrity checks and automated reconstruction. This metadata-driven approach ensures that recovery can occur independently of the creation environment. Block sizes in Parchive are user-selectable as multiples of 4 bytes in the PAR 2.0 specification, balancing overhead and performance across common file systems. Recovery capacity is capped at 32,768 source blocks per set in this format, allowing for substantial protection against data loss while maintaining computational feasibility. In contrast to simple checksum mechanisms, which merely detect corruption without enabling repair, Parchive's parity volumes support full data reconstruction by solving for erased blocks using the redundant information, making it suitable for proactive data preservation.

History

Origins and Early Development

Parchive emerged in response to prevalent data corruption issues in Usenet file postings, where binary files often arrived incomplete or damaged due to transmission errors across decentralized news servers. In July 2001, Tobias Rieper proposed the initial concept for a parity-based recovery system, with Stefan Wehlus (also known as Beaker) drafting the first specification on July 12, 2001, to enable reliable reconstruction of missing or corrupted data blocks without relying on proprietary software. This effort was driven by the need for an open, accessible tool that applied RAID-like redundancy principles to Usenet's multi-part binary distributions, allowing users to verify and repair files using forward error correction techniques. The Parity Volume Set specification reached version 1.0 in October 2001, marking the first stable version of PAR, which generated parity files to protect sets of files. Initial design goals focused on simplicity and efficiency, prioritizing a lightweight system that could create recovery volumes for binary files while minimizing computational overhead and ensuring compatibility with standard posting practices, all under an open-source model free from licensing restrictions. Contributions from early project members, including Kilroy Balore and Willem Monsuwe, refined the format to handle common scenarios effectively. Following its release, PAR 1.0 saw rapid adoption within Usenet communities, particularly for safeguarding the RAR and ZIP archives that were staples of binary file sharing, as users integrated parity files into postings to mitigate losses from incomplete downloads or server expirations. This uptake addressed a critical pain point in file distribution, fostering greater reliability for large-scale exchanges. The format's evolution to PAR 2.0 later expanded these capabilities to support more flexible recovery options.

Key Milestones and Contributors

Following the initial proposal of the Parity Volume Set specification in 2001, key developments in Parchive centered on enhancing the format's capabilities to address limitations in handling larger files and improving repair efficiency. Stefan Wehlus and Tobias Rieper played pivotal roles in drafting the initial PAR 1.0 specification, with additional contributions from Roger Harrison (Kilroy Balore), Willem Monsuwe, Karl Vogel, and Ryan Gallagher. A major milestone came with the PAR 2.0 specification, initially drafted by Michael Nahas in July 2002 and finalized on May 11, 2003, in collaboration with Peter Clements, Paul Nettle, and Ryan Gallagher. This version significantly improved scalability by supporting up to 32,768 data blocks, compared to PAR 1.0's limit of 255, and enhanced efficiency through optimized Reed-Solomon coding adapted from established implementations. PAR 2.0 quickly became the de facto standard for Usenet file verification and repair, superseding PAR 1.0 due to the latter's constraints on file sizes and block counts. Efforts to develop PAR 3.0 began with a draft proposal in 2010 by Yutaka Sawada, which was tested in the MultiPar client but never officially released. Michael Nahas revived the initiative in 2019, leading to an alpha draft specification on January 28, 2022, incorporating ideas from Sawada, animetosho, and malaire to add features such as deduplication and 64-bit file sizes; however, as of 2025, no completed PAR 3.0 release has materialized. The Parchive project, originally hosted on SourceForge, migrated to GitHub around 2019 to facilitate ongoing maintenance and community contributions, with the original site becoming inactive thereafter. Michael Nahas has remained a central figure, authoring the PAR 2.0 specification and leading PAR 3.0 drafts, while Clements contributed early PAR 2.0 implementations. As of 2025, PAR 2.0 continues to serve as the widely adopted standard.

File Format Versions

PAR 1.0

PAR 1.0, the initial version of the Parchive file format, utilizes specific file extensions to organize its index and parity data. The primary index file uses the .par extension, while individual parity volumes use extensions such as .p01, .p02, and so on, up to .p99, followed by .q00 if more are needed. The creation process in PAR 1.0 generates parity data for an entire set of input files by computing Reed-Solomon parity symbols byte-by-byte across all files, padding shorter files with zeros to match the length of the largest file. This results in parity volumes whose size equals that of the largest input file, with the index file containing metadata such as filenames, sizes, MD5 hashes for verification, and a set hash. However, this approach lacks per-file metadata granularity, treating the file set as a unified stream without individualized recovery details for separate files. PAR 1.0 imposes several key limitations stemming from its use of an 8-bit Reed-Solomon code over GF(256). It supports fewer than 256 total files and parity volumes combined, allowing only up to 255 recovery blocks. Additionally, while no explicit block size limit is documented, the byte-level processing effectively constrains handling to sets where the largest input file remains manageable, as parity computation scales directly with its size. PAR 1.0 is considered obsolete primarily because of its inadequacy for large modern files, where parity volumes become impractically large, and its inefficiency for multi-file sets, which suffer from excessive padding and limited scalability. These shortcomings were addressed in subsequent versions like PAR 2.0, which introduced block-based processing for better efficiency.
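The byte-by-byte, zero-padded computation described above is easiest to see in the special case of a single recovery volume, where Reed-Solomon parity over GF(256) reduces to plain byte-wise XOR. The following is a minimal sketch of that special case, not the actual PAR1 encoder (which supports multiple volumes with distinct coefficients):

```python
def par1_style_parity(files):
    """Compute one parity volume over a list of byte strings, PAR1-style:
    every file is conceptually zero-padded to the length of the largest,
    and parity is taken byte-by-byte. The resulting volume is therefore
    as large as the largest input file."""
    size = max(len(d) for d in files)
    parity = bytearray(size)
    for data in files:
        for i, b in enumerate(data):   # bytes past len(data) are zero
            parity[i] ^= b
    return bytes(parity)

def recover_missing(remaining, parity):
    """Rebuild the single missing file by XORing the parity volume with
    all surviving files (the result keeps the zero padding)."""
    out = bytearray(parity)
    for data in remaining:
        for i, b in enumerate(data):
            out[i] ^= b
    return bytes(out)
```

Note that a recovered file that was shorter than the largest input comes back with trailing zero padding; the real format uses the stored file lengths to trim it.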

PAR 2.0

PAR 2.0 represents a significant advancement in the Parchive format, introducing enhanced scalability and efficiency for data recovery in large archives. The format uses the file extension .par2 for the primary index file, which contains essential metadata such as file descriptions and checksums, while recovery volumes are denoted by extensions like .volXX-YY.par2 (e.g., .vol01-04.par2), allowing for segmented storage of parity data. This structure supports multi-volume sets, where recovery data can be distributed across multiple files to manage large datasets without exceeding imposed file-size limits. Key enhancements in PAR 2.0 include support for up to 32,768 source blocks (and up to 65,535 recovery blocks), enabling the protection of extensive file sets that PAR 1.0 could not handle due to its 8-bit limitations. Block sizes are variable and must be multiples of 4 bytes, chosen based on the archive's needs. Additionally, per-file hashing provides robust verification: a 16-byte File ID is computed from the file name, length, and the MD5 hash of the first 16 kB of content, ensuring precise identification and repair of individual files. These features allow partial recovery, where the format can reconstruct data as long as the number of available recovery blocks equals or exceeds the number of lost or damaged source blocks. For file identification, PAR 2.0 employs a magic number of 50 41 52 32 00 50 4B 54 in hexadecimal (equivalent to "PAR2\0PKT" in ASCII), which appears at the start of each packet header to confirm the version and facilitate parsing. As of 2025, PAR 2.0 remains actively used, particularly in Usenet communities for archiving and distributing large binary files, where its mature implementations continue to provide reliable error correction without the experimental changes proposed in PAR 3.0 drafts.
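A minimal scanner for this packet structure might look as follows. The 64-byte header layout (8-byte magic, 8-byte little-endian packet length covering the whole packet, 16-byte packet MD5, 16-byte recovery-set ID, 16-byte type) follows the published PAR 2.0 specification, while the function itself is an illustrative sketch:

```python
import struct

MAGIC = b"PAR2\x00PKT"   # 50 41 52 32 00 50 4B 54 in hexadecimal

def scan_packets(data):
    """Walk the packets of a PAR2 file held in memory, yielding
    (offset, packet_length, packet_type) for each header found."""
    pos = 0
    while pos + 64 <= len(data):
        if data[pos:pos + 8] != MAGIC:
            pos += 1                 # resync byte-by-byte after corruption
            continue
        (length,) = struct.unpack_from("<Q", data, pos + 8)
        ptype = data[pos + 48:pos + 64].rstrip(b"\x00")
        yield pos, length, ptype
        pos += length                # length includes the 64-byte header
```

Resyncing on the magic bytes rather than trusting every length field is what lets clients skip over damaged regions of a PAR2 file and still find the intact packets behind them.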

PAR 3.0

PAR 3.0 is an experimental draft specification aimed at addressing limitations in earlier Parchive versions, particularly for contemporary data storage needs. Proposed in its current form as an alpha draft released on January 28, 2022, by Michael Nahas, the original author of the PAR 2.0 specification, it incorporates ideas from contributors including Yutaka Sawada, animetosho, and malaire. This draft builds on an earlier, abandoned proposal from 2010 developed by Yutaka Sawada and others, which included a test implementation but was discontinued due to lack of progress. An effort to revive development began in 2019 through discussions on the par2cmdline GitHub repository, leading to the 2022 draft, which updates the error-correction machinery for greater flexibility, such as support for any Galois field and multiple linear codes beyond traditional Reed-Solomon. Key proposed extensions in the PAR 3.0 draft include a new .par3 extension for generated recovery files and support for datasets exceeding the block-count limitations of PAR 2.0 (which caps source blocks at 32,768, restricting the protected data to around 64 MB with a 2 KB block size). It introduces 64-bit integers for block sizes and indices, allowing individual files up to 2^64 - 1 bytes and accommodating future storage expansions without inherent size constraints. Additional enhancements encompass deduplication to reuse identical blocks across files, reducing storage requirements; incremental backup capabilities that leverage prior PAR 3.0 recovery data; and "Par inside" embedding to protect compressed archives such as ZIP files by integrating recovery packets directly. While the format does not include built-in compression, it recommends external tools such as gzip with rsyncable options for better integration, and its matrix-based recovery design supports optimizations for multi-threaded repair, including parallel matrix inversion.
Despite these advancements, the PAR 3.0 draft faces significant challenges as of 2025, including the absence of a complete reference implementation: the par3cmdline project remains in development without stable releases or widespread testing. Compatibility issues persist, as existing PAR 2.0 tools cannot process PAR 3.0 files, and the draft prioritizes new features over full backward compatibility with PAR 2.0 structures, potentially requiring separate clients for legacy support. The specification briefly notes mechanisms for reading PAR 2.0 files in new clients, but this is not a core focus. Potential benefits of PAR 3.0 include enhanced efficiency for modern applications such as backups and large-scale data hoarding, where deduplication and larger file support could minimize storage overhead and improve recovery speeds on multi-core systems. However, with no finalized standard and no broad adoption (evidenced by ongoing open issues in the par3cmdline repository and scant mentions among preservation formats), the format remains experimental and unused in production environments.

Technical Specifications

Reed-Solomon Error Correction

Reed-Solomon codes are a class of block-based error-correcting codes that operate over finite fields, known as Galois fields, to enable the detection and correction of errors in data transmission or storage. In Parchive, these codes are implemented over the finite field GF(2^16), which consists of 65,536 elements represented as 16-bit words, allowing efficient computation using bitwise operations. This field choice aligns word-sized symbols with typical file block sizes. The fundamental structure of a Reed-Solomon code is denoted RS(n, k), where n is the total number of symbols in a codeword (data plus parity) and k is the number of data symbols. The code generates m = n - k parity symbols, enabling correction of up to ⌊m/2⌋ symbol errors in the general case, or up to m erasures (errors with known positions) when the positions are identifiable. In mathematical terms, the error-correcting capability t satisfies 2t ≤ m for errors but extends to t ≤ m for erasures, since the known locations reduce the information needed for recovery. This property derives from the code's minimum distance d = n - k + 1, which bounds the number of correctable errors or erasures. In Parchive, particularly in the PAR2 format, Reed-Solomon codes are applied to recover lost data blocks treated as erasures: m recovery blocks permit the reconstruction of up to m missing data blocks, provided the total number of blocks n = k + m does not exceed 65,536. This ensures that as long as the number of lost blocks does not exceed the available recovery blocks, full data restoration is possible through systematic encoding that preserves the original data while appending parity information. The encoding process in Parchive computes parity blocks as linear combinations of the data blocks over GF(2^16).
Each recovery block is formed by multiplying data block symbols by predefined coefficients (typically powers of a primitive element of the field) and summing them in the field. Formally, if the data blocks are vectors d_1, d_2, ..., d_k in GF(2^16)^s (where s is the block length in 16-bit symbols), the i-th parity block p_i is given by

p_i = Σ_{j=1}^{k} α^{(i-1)(j-1)} d_j,

where α is a primitive element of the field. This Vandermonde-based construction ensures the relevant coefficient matrices are invertible, allowing recovery. Decoding leverages the erasure-aware nature of lost blocks by first computing syndromes to quantify discrepancies. The syndrome components are

s_i = Σ_{j=1}^{n} r_j α^{ij},

where r is the received (possibly incomplete) codeword. For erasures, the known positions simplify the process: the Berlekamp-Massey algorithm derives the error locator polynomial Λ(x), whose roots identify the error (or erasure) positions. Error magnitudes are then solved with the Forney formula, yielding

e_l = -Ω(β_l^{-1}) / Λ'(β_l^{-1}),

where Ω(x) is the error evaluator polynomial and β_l is the field element locating position l.
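The encoding and erasure recovery just described can be demonstrated end to end. This sketch works over GF(256) instead of GF(2^16) to keep the tables small, uses Vandermonde coefficients α^(i·j), and recovers erasures by Gaussian elimination on the coefficient submatrix rather than via Berlekamp-Massey; it illustrates the algebra, not PAR2's exact coefficient layout:

```python
# GF(256) log/antilog tables for the primitive polynomial x^8+x^4+x^3+x^2+1.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def ginv(a):
    return EXP[(255 - LOG[a]) % 255]

def coeff(i, j):
    """Vandermonde coefficient alpha^(i*j): parity row i, data column j."""
    return EXP[(i * j) % 255]

def encode(blocks, m):
    """Compute m parity blocks from k equal-length data blocks."""
    k, size = len(blocks), len(blocks[0])
    parity = [bytearray(size) for _ in range(m)]
    for i in range(m):
        for j in range(k):
            c = coeff(i, j)
            for s in range(size):
                parity[i][s] ^= gmul(c, blocks[j][s])
    return [bytes(p) for p in parity]

def solve(matrix, rhs):
    """Gauss-Jordan elimination over GF(256); rhs rows are bytearrays."""
    n = len(matrix)
    for col in range(n):
        pivot = next(r for r in range(col, n) if matrix[r][col])
        matrix[col], matrix[pivot] = matrix[pivot], matrix[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        inv = ginv(matrix[col][col])
        matrix[col] = [gmul(inv, v) for v in matrix[col]]
        rhs[col] = bytearray(gmul(inv, v) for v in rhs[col])
        for r in range(n):
            if r != col and matrix[r][col]:
                f = matrix[r][col]
                matrix[r] = [a ^ gmul(f, b) for a, b in zip(matrix[r], matrix[col])]
                rhs[r] = bytearray(a ^ gmul(f, b) for a, b in zip(rhs[r], rhs[col]))
    return rhs

def recover(present, parity, erased):
    """Rebuild erased data blocks. `present` maps index -> surviving block;
    one parity block is consumed per missing block."""
    size = len(parity[0])
    rows, rhs = [], []
    for i in range(len(erased)):           # use parity rows 0..t-1
        rows.append([coeff(i, j) for j in erased])
        acc = bytearray(parity[i])
        for j, block in present.items():   # move known terms to the rhs
            c = coeff(i, j)
            for s in range(size):
                acc[s] ^= gmul(c, block[s])
        rhs.append(acc)
    solved = solve(rows, rhs)
    return {e: bytes(b) for e, b in zip(erased, solved)}
```

Because coeff(0, j) = 1 for every j, the first parity block is simply the XOR of all data blocks, mirroring the single-volume special case; further rows add the independent equations needed to solve for multiple erasures.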