Parchive
| Parchive | |
|---|---|
| Filename extension | .par, .par2, .p??, (.par3 future) |
| Type of format | Erasure code, archive file |
Parchive (a portmanteau of parity archive, and formally known as Parity Volume Set Specification[1][2]) is an erasure code system that produces par files for checksum verification of data integrity, with the capability to perform data recovery operations that can repair or regenerate corrupted or missing data.
Parchive was originally written to solve the problem of reliable file sharing on Usenet,[3] but it can be used for protecting any kind of data from data corruption, disc rot, bit rot, and accidental or malicious damage. Despite the name, Parchive uses more advanced techniques (specifically error correction codes) than simplistic parity methods of error detection.
As of 2014, PAR1 is obsolete, PAR2 is mature and in widespread use, and PAR3 is a discontinued experimental version developed by MultiPar author Yutaka Sawada.[4][5][6][7] The original SourceForge Parchive project has been inactive since April 30, 2015.[8] A new PAR3 specification has been under development since April 28, 2019 by PAR2 specification author Michael Nahas. An alpha version of the PAR3 specification was published on January 28, 2022,[9] while the program itself is still being developed.
History
Parchive was intended to increase the reliability of transferring files via Usenet newsgroups. Usenet was originally designed for informal conversations, and the underlying protocol, NNTP, was not designed to transmit arbitrary binary data. Another limitation, acceptable for conversations but not for files, was that messages were normally fairly short and limited to 7-bit ASCII text.[10]
Various techniques were devised to send files over Usenet, such as uuencoding and Base64. Later Usenet software allowed 8-bit extended ASCII, which permitted new techniques like yEnc. Large files were broken up to reduce the impact of a corrupted download, but the unreliable nature of Usenet remained.
With the introduction of Parchive, parity files could be created that were then uploaded along with the original data files. If any of the data files were damaged or lost while being propagated between Usenet servers, users could download parity files and use them to reconstruct the damaged or missing files. Parchive included the construction of small index files (*.par in version 1 and *.par2 in version 2) that do not contain any recovery data. These indexes contain file hashes that can be used to quickly identify the target files and verify their integrity.
Because the index files were so small, they minimized the amount of extra data that had to be downloaded from Usenet to verify that the data files were all present and undamaged, or to determine how many parity volumes were required to repair any damage or reconstruct any missing files. They were most useful in version 1 where the parity volumes were much larger than the short index files. These larger parity volumes contain the actual recovery data along with a duplicate copy of the information in the index files (which allows them to be used on their own to verify the integrity of the data files if there is no small index file available).
In July 2001, Tobias Rieper and Stefan Wehlus proposed the Parity Volume Set specification, and with the assistance of other project members, version 1.0 of the specification was published in October 2001.[11] Par1 used Reed–Solomon error correction to create new recovery files. Any of the recovery files can be used to rebuild a missing file from an incomplete download.
Version 1 became widely used on Usenet, but it did suffer some limitations:
- It was restricted to at most 255 files.
- The recovery files had to be the size of the largest input file, so it did not work well when the input files were of various sizes. (This limited its usefulness when not paired with the proprietary RAR compression tool.)
- The recovery algorithm had a bug, due to a flaw[12] in the academic paper[13] on which it was based.
- It was strongly tied to Usenet and it was felt that a more general tool might have a wider audience.
In January 2002, Howard Fukada proposed that a new Par2 specification should be devised with the significant changes that data verification and repair should work on blocks of data rather than whole files, and that the algorithm should switch to using 16 bit numbers rather than the 8 bit numbers that PAR1 used. Michael Nahas and Peter Clements took up these ideas in July 2002, with additional input from Paul Nettle and Ryan Gallagher (who both wrote Par1 clients). Version 2.0 of the Parchive specification was published by Michael Nahas in September 2002.[14]
Peter Clements then went on to write the first two Par2 implementations, QuickPar and par2cmdline. After par2cmdline was abandoned in 2004, Paul Houle created phpar2 to supersede it. Yutaka Sawada created MultiPar to supersede QuickPar. MultiPar uses par2j.exe (which is partially based on par2cmdline's optimization techniques) as its backend engine.
Versions
Versions 1 and 2 of the file format are incompatible, although many clients support both.
Par1
For Par1, given files f1, f2, ..., fn, the Parchive consists of an index file (f.par), a CRC-type file with no recovery blocks, and a number of "parity volumes" (f.p01, f.p02, etc.). Given all of the original files except one (for example, f2), it is possible to recreate the missing f2 from the other original files and any one of the parity volumes. Likewise, any two missing files can be recreated from any two of the parity volumes, and so forth.[15]
Par1 supports up to a total of 256 source and recovery files.
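This "any k of n volumes suffice" property comes from Reed–Solomon coding. The sketch below illustrates the idea with a toy erasure code over the prime field GF(257) for readability (Par1 actually works over an 8-bit Galois field, and Par2 over a 16-bit one): parity symbols are Vandermonde-style linear combinations of the data, and erased symbols are recovered by solving the resulting linear system. All function names and values are illustrative, not part of any Parchive specification.

```python
P = 257  # small prime field for clarity; real PAR formats use GF(2^8)/GF(2^16)

def encode(data: list[int], m: int) -> list[int]:
    """Compute m parity symbols p_i = sum_j data[j] * alpha^(i*j) mod P,
    a Vandermonde construction with alpha = 3 (a primitive root mod 257)."""
    a = 3
    return [sum(d * pow(a, i * j, P) for j, d in enumerate(data)) % P
            for i in range(m)]

def recover(known: dict[int, int], parity: list[int], k: int) -> list[int]:
    """Solve for erased data symbols. Each parity equation is linear in the
    unknowns once the known symbols are moved to the right-hand side.
    Sketch assumption: the first len(missing) parity symbols survived."""
    a = 3
    missing = [j for j in range(k) if j not in known]
    rows, rhs = [], []
    for i in range(len(missing)):
        rows.append([pow(a, i * j, P) for j in missing])
        rhs.append((parity[i] - sum(v * pow(a, i * j, P)
                                    for j, v in known.items())) % P)
    # Gaussian elimination mod P
    n = len(missing)
    for col in range(n):
        piv = next(r for r in range(col, n) if rows[r][col])
        rows[col], rows[piv] = rows[piv], rows[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        inv = pow(rows[col][col], -1, P)
        rows[col] = [x * inv % P for x in rows[col]]
        rhs[col] = rhs[col] * inv % P
        for r in range(n):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(x - f * y) % P for x, y in zip(rows[r], rows[col])]
                rhs[r] = (rhs[r] - f * rhs[col]) % P
    solved = dict(zip(missing, rhs))
    return [known.get(j, solved.get(j)) for j in range(k)]

data = [10, 20, 30, 40]
parity = encode(data, 2)                  # two recovery symbols
got = recover({0: 10, 3: 40}, parity, 4)  # symbols 1 and 2 erased
assert got == data
```

Because the Vandermonde submatrix for any set of erased positions is invertible, any m parity symbols can repair any m erasures, which is exactly the "any one volume repairs any one file" behavior described above.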
Par2
Par2 files generally use this naming/extension system: filename.vol000+01.PAR2, filename.vol001+02.PAR2, filename.vol003+04.PAR2, filename.vol007+06.PAR2, etc. The number after "+" indicates how many recovery blocks the file contains, and the number after "vol" indicates the index of its first recovery block. If verification against the index file shows that 4 blocks are missing, the easiest repair is to download filename.vol003+04.PAR2; because of the redundancy, filename.vol007+06.PAR2 is also acceptable. There is also an index file, filename.PAR2, which is identical in function to the small index file used in PAR1.
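Under this exponential layout, a downloader just needs any combination of recovery files whose "+NN" block counts sum to at least the number of missing blocks. A minimal selection helper sketching that choice (volumes_needed is a hypothetical name, and the 1, 2, 4, 6 sizes mirror the example filenames above):

```python
def volumes_needed(missing: int, sizes: list[int]) -> list[int]:
    """Pick recovery volumes (identified by their '+NN' block counts)
    supplying at least `missing` recovery blocks. Prefer the smallest
    single volume that covers the loss; otherwise accumulate from the
    largest downward. Surplus blocks are simply left unused."""
    covering = [s for s in sizes if s >= missing]
    if covering:
        return [min(covering)]
    chosen, total = [], 0
    for s in sorted(sizes, reverse=True):
        chosen.append(s)
        total += s
        if total >= missing:
            return chosen
    raise ValueError("not enough recovery blocks available")

sizes = [1, 2, 4, 6]   # e.g. vol000+01, vol001+02, vol003+04, vol007+06
assert volumes_needed(4, sizes) == [4]           # vol003+04 alone suffices
assert volumes_needed(13, sizes) == [6, 4, 2, 1] # everything is needed
```

Real clients do the equivalent bookkeeping automatically: they count damaged blocks, then ask for any set of PAR2 files whose recovery-block totals cover the damage.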
The Par2 specification supports up to 32,768 source blocks and up to 65,535 recovery blocks. Input files are split into multiple equal-sized blocks so that recovery files need not be the size of the largest input file.
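The block-splitting scheme can be sketched as follows; split_into_blocks is a hypothetical helper that mimics how a PAR2 index records per-slice hashes (real index files store an MD5 and CRC-32 per slice in binary packets, not a Python dict):

```python
import hashlib

def split_into_blocks(files: dict[str, bytes], block_size: int) -> dict:
    """Split each input file into equal-sized blocks (the final block of
    each file is zero-padded) and record per-block MD5 hashes, roughly
    as a Parchive index does."""
    assert block_size % 4 == 0   # PAR 2.0 slice sizes are multiples of 4
    index = {}
    for name, data in files.items():
        blocks = [data[i:i + block_size].ljust(block_size, b"\x00")
                  for i in range(0, len(data), block_size)]
        index[name] = {
            "size": len(data),
            "block_hashes": [hashlib.md5(b).hexdigest() for b in blocks],
        }
    return index

idx = split_into_blocks({"a.bin": b"x" * 10, "b.bin": b"y" * 3}, block_size=4)
assert len(idx["a.bin"]["block_hashes"]) == 3   # 10 bytes -> 3 four-byte slices
assert idx["b.bin"]["size"] == 3                # padding never changes the size
```

Because blocks, not whole files, are the unit of recovery, a small file occupies few blocks and a large file many, so the recovery data scales with total damage rather than with the largest file.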
Although Unicode is mentioned in the PAR2 specification as an option, most PAR2 implementations do not support Unicode.
Directory support is included in the PAR2 specification, but most or all implementations do not support it.
Par3
The Par3 specification was originally planned to be published as an enhancement over the Par2 specification. However, to date it has remained unpublished, kept private by specification owner Yutaka Sawada.
A discussion about a new format started in the GitHub issue tracker of the maintained par2cmdline fork on January 29, 2019. The discussion led to a new format, also named Par3. The new Par3 format's specification is published on GitHub but remains an alpha draft as of January 28, 2022. The specification is written by Michael Nahas, the author of the Par2 specification, with help from Yutaka Sawada, animetosho and malaire.
The new format claims to have multiple advantages over the Par2 format, including support for:
- More than 2^16 files and more than 2^16 blocks.
- Packing small files into one block, as well as deduplication when a block appears in multiple files.
- UTF-8 file names.
- File permissions, hard links, symbolic/soft links, and empty directories.
- Embedding PAR data inside other formats, like ZIP archives or ISO disk images.
- "Incremental backups", where a user creates recovery files for some file or folder, changes some data, and creates new recovery files that reuse some of the older ones.
- More error correction code algorithms (such as LDPC and sparse random matrix).
- BLAKE3 hashes, dropping support for the MD5 hashes used in PAR2.
Software
[edit]Multi-platform
- par2+tbb (GPLv2) — a concurrent (multithreaded) version of par2cmdline 0.4 using TBB. Only compatible with x86-based CPUs. It is available in the FreeBSD Ports system as par2cmdline-tbb.
- Original par2cmdline (obsolete) — available in the FreeBSD Ports system as par2cmdline.
- par2cmdline maintained fork by BlackIkeEagle.
- par2cmdline-mt is another multithreaded version of par2cmdline, using OpenMP (GPLv2 or later). It has been merged into BlackIkeEagle's fork and is maintained there.
- ParPar (CC0) is a high-performance, multithreaded PAR2 client and Node.js library. It does not support verification or repair; it can currently only create PAR2 archives.
- par2deep (LGPL-3.0) — produces, verifies and repairs par2 files recursively, both on the command line and via a graphical user interface. It is available in the Python Package Index as par2deep.
Windows
- MultiPar (freeware) — builds upon QuickPar's features and GUI, and uses Yutaka Sawada's par2j.exe as the PAR2 backend. MultiPar supports multiple languages via Unicode; the name derives from "multi-lingual PAR client". MultiPar is also verified to work with Wine under TrueOS and Ubuntu, and may work with other operating systems too.[16][17] Although the Par2 components are (or will be) open source, the MultiPar GUI on top of them is currently not.[18]
- QuickPar (freeware) — unmaintained since 2004, superseded by MultiPar.
- phpar2 — an enhanced par2cmdline with multithreading and highly optimized assembler code (about 66% faster than QuickPar 0.9.1).
- Mirror — First PAR implementation, unmaintained since 2001.
Mac OS X
POSIX
Software for POSIX-conforming operating systems:
- Par2 for KDE 4
- PyPar2 1.4, a frontend for par2.
- GPar2 2.03
See also
- Comparison of file archivers – Some file archivers are capable of integrating parity data into their formats for error detection and correction:
- RAID – RAID levels at and above RAID 5 make use of parity data to detect and repair errors.
References
- ^ Re: Correction to Parchive on Wikipedia, reply #3, by Yutaka Sawada: "Their formal title are "Parity Volume Set Specification 1.0" and "Parity Volume Set Specification 2.0."
- ^ Re: Correction to Parchive on Wikipedia, reply #3, by Yutaka Sawada: "Their formal title are "Parity Volume Set Specification 1.0" and "Parity Volume Set Specification 2.0."
- ^ "Parchive: Parity Archive Volume Set". Retrieved 2009-10-29.
The original idea behind this project was to provide a tool to apply the data-recovery capability concepts of RAID-like systems to the posting and recovery of multi-part archives on Usenet.
- ^ "possibility of new PAR3 file". Archived from the original on 2012-07-07. Retrieved 2012-07-01.
- ^ "Question about your usage of PAR3". Archived from the original on 2014-03-09. Retrieved 2012-07-01.
- ^ "Risk of undetectable intended modification". Archived from the original on 2014-03-09. Retrieved 2012-07-01.
- ^ "PAR3 specification proposal not finished as of April 2011". Archived from the original on 2014-03-09. Retrieved 2012-07-01.
- ^ "Parchive: Parity Archive Tool". 30 April 2015. Retrieved 2020-05-20.
- ^ "Parity Volume Set Specification 3.0 [2022-01-28 ALPHA DRAFT]". Michael Nahas, Yutaka-Sawada, animetosho, and malaire.
- ^ Kantor, Brian; Lapsley, Phil (February 1986). "Character Codes". Network News Transfer Protocol. IETF. p. 5. sec. 2.2. doi:10.17487/RFC0977. RFC 977. Retrieved 2009-10-29.
- ^ Nahas, Michael (2001-10-14). "Parity Volume Set Specification v1.0". Retrieved 2017-06-19.
- ^ Plank, James S.; Ding, Ying (April 2003). "Note: Correction to the 1997 Tutorial on Reed-Solomon Coding". Retrieved 2009-10-29.
- ^ Plank, James S. (September 1997). "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems". Retrieved 2009-10-29.
- ^ Nahas, Michael; Clements, Peter; Nettle, Paul; Gallagher, Ryan (2003-05-11). "Parity Volume Set Specification 2.0". Retrieved 2009-10-29.
- ^ Wang, Wallace (2004-10-25). "Finding movies (or TV shows): Recovering missing RAR files with PAR and PAR2 files". Steal this File Sharing Book (1st ed.). San Francisco, California: No Starch Press. pp. 164–167. ISBN 978-1-59327-050-6. Retrieved 2009-09-24.
- ^ "MultiPar works with PCBSD 9.0". Archived from the original on 2013-09-28. Retrieved 2012-02-27.
- ^ Working on Ubuntu 18.04 via wine[dead link]
- ^ "contacted you, asking about sourcecode". Archived from the original on 2013-09-26. Retrieved 2013-09-21.
External links
- Parity Volume Set Specification 2.0 (2003)
- Parchive project - full specifications and math behind it
- Introduction to PAR and PAR2
- Slyck's Guide To The Usenet Newsgroups: PAR & PAR2 Files Archived 2009-10-05 at the Wayback Machine
- Guide to repair files using PAR2
- UsenetReviewz's guide to opening par files
Introduction
Definition and Purpose
Parchive is an open-source system that applies erasure coding techniques to generate redundant parity files, allowing users to verify the integrity of data and recover from corruption or loss without needing the complete original set.[5] Developed as a solution for multi-part archives, it creates additional files containing parity information derived from the source data, enabling partial reconstruction even if some segments are missing or damaged.[2] The primary purpose of Parchive is to facilitate the reconstruction of original files using Reed-Solomon error-correcting codes, particularly in scenarios where data portions are damaged, deleted, or lost during transmission or long-term storage.[6] This approach addresses common issues in distributed file sharing environments by providing a mechanism to repair incomplete or corrupted datasets efficiently.[5]

Key benefits of Parchive include enhanced reliability for file sharing applications, such as posting multi-part archives on Usenet, where data loss is frequent due to network propagation and retention policies.[2] It also supports archival backups by adding redundancy without the overhead of full file duplication, thereby optimizing storage and recovery processes.[5] The system was initially proposed in July 2001 specifically to improve Usenet file reliability through parity-based recovery.[2]

Core Principles
Parchive's core principles revolve around the creation of parity volumes, which serve as sets of redundant data blocks derived from original input files to enable recovery of lost or damaged portions. These parity blocks are designed to reconstruct up to a user-specified number of missing data blocks, providing robust error correction without requiring the original files to be altered.[6] The system leverages Reed-Solomon codes as its foundational mechanism for generating this redundancy, allowing for systematic data repair in environments prone to incomplete transfers, such as Usenet or archival storage.[6]

The operational process begins with dividing the input files into equal-sized data blocks, ensuring uniformity for computational efficiency. Recovery blocks are then calculated from these data blocks using algebraic encoding, producing additional files that encapsulate the necessary redundancy. A critical index file accompanies the parity volumes, storing metadata such as block counts, file hashes for verification, and repair instructions, which facilitates both integrity checks and automated reconstruction.[6] This metadata-driven approach ensures that recovery can occur independently of the creation environment.

Block sizes in Parchive are user-selectable as multiples of 4 bytes in the PAR 2.0 specification to balance overhead and performance across common file systems. Recovery capacity is capped at 32,768 source blocks per set in this format, allowing for substantial protection against data loss while maintaining computational feasibility.[6] In contrast to simple checksum mechanisms, which merely detect corruption without enabling repair, Parchive's parity volumes support full data reconstruction by solving for erased blocks from the redundant information, making it suitable for proactive data preservation.[6]

History
Origins and Early Development
Parchive emerged in response to prevalent data corruption issues in Usenet file postings, where binary files often arrived incomplete or damaged due to transmission errors across decentralized news servers. In July 2001, Tobias Rieper proposed the initial concept for a parity-based recovery system, with Stefan Wehlus (also known as Beaker) drafting the first specification on July 12, 2001, to enable reliable reconstruction of missing or corrupted data blocks without relying on proprietary software.[2][7] This effort was driven by the need for an open, accessible tool that applied RAID-like redundancy principles to Usenet's multi-part binary distributions, allowing users to verify and repair files using forward error correction techniques.[2]

The Parity Volume Set specification reached version 1.0 on October 14, 2001, marking the first stable implementation of PAR, which generated parity files to protect sets of data volumes.[2][7] Initial design goals focused on simplicity and efficiency, prioritizing a lightweight system that could create recovery volumes for binary files while minimizing computational overhead and ensuring compatibility with standard Usenet posting practices, all under an open-source model free from licensing restrictions.[2] Contributions from early project members, including Kilroy Balore and Willem Monsuwe, refined the format to handle common Usenet scenarios effectively.[7]

Following its release, PAR 1.0 saw rapid early adoption within Usenet communities, particularly for safeguarding RAR and ZIP archives that were staples of binary file sharing, as users integrated parity files into postings to mitigate losses from incomplete downloads or server expirations.[2] This uptake addressed a critical pain point in peer-to-peer file distribution, fostering greater reliability for large-scale exchanges. The format's evolution to PAR 2.0 later expanded these capabilities to support more flexible recovery options.[2]

Key Milestones and Contributors
Following the initial proposal of the Parity Volume Set specification in 2001, key developments in Parchive centered on enhancing the format's capabilities to address limitations in handling larger files and improving repair efficiency. Stefan Wehlus and Tobias Rieper played pivotal roles in drafting the initial PAR 1.0 specification, with additional contributions from Roger Harrison (Kilroy Balore), Willem Monsuwe, Karl Vogel, and Ryan Gallagher.[2][1]

A major milestone occurred with the release of the PAR 2.0 specification, initially drafted by Michael Nahas in July 2002 and finalized on May 11, 2003, in collaboration with Peter Clements, Paul Nettle, and Ryan Gallagher.[2][6] This version significantly improved scalability by supporting up to 32,768 data blocks (compared to PAR 1.0's limit of 255) and enhanced efficiency through optimized Reed-Solomon coding adapted from established implementations.[2] PAR 2.0 quickly became the de facto standard for Usenet file verification and repair, superseding PAR 1.0 due to the latter's constraints on file sizes and block counts.[2][1]

Efforts to develop PAR 3.0 began with a draft proposal in 2010 by Yutaka Sawada, which was tested in the MultiPar client but never officially released. Michael Nahas revived the initiative in 2019, leading to an alpha draft specification on January 28, 2022, incorporating ideas from Sawada, animetosho, and malaire to add features like deduplication and support for files up to 2^64 - 1 bytes; however, as of 2025, no completed PAR 3.0 release has materialized.[2][4]

The Parchive project, originally hosted on SourceForge, migrated to GitHub around 2019 to facilitate ongoing maintenance and community contributions, with the original SourceForge site becoming inactive thereafter.[1] Michael Nahas has remained a central figure, authoring the PAR 2.0 specification and leading PAR 3.0 drafts, while Clements contributed early PAR 2.0 implementations.
As of 2025, PAR 2.0 continues to serve as the widely adopted standard.[2][1]

File Format Versions
PAR 1.0
PAR 1.0, the initial version of the Parchive file format, uses specific file extensions to organize its index and parity data. The primary index file employs the .par extension, while individual parity volumes use extensions such as .p01, .p02, and so on, up to .p99, followed by .q00 if more are needed.[7]

The creation process in PAR 1.0 generates parity data for an entire set of input files by computing Reed-Solomon parity symbols byte-by-byte across all files, padding shorter files with zeros to match the length of the largest file. This results in parity volumes whose size equals that of the largest input file, with the index file containing metadata such as filenames, sizes, MD5 hashes for verification, and a set hash. However, this approach lacks per-file metadata granularity, treating the file set as a unified stream without individualized recovery details for separate files.[7]

PAR 1.0 imposes several key limitations stemming from its use of an 8-bit Reed-Solomon code over GF(256). It supports fewer than 256 total files and parity volumes combined, allowing only up to 255 recovery blocks. Additionally, while no explicit block size limit is documented, the byte-level processing effectively constrains handling to sets where the largest file size remains manageable, as parity computation scales directly with it.[7]

PAR 1.0 is considered obsolete primarily due to its inadequacy for large modern files, where parity volumes become impractically large, and its inefficiency for multi-file sets, which suffer from excessive padding and limited scalability.[7] These shortcomings were addressed in subsequent versions like PAR 2.0, which introduced block-based processing for better efficiency.[7]

PAR 2.0
PAR 2.0 represents a significant advancement in the Parchive format, introducing enhanced scalability and efficiency for data recovery in large archives. The format uses the file extension .par2 for the primary index file, which contains essential metadata such as file descriptions and checksums, while recovery volumes are denoted by extensions like .volXX-YY.par2 (e.g., .vol01-04.par2), allowing for segmented storage of parity data. This structure supports multi-volume sets, where recovery data can be distributed across multiple files to manage large datasets without exceeding file size limits imposed by systems like Usenet.[2][6]

Key enhancements in PAR 2.0 include support for up to 32,768 source blocks, enabling the protection of extensive file sets that PAR 1.0 could not handle due to its 8-bit limitations. Block sizes are variable and must be multiples of 4 bytes, chosen based on the archive's needs. Additionally, per-file MD5 hashing provides robust integrity verification, computing a 16-byte hash from the file name, length, and the MD5 of the first 16 KB of content, ensuring precise identification and repair of individual files. These features allow partial recovery, where the format can reconstruct data as long as the number of available recovery blocks at least equals the number of lost or damaged source blocks.[6][2]

For file identification, PAR 2.0 employs a magic number of 50 41 52 32 00 50 4B 54 in hexadecimal (equivalent to "PAR2\0PKT" in ASCII), which appears at the start of each packet header to confirm the format version and facilitate parsing. As of 2025, PAR 2.0 remains actively used, particularly in Usenet communities for archiving and distributing large binary files, where its mature implementation continues to provide reliable error correction without the experimental changes proposed in PAR 3.0 drafts.[6][2]

PAR 3.0
PAR 3.0 represents an experimental draft specification aimed at addressing limitations in earlier Parchive versions, particularly for handling contemporary data storage needs. Proposed in its current form as an alpha draft released on January 28, 2022, by Michael Nahas, the original author of the PAR 2.0 specification, it incorporates ideas from contributors including Yutaka Sawada, animetosho, and malaire. This draft builds on an earlier, abandoned proposal from 2010 developed by Yutaka Sawada and others, which included a test implementation but was discontinued due to lack of progress. An initial effort to revive development began in 2019 through discussions on the par2cmdline GitHub repository, leading to the 2022 draft, which generalizes the error-correction machinery, such as support for any Galois field and multiple linear codes beyond traditional Reed-Solomon.

Key proposed extensions in the PAR 3.0 draft include a new .par3 extension for generated recovery files and support for datasets exceeding the block-count limitations of PAR 2.0 (which caps source blocks at 32,768, restricting protected data to around 64 MB with a 2 KB block size). It introduces 64-bit integers for block sizes and indices, allowing individual files up to 2^64 - 1 bytes and accommodating future storage expansions without inherent size constraints. Additional enhancements encompass deduplication to reuse identical data blocks across files, reducing redundancy requirements; incremental backup capabilities that leverage prior PAR 3.0 recovery data; and "Par inside" embedding to protect compressed archives like ZIP files by integrating recovery packets directly.
While the format does not include built-in compression, it recommends external tools such as gzip with rsyncable options for better integration, and its matrix-based recovery design supports optimizations for multi-threaded repair, including parallel Gaussian elimination for matrix inversion. Despite these advancements, the PAR 3.0 draft faces significant challenges as of 2025, including the absence of a complete reference implementation: the par3cmdline project remains in development without stable releases or widespread testing.[2] Compatibility issues persist, as existing PAR 2.0 tools cannot process PAR 3.0 files, and the draft prioritizes new features over full backward compatibility with PAR 2.0 structures, potentially requiring separate clients for legacy support. The specification briefly notes mechanisms for reading PAR 2.0 files in new clients, but this is not a core focus.

Potential benefits of PAR 3.0 include enhanced efficiency for modern applications such as cloud backups and large-scale data hoarding, where deduplication and larger-file support could minimize storage overhead and improve recovery speeds on multi-core systems. However, with no finalized standard or broad adoption by 2025 (evidenced by ongoing open issues in the par3cmdline repository and limited mentions in preservation-formats documentation), the format remains experimental and unused in production environments.

Technical Specifications
Reed-Solomon Error Correction
Reed-Solomon codes are a class of block-based error-correcting codes that operate over finite fields, known as Galois fields, to enable the detection and correction of errors in data transmission or storage. In Parchive, these codes are implemented over the finite field GF(2^{16}), which consists of 65,536 elements represented as 16-bit words, allowing efficient computation using bitwise operations on words. This field choice facilitates the handling of word-sized symbols, aligning with typical file block sizes in data recovery scenarios.[6][8]

The fundamental structure of a Reed-Solomon code is denoted RS(n, k), where n is the total number of symbols in a codeword (data plus parity) and k is the number of data symbols. The code generates m = n - k parity symbols, enabling correction of up to ⌊m/2⌋ symbol errors in the general case, or up to m erasures (errors with known positions). In mathematical terms, the error-correcting capability t satisfies 2t ≤ m for errors but extends to t ≤ m for erasures, as the known locations reduce the degrees of freedom needed for recovery. This property is derived from the code's minimum distance d = n - k + 1, which bounds the number of correctable errors or erasures.[8]

In Parchive, particularly in the PAR2 format, Reed-Solomon codes are applied to recover lost data blocks treated as erasures: m recovery blocks permit the reconstruction of up to m missing data blocks, provided the total number of blocks n = k + m does not exceed 65,536. This setup ensures that as long as the number of lost blocks does not exceed the available recovery blocks, full data restoration is possible through systematic encoding that preserves the original data while appending parity information.[6]

The encoding process in Parchive computes parity blocks as linear combinations of the data blocks over GF(2^{16}).
Each recovery block is formed by multiplying data block symbols by predefined coefficients, typically powers of a primitive element in the field, and summing the products in the field. Formally, if the data blocks d_j are vectors in GF(2^{16})^s (where s is the block length in 16-bit symbols), the i-th parity block is given by p_i = Σ_j α^{ij} d_j, where α is a primitive element of the field. This Vandermonde-based construction ensures the parity matrix is invertible, allowing recovery.[6][8]

Decoding in Parchive leverages the erasure-aware nature of lost blocks by first computing syndromes to quantify discrepancies. The syndrome vector is given by S_i = Σ_j r_j α^{ij}, where the r_j are the received symbols at the non-erased positions. For erasures, the known positions simplify the process: the Berlekamp-Massey algorithm is employed to derive the error locator polynomial Λ(x), which identifies error (or erasure) positions through the roots of Λ(x) = 0. Error magnitudes are then solved using the Forney formula or an equivalent, yielding e_l = -Ω(X_l^{-1}) / Λ'(X_l^{-1}), where X_l is the locator of the l-th of the v erasures and Ω(x) is the error evaluator polynomial. This algebraic approach efficiently reconstructs the original data by subtracting the errors from the received vector.[6][8]

File Structure and Components
The Parchive file set in the PAR 2.0 format comprises a primary index file with the .par2 extension, which stores metadata about the source files without containing recovery data, along with optional recovery volume files named in the pattern .par2.volXX-YY.par2 (for example, .par2.vol00-09.par2), each holding multiple recovery blocks for data repair.[6] These recovery volumes can be generated in sufficient quantity to recover up to a specified percentage of damaged data, with the overall set being splittable into multiple files for handling large datasets, where critical metadata packets are duplicated across splits to maintain accessibility even if individual files are lost.[6]
The index file is structured as a sequence of variable-length packets, each starting with a common header that includes an 8-byte magic number (PAR2\0PKT), an 8-byte length field (always a multiple of 4 bytes), a 16-byte MD5 checksum of the packet from the Recovery Set ID onward, a 16-byte unique Recovery Set ID shared across the entire Parchive set, and a 16-byte packet type identifier.[6] The core Main Packet (type PAR 2.0\0Main\0\0\0\0) follows the header and specifies the uniform slice size for data blocks (an 8-byte value that must be a multiple of 4), the total count of source files (4 bytes), and two arrays listing the 16-byte File IDs for the non-recovery source files and any recovery files used in creation.[6] This allows derivation of the total block count by dividing the combined sizes of all source files by the slice size, accounting for padding in the final slice of each file if necessary.[6]
For each source file, a File Description Packet (type PAR 2.0\0FileDesc) provides essential metadata: the file's 16-byte File ID (the MD5 hash of the concatenation of the 16 KB hash, the file length, and the filename), a 16-byte MD5 hash of the entire file contents, a 16-byte MD5 hash of the first 16 KB (or of the full file if it is smaller), an 8-byte unsigned integer giving the file size in bytes, and the variable-length ASCII-encoded filename padded to a multiple of 4 bytes.[6] Complementing this, an Input File Slice Checksum Packet (type PAR 2.0\0IFSC\0\0\0\0) lists, after the File ID, an array of per-slice checksums: a 16-byte MD5 hash and a 4-byte CRC-32 for each data slice, enabling granular verification of block integrity.[6] An optional Creator Packet (type PAR 2.0\0Creator\0) records the software used to generate the set, such as the client name in ASCII.[6]
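The File ID derivation and the per-slice checksums can be sketched as below; the little-endian packing of the length field and the zero-padding of the final slice are assumptions to check against the specification, and both function names are illustrative.

```python
import hashlib
import struct
import zlib

def file_id(hash16k: bytes, length: int, name: bytes) -> bytes:
    # File ID = MD5 of (MD5 of first 16 KB || 8-byte length || filename).
    return hashlib.md5(hash16k + struct.pack("<Q", length) + name).digest()

def slice_checksums(data: bytes, slice_size: int):
    """Return an (MD5, CRC-32) pair per slice; the last slice is zero-padded."""
    sums = []
    for off in range(0, len(data), slice_size):
        chunk = data[off:off + slice_size].ljust(slice_size, b"\0")
        sums.append((hashlib.md5(chunk).digest(), zlib.crc32(chunk)))
    return sums
```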
Recovery volume files follow the same packet-based layout as the index file, with identical header structure, but their primary content consists of Recovery Slice Packets (type PAR 2.0\0RecvSlic), each beginning with a 4-byte unsigned integer exponent that identifies the recovery block and determines the coefficients (powers of the primitive element in the Galois field) used to compute it, followed by the fixed-length recovery data matching the slice size from the Main Packet.[6] Each packet carries the usual MD5 checksum in its header for integrity, while the CRC-32 checksums from the index's Input File Slice Checksum Packets support verification of the underlying source data slices during repair operations.[6] The recovery blocks are generated using Reed-Solomon error-correcting codes, allowing reconstruction of any combination of missing source blocks up to the number of available recovery blocks.[6]
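A minimal serializer for such a packet, under the same header layout as the index file and with an illustrative function name, might look like this:

```python
import hashlib
import struct

MAGIC = b"PAR2\0PKT"
RECVSLIC = b"PAR 2.0\0RecvSlic"  # packet type, exactly 16 bytes

def recovery_slice_packet(set_id: bytes, exponent: int, block: bytes) -> bytes:
    """Serialize one Recovery Slice packet: 64-byte common header,
    4-byte little-endian exponent, then the recovery data block."""
    body = struct.pack("<I", exponent) + block
    length = 64 + len(body)
    # Header MD5 covers the set ID, the type, and the body.
    md5 = hashlib.md5(set_id + RECVSLIC + body).digest()
    return MAGIC + struct.pack("<Q", length) + md5 + set_id + RECVSLIC + body
```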
Verification involves scanning the source files referenced in the index, computing their full MD5 hashes and comparing them to the stored values in the File Description Packets, then breaking each file into slices per the specified size and checking both MD5 and CRC-32 against the Input File Slice Checksum Packets to identify any corrupted or missing blocks.[6] If discrepancies are found and sufficient recovery blocks are present in the parity volumes, repair proceeds by treating the available source and recovery blocks as a system of linear equations over the Galois field, solving for the missing blocks using the predefined coefficients from the block indices to regenerate the exact original data.[6] This process ensures that the Parchive set can restore integrity without requiring the original creation software, as long as the index file remains intact.[6]
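The repair step described above, solving for the missing blocks as a linear system over a Galois field, can be sketched as follows. To keep the tables small this toy example works in GF(256) with the common 0x11D polynomial, whereas PAR 2.0 itself uses GF(2^16); all function names are illustrative.

```python
# Toy erasure repair over GF(256); PAR 2.0 works the same way in GF(2^16).
EXP = [0] * 512  # anti-log table, doubled to skip a modular reduction in gmul
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D  # primitive polynomial of this toy field
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gmul(a, b):  # field multiplication via log tables
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def ginv(a):  # multiplicative inverse of a nonzero element
    return EXP[(255 - LOG[a]) % 255]

def encode(data, m):
    """Compute m recovery blocks, c_i[t] = XOR over j of alpha^(i*j) * d_j[t]."""
    s = len(data[0])
    out = []
    for i in range(m):
        block = [0] * s
        for j, d in enumerate(data):
            coef = EXP[(i * j) % 255]
            for t in range(s):
                block[t] ^= gmul(coef, d[t])
        out.append(block)
    return out

def repair(known, lost, recovery):
    """Solve for the data blocks at indices `lost`, given the surviving data
    blocks (dict j -> block) and recovery blocks (dict i -> block)."""
    s = len(next(iter(recovery.values())))
    rows, rhs = [], []
    for i, c in list(recovery.items())[:len(lost)]:
        rows.append([EXP[(i * j) % 255] for j in lost])
        acc = list(c)  # move the known terms to the right-hand side
        for j, d in known.items():
            coef = EXP[(i * j) % 255]
            for t in range(s):
                acc[t] ^= gmul(coef, d[t])
        rhs.append(acc)
    # Gauss-Jordan elimination over the field, one symbol vector per row
    n = len(lost)
    for col in range(n):
        piv = next(r for r in range(col, n) if rows[r][col])
        rows[col], rows[piv] = rows[piv], rows[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        inv = ginv(rows[col][col])
        rows[col] = [gmul(inv, v) for v in rows[col]]
        rhs[col] = [gmul(inv, v) for v in rhs[col]]
        for r in range(n):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [a ^ gmul(f, b) for a, b in zip(rows[r], rows[col])]
                rhs[r] = [a ^ gmul(f, b) for a, b in zip(rhs[r], rhs[col])]
    return rhs
```

Note that the recovery block with exponent 0 uses all-ones coefficients and so degenerates to a plain XOR parity block, which is why a single recovery block always suffices to repair a single missing data block.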
Software Implementations
Command-Line Tools
par2cmdline serves as the primary open-source command-line tool for creating, verifying, and repairing PAR 2.0 files, offering cross-platform support for POSIX systems including Linux and macOS, as well as Windows via compilation with Microsoft Visual Studio.[3] Originally developed on SourceForge and hosted on GitHub since 2015, it implements the full PAR 2.0 specification, enabling users to generate recovery volumes that can restore up to a specified percentage of damaged or missing data using Reed-Solomon error correction.[3] The tool is particularly suited to scripting and server environments thanks to its text-based interface and lack of graphical dependencies. Key features include customizable redundancy levels via the -r option, such as -r25 to create recovery data equivalent to 25% of the source files' size, allowing repair of up to that proportion of data loss.[3] Multi-threading support, enabled through OpenMP and configurable with the -t flag (e.g., -t4 for four threads), accelerates processing on multi-core systems.[3] Volume splitting is handled with options such as -n to specify the number of recovery files (up to 31) or -u for uniform file sizes, facilitating distribution across storage limits or networks.[3] Additional capabilities include block size adjustment with -s for optimized performance on large files, and purging of backup and PAR2 files after a successful repair using -p.
A typical usage example for creating a PAR 2.0 archive with 10% redundancy is:
par2 create -r10 archive.par2 files/*
This produces archive.par2 together with recovery volumes archive.vol*.par2 as needed.[3] Verification runs with par2 verify archive.par2, checking for damage without repairing, while par2 repair archive.par2 attempts automatic recovery if issues are detected.[3]
Related projects include libpar2, a C++ library derived from par2cmdline for developers integrating Parchive functionality into custom applications.[9] Graphical front-ends, such as those built atop libpar2, provide visual alternatives but rely on these core command-line implementations for backend processing.[3]
