Error detection and correction
from Wikipedia
To clean up transmission errors introduced by Earth's atmosphere (left), Goddard scientists applied Reed–Solomon error correction (right), which is commonly used in CDs and DVDs. Typical errors include missing pixels (white) and false signals (black). The white stripe indicates a brief period when transmission was interrupted.

In information theory and coding theory with applications in computer science and telecommunications, error detection and correction (EDAC) or error control are techniques that enable reliable delivery of digital data over unreliable communication channels. Many communication channels are subject to channel noise, and thus errors may be introduced during transmission from the source to a receiver. Error detection techniques allow detecting such errors, while error correction enables reconstruction of the original data in many cases.

Definitions

Error detection is the detection of errors caused by noise or other impairments during transmission from the transmitter to the receiver.

Error correction is the detection of errors and reconstruction of the original, error-free data.

History

In classical antiquity, copyists of the Hebrew Bible were paid for their work according to the number of stichs (lines of verse). As the prose books of the Bible were hardly ever written in stichs, the copyists, in order to estimate the amount of work, had to count the letters.[1] This also helped ensure accuracy in the transmission of the text with the production of subsequent copies.[2][3] Between the 7th and 10th centuries CE a group of Jewish scribes formalized and expanded this to create the Numerical Masorah to ensure accurate reproduction of the sacred text. It included counts of the number of words in a line, section, book and groups of books, noting the middle stich of a book, word use statistics, and commentary.[1] Standards became such that a deviation in even a single letter in a Torah scroll was considered unacceptable.[4] The effectiveness of their error correction method was verified by the accuracy of copying through the centuries demonstrated by discovery of the Dead Sea Scrolls in 1947–1956, dating from c. 150 BCE-75 CE.[5]

The modern development of error correction codes is credited to Richard Hamming in 1947.[6] A description of Hamming's code appeared in Claude Shannon's A Mathematical Theory of Communication[7] and was quickly generalized by Marcel J. E. Golay.[8]

Principles

All error-detection and correction schemes add some redundancy (i.e., some extra data) to a message, which receivers can use to check consistency of the delivered message and to recover data that has been determined to be corrupted. Error detection and correction schemes can be either systematic or non-systematic. In a systematic scheme, the transmitter sends the original (error-free) data and attaches a fixed number of check bits (or parity data), which are derived from the data bits by some encoding algorithm. If error detection is required, a receiver can simply apply the same algorithm to the received data bits and compare its output with the received check bits; if the values do not match, an error has occurred at some point during the transmission. If error correction is required, a receiver can apply the decoding algorithm to the received data bits and the received check bits to recover the original error-free data. In a system that uses a non-systematic code, the original message is transformed into an encoded message carrying the same information and that has at least as many bits as the original message.
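
As a minimal illustration of a systematic scheme (a sketch only, with a single XOR parity byte standing in for the check bits a real encoding algorithm would compute), the sender below appends check data derived from the payload and the receiver re-runs the same algorithm and compares:

```python
from functools import reduce

def encode_systematic(data: bytes) -> bytes:
    """Append one check byte (XOR of all data bytes) to the message."""
    check = reduce(lambda a, b: a ^ b, data, 0)
    return data + bytes([check])

def check_systematic(frame: bytes) -> bool:
    """Receiver re-runs the same algorithm and compares check bytes."""
    data, received_check = frame[:-1], frame[-1]
    return reduce(lambda a, b: a ^ b, data, 0) == received_check

frame = encode_systematic(b"HELLO")
assert check_systematic(frame)             # clean frame passes
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
assert not check_systematic(corrupted)     # a single flipped bit is detected
```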

Good error control performance requires the scheme to be selected based on the characteristics of the communication channel. Common channel models include memoryless models where errors occur randomly and with a certain probability, and dynamic models where errors occur primarily in bursts. Consequently, error-detecting and -correcting codes can be generally distinguished between random-error-detecting/correcting and burst-error-detecting/correcting. Some codes can also be suitable for a mixture of random errors and burst errors.

If the channel characteristics cannot be determined, or are highly variable, an error-detection scheme may be combined with a system for retransmissions of erroneous data. This is known as automatic repeat request (ARQ), and is most notably used in the Internet. An alternate approach for error control is hybrid automatic repeat request (HARQ), which is a combination of ARQ and error-correction coding.

Types of error correction

There are three major types of error correction:[9]

Automatic repeat request

Automatic repeat request (ARQ) is an error control method for data transmission that makes use of error-detection codes, acknowledgment and/or negative acknowledgment messages, and timeouts to achieve reliable data transmission. An acknowledgment is a message sent by the receiver to indicate that it has correctly received a data frame.

Usually, when the transmitter does not receive the acknowledgment before the timeout occurs (i.e., within a reasonable amount of time after sending the data frame), it retransmits the frame until it is either correctly received or the error persists beyond a predetermined number of retransmissions.

Three types of ARQ protocols are Stop-and-wait ARQ, Go-Back-N ARQ, and Selective Repeat ARQ.

ARQ is appropriate if the communication channel has varying or unknown capacity, such as is the case on the Internet. However, ARQ requires the availability of a back channel, results in possibly increased latency due to retransmissions, and requires the maintenance of buffers and timers for retransmissions, which in the case of network congestion can put a strain on the server and overall network capacity.[10]

For example, ARQ is used on shortwave radio data links in the form of ARQ-E, or combined with multiplexing as ARQ-M.

Forward error correction

Forward error correction (FEC) is a process of adding redundant data such as an error-correcting code (ECC) to a message so that it can be recovered by a receiver even when a number of errors (up to the capability of the code being used) are introduced, either during the process of transmission or on storage. Since the receiver does not have to ask the sender for retransmission of the data, a backchannel is not required in forward error correction. Error-correcting codes are used in lower-layer communication such as cellular network, high-speed fiber-optic communication and Wi-Fi,[11][12] as well as for reliable storage in media such as flash memory, hard disk and RAM.[13]

Error-correcting codes are usually distinguished between convolutional codes, which are processed on a bit-by-bit basis, and block codes, which are processed on a block-by-block basis.

Shannon's theorem is an important theorem in forward error correction, and describes the maximum information rate at which reliable communication is possible over a channel that has a certain error probability or signal-to-noise ratio (SNR). This strict upper limit is expressed in terms of the channel capacity. More specifically, the theorem says that there exist codes such that with increasing encoding length the probability of error on a discrete memoryless channel can be made arbitrarily small, provided that the code rate is smaller than the channel capacity. The code rate is defined as the fraction k/n of k source symbols and n encoded symbols.

The actual maximum code rate allowed depends on the error-correcting code used, and may be lower. This is because Shannon's proof was only of existential nature, and did not show how to construct codes that are both optimal and have efficient encoding and decoding algorithms.

Hybrid schemes

Hybrid ARQ is a combination of ARQ and forward error correction. There are two basic approaches:[10]

  • Messages are always transmitted with FEC parity data (and error-detection redundancy). A receiver decodes a message using the parity information and requests retransmission using ARQ only if the parity data was not sufficient for successful decoding (identified through a failed integrity check).
  • Messages are transmitted without parity data (only with error-detection information). If a receiver detects an error, it requests FEC information from the transmitter using ARQ and uses it to reconstruct the original message.

The latter approach is particularly attractive on an erasure channel when using a rateless erasure code.

Types of error detection

Error detection is most commonly realized using a suitable hash function (or specifically, a checksum, cyclic redundancy check or other algorithm). A hash function adds a fixed-length tag to a message, which enables receivers to verify the delivered message by recomputing the tag and comparing it with the one provided.

There exists a vast variety of different hash function designs. However, some are of particularly widespread use because of either their simplicity or their suitability for detecting certain kinds of errors (e.g., the cyclic redundancy check's performance in detecting burst errors).

Minimum distance coding

A random-error-correcting code based on minimum distance coding can provide a strict guarantee on the number of detectable errors, but it may not protect against a preimage attack.

Repetition codes

A repetition code is a coding scheme that repeats the bits across a channel to achieve error-free communication. Given a stream of data to be transmitted, the data are divided into blocks of bits. Each block is transmitted some predetermined number of times. For example, to send the bit pattern 1011, the four-bit block can be repeated three times, thus producing 1011 1011 1011. If this twelve-bit pattern was received as 1010 1011 1011 – where the first block is unlike the other two – an error has occurred.

A repetition code is very inefficient and can be susceptible to problems if the error occurs in exactly the same place for each group (e.g., 1010 1010 1010 in the previous example would be detected as correct). The advantage of repetition codes is that they are extremely simple, and are in fact used in some transmissions of numbers stations.[14][15]
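
The Python sketch below mirrors the example above: a mismatch among the repeated copies signals an error, while the pathological case of identical errors in every copy goes undetected.

```python
def repetition_encode(block: str, copies: int = 3) -> str:
    """Transmit the block several times in a row."""
    return block * copies

def repetition_check(received: str, block_len: int = 4) -> bool:
    """No error detected only if all repeated copies agree."""
    copies = [received[i:i + block_len] for i in range(0, len(received), block_len)]
    return all(c == copies[0] for c in copies)

print(repetition_encode("1011"))          # 101110111011
print(repetition_check("101110111011"))   # True  -> no error detected
print(repetition_check("101010111011"))   # False -> copies disagree, error detected
print(repetition_check("101010101010"))   # True  -> identical errors go undetected
```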

Parity bit

A parity bit is a bit that is added to a group of source bits to ensure that the number of set bits (i.e., bits with value 1) in the outcome is even or odd. It is a very simple scheme that can be used to detect single or any other odd number (i.e., three, five, etc.) of errors in the output. An even number of flipped bits will make the parity bit appear correct even though the data is erroneous.

Parity bits added to each word sent are called transverse redundancy checks, while those added at the end of a stream of words are called longitudinal redundancy checks. For example, if each of a series of m-bit words has a parity bit added, showing whether there were an odd or even number of ones in that word, any word with a single error in it will be detected. It will not be known where in the word the error is, however. If, in addition, after each stream of n words a parity sum is sent, each bit of which shows whether there were an odd or even number of ones at that bit-position sent in the most recent group, the exact position of the error can be determined and the error corrected. This method is only guaranteed to be effective, however, if there is no more than one error in every group of n words. With more error-correction bits, more errors can be detected and in some cases corrected.
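
A small Python sketch of this row-and-column (transverse plus longitudinal) parity arrangement, locating and correcting a single flipped bit; the 3x4 data block is an arbitrary example:

```python
def parity(bits):
    """Even-parity value of a sequence of bits."""
    return sum(bits) % 2

def encode_2d(words):
    """Append a transverse parity bit to each word, then a longitudinal parity word."""
    rows = [w + [parity(w)] for w in words]
    column_parity = [parity(col) for col in zip(*rows)]
    return rows + [column_parity]

def locate_single_error(block):
    """Return (row, col) of a single flipped bit, or None if all parities agree."""
    bad_rows = [r for r, row in enumerate(block[:-1]) if parity(row) != 0]
    bad_cols = [c for c, col in enumerate(zip(*block)) if parity(col) != 0]
    if bad_rows and bad_cols:
        return bad_rows[0], bad_cols[0]
    return None

block = encode_2d([[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 1, 0]])
block[1][2] ^= 1                      # flip one data bit in transit
row, col = locate_single_error(block)
block[row][col] ^= 1                  # correct it
assert locate_single_error(block) is None
```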

There are also other bit-grouping techniques.

Checksum

A checksum of a message is a modular arithmetic sum of message code words of a fixed word length (e.g., byte values). The sum may be negated by means of a ones'-complement operation prior to transmission to detect unintentional all-zero messages.

Checksum schemes include parity bits, check digits, and longitudinal redundancy checks. Some checksum schemes, such as the Damm algorithm, the Luhn algorithm, and the Verhoeff algorithm, are specifically designed to detect errors commonly introduced by humans in writing down or remembering identification numbers.

Cyclic redundancy check

A cyclic redundancy check (CRC) is a non-secure hash function designed to detect accidental changes to digital data in computer networks. It is not suitable for detecting maliciously introduced errors. It is characterized by specification of a generator polynomial, which is used as the divisor in a polynomial long division over a finite field, taking the input data as the dividend. The remainder becomes the result.

A CRC has properties that make it well suited for detecting burst errors. CRCs are particularly easy to implement in hardware and are therefore commonly used in computer networks and storage devices such as hard disk drives.

The parity bit can be seen as a special-case 1-bit CRC.

Cryptographic hash function

The output of a cryptographic hash function, also known as a message digest, can provide strong assurances about data integrity, whether changes of the data are accidental (e.g., due to transmission errors) or maliciously introduced. Any modification to the data will likely be detected through a mismatching hash value. Furthermore, given some hash value, it is typically infeasible to find some input data (other than the one given) that will yield the same hash value. If an attacker can change not only the message but also the hash value, then a keyed hash or message authentication code (MAC) can be used for additional security. Without knowing the key, it is not possible for the attacker to easily or conveniently calculate the correct keyed hash value for a modified message.

Digital signature

Digital signatures can provide strong assurances about data integrity, whether the changes of the data are accidental or maliciously introduced. Digital signatures are perhaps most notable for being part of the HTTPS protocol for securely browsing the web.

Error correction code

Any error-correcting code can be used for error detection. A code with minimum Hamming distance, d, can detect up to d − 1 errors in a code word. Using minimum-distance-based error-correcting codes for error detection can be suitable if a strict limit on the minimum number of errors to be detected is desired.

Codes with minimum Hamming distance d = 2 are degenerate cases of error-correcting codes and can be used to detect single errors. The parity bit is an example of a single-error-detecting code.

Applications

Applications that require low latency (such as telephone conversations) cannot use automatic repeat request (ARQ); they must use forward error correction (FEC). By the time an ARQ system discovers an error and re-transmits it, the re-sent data will arrive too late to be usable.

Applications where the transmitter immediately forgets the information as soon as it is sent (such as most television cameras) cannot use ARQ; they must use FEC because when an error occurs, the original data is no longer available.

Applications that use ARQ must have a return channel; applications having no return channel cannot use ARQ. Applications that require extremely low error rates (such as digital money transfers) must use ARQ due to the possibility of uncorrectable errors with FEC.

Reliability and inspection engineering also make use of the theory of error-correcting codes,[16] as well as natural language.[17]

Internet

In a typical TCP/IP stack, error control is performed at multiple levels:

  • Each Ethernet frame uses CRC-32 error detection. Frames with detected errors are discarded by the receiver hardware.
  • The IPv4 header contains a checksum protecting the contents of the header. Packets with incorrect checksums are dropped within the network or at the receiver.
  • The checksum was omitted from the IPv6 header in order to minimize processing costs in network routing and because current link layer technology is assumed to provide sufficient error detection (see also RFC 3819).
  • UDP has an optional checksum covering the payload and addressing information in the UDP and IP headers. Packets with incorrect checksums are discarded by the network stack. The checksum is optional under IPv4, and required under IPv6. When omitted, it is assumed the data-link layer provides the desired level of error protection.
  • TCP provides a checksum for protecting the payload and addressing information in the TCP and IP headers. Packets with incorrect checksums are discarded by the network stack and eventually get retransmitted using ARQ, either explicitly (such as through three-way handshake) or implicitly due to a timeout.

Deep-space telecommunications

The development of error-correction codes was tightly coupled with the history of deep-space missions due to the extreme dilution of signal power over interplanetary distances, and the limited power availability aboard space probes. Whereas early missions sent their data uncoded, starting in 1968, digital error correction was implemented in the form of (sub-optimally decoded) convolutional codes and Reed–Muller codes.[18] The Reed–Muller code was well suited to the noise the spacecraft was subject to (approximately matching a bell curve), and was implemented for the Mariner spacecraft and used on missions between 1969 and 1977.

The Voyager 1 and Voyager 2 missions, which started in 1977, were designed to deliver color imaging and scientific information from Jupiter and Saturn.[19] This resulted in increased coding requirements, and thus, the spacecraft were supported by (optimally Viterbi-decoded) convolutional codes that could be concatenated with an outer Golay (24,12,8) code. The Voyager 2 craft additionally supported an implementation of a Reed–Solomon code. The concatenated Reed–Solomon–Viterbi (RSV) code allowed for very powerful error correction, and enabled the spacecraft's extended journey to Uranus and Neptune. After ECC system upgrades in 1989, both crafts used V2 RSV coding.

The Consultative Committee for Space Data Systems currently recommends usage of error correction codes with performance similar to the Voyager 2 RSV code as a minimum. Concatenated codes are increasingly falling out of favor with space missions, and are replaced by more powerful codes such as Turbo codes or LDPC codes.

The different kinds of deep space and orbital missions that are conducted suggest that trying to find a one-size-fits-all error correction system will be an ongoing problem. For missions close to Earth, the nature of the noise in the communication channel is different from that which a spacecraft on an interplanetary mission experiences. Additionally, as a spacecraft increases its distance from Earth, the problem of correcting for noise becomes more difficult.

Satellite broadcasting

The demand for satellite transponder bandwidth continues to grow, fueled by the desire to deliver television (including new channels and high-definition television) and IP data. Transponder availability and bandwidth constraints have limited this growth. Transponder capacity is determined by the selected modulation scheme and the proportion of capacity consumed by FEC.

Data storage

Error detection and correction codes are often used to improve the reliability of data storage media.[20] A parity track capable of detecting single-bit errors was present on the first magnetic tape data storage in 1951. The optimal rectangular code used in group coded recording tapes not only detects but also corrects single-bit errors. Some file formats, particularly archive formats, include a checksum (most often CRC-32) to detect corruption and truncation and can employ redundancy or parity files to recover portions of corrupted data. Reed–Solomon codes are used in compact discs to correct errors caused by scratches.

Modern hard drives use Reed–Solomon codes to detect and correct minor errors in sector reads, and to recover corrupted data from failing sectors and store that data in the spare sectors.[21] RAID systems use a variety of error correction techniques to recover data when a hard drive completely fails. Filesystems such as ZFS or Btrfs, as well as some RAID implementations, support data scrubbing and resilvering, which allows bad blocks to be detected and (hopefully) recovered before they are used.[22] The recovered data may be re-written to exactly the same physical location, to spare blocks elsewhere on the same piece of hardware, or the data may be rewritten onto replacement hardware.

Error-correcting memory

Dynamic random-access memory (DRAM) may provide stronger protection against soft errors by relying on error-correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for mission-critical applications, such as scientific computing, financial, medical, etc. as well as extraterrestrial applications due to the increased radiation in space.

Error-correcting memory controllers traditionally use Hamming codes, although some use triple modular redundancy. Interleaving allows distributing the effect of a single cosmic ray potentially upsetting multiple physically neighboring bits across multiple words by associating neighboring bits to different words. As long as a single-event upset (SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error-correcting code), and the illusion of an error-free memory system may be maintained.[23]

In addition to hardware providing features required for ECC memory to operate, operating systems usually contain related reporting facilities that are used to provide notifications when soft errors are transparently recovered. One example is the Linux kernel's EDAC subsystem (previously known as Bluesmoke), which collects the data from error-checking-enabled components inside a computer system; besides collecting and reporting back the events related to ECC memory, it also supports other checksumming errors, including those detected on the PCI bus.[24][25][26] A few systems[specify] also support memory scrubbing to catch and correct errors early before they become unrecoverable.

from Grokipedia
Error detection and correction are techniques employed in digital communication and storage systems to identify errors introduced during transmission or storage and, in the case of correction, to restore the original data without retransmission. These methods rely on adding redundant information to the original data, allowing the receiver to detect discrepancies caused by noise, interference, or physical defects, thereby ensuring data integrity across unreliable channels.

The fundamental principle underlying these techniques is the use of channel coding, which introduces redundancy through codewords separated by a minimum Hamming distance—the number of positions at which two strings differ—sufficient to distinguish errors. Error detection identifies anomalies but requires retransmission or other recovery, while correction actively repairs errors up to a certain threshold; for instance, a code with minimum Hamming distance $d \geq 2t + 1$ can correct up to $t$ errors. This distinction is critical, as detection schemes like parity checks or cyclic redundancy checks (CRC) are simpler and detect single or burst errors with low overhead, whereas correction methods demand more redundancy and greater decoding complexity.

Key error-detecting codes include the parity bit, which appends a single bit to achieve even or odd parity and detects odd-numbered errors, and CRC codes, which use polynomial division to generate checksums effective against burst errors in networks and storage. For correction, prominent examples are Hamming codes, such as the (7,4) code that uses three parity bits to correct single-bit errors in four data bits, and more advanced linear block or convolutional codes applied in high-reliability scenarios. These approaches trace their theoretical foundations to Claude Shannon's 1948 work on information theory, which established limits on error-free communication over noisy channels, though practical codes approximate but do not fully achieve the Shannon limit.

Applications of error detection and correction span critical domains, including computer memory (e.g., single-error-correcting double-error-detecting codes in DRAM), satellite and deep-space communications (e.g., convolutional codes with Viterbi decoding), and storage media like CDs and hard drives (e.g., Reed–Solomon codes for burst error correction). Trade-offs in design involve balancing code rate (data bits to total bits), computational complexity, and error protection levels, tailored to channel characteristics like random bit flips or burst errors. Ongoing advancements continue to enhance efficiency, particularly in emerging fields like quantum computing, where specialized codes address decoherence.

Fundamentals

Definitions

Error detection is the process of identifying the presence of errors in data that has been transmitted over a communication channel or stored in a medium, without necessarily determining the location or correcting the errors. This technique typically involves adding redundant information to the original data, allowing the receiver to verify integrity and flag potential issues for further action. Error correction extends beyond mere identification by not only detecting errors but also locating their positions and determining their nature to restore the original data as accurately as possible. It relies on sufficient redundancy to enable the receiver to reconstruct the intended message even when multiple errors occur within a defined tolerance.

Error control coding encompasses the systematic addition of redundant bits or symbols to the source data, forming codewords that facilitate either error detection, correction, or both, thereby enhancing the reliability of data transmission over noisy channels. This approach trades off some transmission efficiency for improved error resilience, as seen in various coding schemes used in digital communications.

The primary distinction between error detection and error correction lies in their outcomes: detection typically flags errors to trigger mechanisms like retransmission in protocols such as automatic repeat request (ARQ), while correction enables automatic fixing of errors up to a certain rate through forward error correction (FEC) without requiring sender involvement. Detection is simpler and requires less redundancy, suiting scenarios where channel feedback is available, whereas correction is more robust for one-way or high-latency links.

A basic example involves a single-bit error in transmission, such as a transmitted byte "10110110" becoming "10111110" due to flipping the fifth bit; simple parity detection can identify the parity mismatch and signal the error, while a more advanced code like the Hamming code could both detect and correct it by pinpointing the flipped bit. The Hamming distance serves as a foundational measure in both processes, quantifying the number of positions at which two codewords differ to assess error-handling capabilities.

Error Types and Models

Errors in digital communication systems are broadly classified into single-bit errors, multi-bit errors, and burst errors. Single-bit errors occur when only one bit in a transmitted sequence is altered, typically due to random noise such as thermal noise in channels. Multi-bit errors involve alterations to two or more bits, which may arise from more severe interference or hardware faults, and can challenge error-correcting codes designed primarily for isolated flips. Burst errors consist of consecutive erroneous bits within a defined block or window, often caused by phenomena like impulse noise or fading in channels, where multiple adjacent bits are corrupted simultaneously.

Mathematical models characterize these errors to analyze channel behavior and code performance. The binary symmetric channel (BSC) models symmetric errors, assuming each bit is independently flipped with probability $p$ (where $0 < p < 0.5$), representing random, uncorrelated bit errors common in additive white Gaussian noise environments. In contrast, the binary asymmetric channel (BAC) accounts for unequal flip probabilities, such as $p_0$ for 0-to-1 transitions and $p_1$ for 1-to-0, which better suits channels with biased noise like certain optical or storage systems. For bursty channels exhibiting clustered errors, the Gilbert-Elliott model describes a two-state Markov process: a "good" state with low error probability and a "bad" state with high error probability, enabling simulation of error bursts in fading or multipath propagation scenarios.

Key metrics quantify error prevalence across these models. The bit error rate (BER) measures the fraction of bits received incorrectly out of the total transmitted bits, serving as a fundamental indicator of channel reliability in binary systems. The symbol error rate (SER) extends this to higher-order modulation schemes, representing the probability that an entire symbol (comprising multiple bits) is decoded erroneously, which is particularly relevant for evaluating performance in M-ary signaling. The packet error rate (PER), defined as the ratio of packets containing at least one error to the total packets sent, assesses end-to-end reliability in networked systems, aggregating bit-level errors into practical data unit failures. Understanding code performance requires assumptions about error distributions, such as independence in the BSC for capacity calculations or burst lengths in the Gilbert-Elliott model for throughput predictions, as these inform the design of detection and correction strategies under modeled conditions.
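
A small Python sketch, under assumed example parameters, that simulates the binary symmetric channel and a Gilbert-Elliott two-state channel and measures the resulting bit error rate:

```python
import random

def bsc(bits, p, rng):
    """Binary symmetric channel: each bit flips independently with probability p."""
    return [b ^ (rng.random() < p) for b in bits]

def gilbert_elliott(bits, p_good, p_bad, p_g2b, p_b2g, rng):
    """Two-state Markov channel: low error rate in the good state, high in the bad state."""
    out, state = [], "good"
    for b in bits:
        p = p_good if state == "good" else p_bad
        out.append(b ^ (rng.random() < p))
        if state == "good" and rng.random() < p_g2b:
            state = "bad"
        elif state == "bad" and rng.random() < p_b2g:
            state = "good"
    return out

def ber(sent, received):
    """Bit error rate: fraction of received bits that differ from the transmitted bits."""
    return sum(s != r for s, r in zip(sent, received)) / len(sent)

rng = random.Random(0)
tx = [rng.randint(0, 1) for _ in range(100_000)]
print("BSC BER:            ", ber(tx, bsc(tx, 0.01, rng)))
print("Gilbert-Elliott BER:", ber(tx, gilbert_elliott(tx, 0.0001, 0.2, 0.01, 0.1, rng)))
```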

Historical Overview

Origins and Early Developments

In the 19th century, the rise of electrical telegraphy introduced the need for reliable long-distance communication, with engineers like Émile Baudot developing synchronous coding systems in the 1870s that used fixed-length binary-like signals to improve transmission efficiency and synchronization. Baudot's approach enabled multiplexed channels, which supported better timing uniformity in electrical signaling.

World War II accelerated theoretical advancements in communication reliability, as cryptographers and engineers grappled with noisy channels in military applications; Claude Shannon's seminal 1948 paper, "A Mathematical Theory of Communication," established information theory's core principles, demonstrating that reliable transmission is possible over imperfect channels by quantifying noise limits and redundancy needs. This work provided the mathematical groundwork for distinguishing detectable errors from correctable ones in digital systems.

By the early 1950s, practical computing demands drove innovations in error control, exemplified by Richard Hamming's 1950 paper at Bell Labs, where frustrations with frequent failures in relay-based computers—often due to unreliable vacuum tube components—prompted the development of systematic error-correcting codes. Hamming's approach extended simple parity checks, which had emerged in the late 1940s for basic error detection in punched-card readers and memory systems, into more robust schemes capable of single-error correction. The parity bit, a single redundant bit ensuring an even or odd number of 1s in a data word, became a standard for simple detection in 1950s computing hardware, such as IBM's early disk systems. The Hamming code, as an early outcome, illustrated how parity bits could be arranged in strategic positions to pinpoint and fix errors automatically.

Key Milestones in the 20th Century

In 1950, Richard W. Hamming introduced Hamming codes, a class of linear error-correcting codes capable of detecting up to two errors and correcting single errors in binary data transmission, motivated by the need to handle failures in early computing machines at Bell Laboratories. These codes, constructed using parity-check matrices derived from all nonzero binary vectors of a fixed length, represented a foundational advancement in systematic error correction, enabling reliable operation in punched-card readers and vacuum-tube computers without manual intervention.

The late 1950s and early 1960s saw significant progress in cyclic codes, a subclass of linear block codes where any cyclic shift of a codeword remains a codeword, facilitating efficient encoding and decoding via shift registers. In 1959, Alexis Hocquenghem developed a method for constructing cyclic codes with specified error-correcting capabilities over finite fields, followed independently in 1960 by Raj Chandra Bose and Dinesh K. Ray-Chaudhuri, who extended the approach to binary group codes capable of correcting multiple random errors. These Bose-Chaudhuri-Hocquenghem (BCH) codes, built using roots of unity in Galois fields, provided a flexible framework for designing codes with predetermined minimum distances, influencing subsequent standards in digital communications. In 1961, W. Wesley Peterson further advanced cyclic codes by demonstrating their use for error detection through polynomial division, laying the groundwork for practical implementations like cyclic redundancy checks (CRC) in noisy channels.

Also in 1960, Irving S. Reed and Gustave Solomon proposed Reed-Solomon (RS) codes, a non-binary cyclic code variant operating over finite fields of order greater than the symbol size, particularly effective for correcting burst errors common in storage and transmission systems. RS codes achieve the Singleton bound, offering optimal trade-offs between code rate and error-correcting capability, and were initially motivated by applications in military communications requiring resilience to consecutive symbol errors. Their structure, based on evaluating polynomials at distinct points, enabled decoding algorithms that locate and correct errors up to half the minimum distance, making them suitable for early satellite and telemetry systems.

In 1967, Andrew J. Viterbi developed the Viterbi algorithm, a dynamic programming method for maximum-likelihood decoding of convolutional codes, whose encoders are modeled as finite-state machines producing continuous streams of coded data. Although proposed theoretically, the algorithm gained practical traction in the 1970s for decoding convolutional codes in deep-space missions and satellite communications, where it efficiently navigated trellis structures to minimize computational complexity while approaching Shannon limits for low-rate codes.

During the 1960s, the International Telegraph and Telephone Consultative Committee (CCITT, now ITU-T) standardized CRC polynomials for telecommunications, adopting specific generator polynomials like $x^{16} + x^{12} + x^5 + 1$ for 16-bit checks to detect burst errors in asynchronous data links and early packet networks. These standards, formalized in recommendations such as V.41 by the early 1970s but developed amid 1960s research, ensured interoperability in international telephony and data transmission by specifying CRC computation for frame validation.
In the 1980s and 1990s, concatenated codes—combining an outer code for burst correction with an inner code for random errors—emerged as a high-performance approach, building on John M. Wozencraft's early ideas from his 1957 MIT thesis on sequential decoding and further refined in G. David Forney's 1966 work. Supervised by Wozencraft, Forney's concatenated schemes demonstrated exponential error reduction with polynomial-time decoding, influencing deep-space probes and CD-ROM storage. Concurrently, Automatic Repeat reQuest (ARQ) protocols rose in prominence within the ARPANET, the precursor to the internet, where the 1822 host-IMP protocol incorporated acknowledgment-based retransmissions to handle packet errors over unreliable links, enabling reliable end-to-end communication across the network's initial deployments in the 1970s and expansions through the 1980s.

Recent Advances

In the late 1990s and early 2000s, turbo codes emerged as a breakthrough in error-correcting coding, introduced by Claude Berrou, Alain Glavieux, and Punya Thitimajshima in 1993. These parallel concatenated convolutional codes, decoded via iterative algorithms, achieve performance within 0.7 dB of the Shannon limit at a bit error rate (BER) of 10^{-5}, enabling near-capacity operation in noisy channels. Their adoption in 3G mobile standards like UMTS marked a shift toward capacity-approaching codes for wireless communications. Low-density parity-check (LDPC) codes, originally conceived by Robert Gallager in 1963, underwent a renaissance in the early 2000s with the development of efficient iterative belief propagation decoding algorithms, notably advanced by Richardson and Urbanke in 2001 through density evolution techniques that optimized code ensembles for capacity achievement. This revival led to their standardization in the DVB-S2 satellite broadcasting standard in 2005, where LDPC codes with rates from 1/4 to 9/10 provide coding gains up to 10 dB, supporting high-throughput video delivery over fading channels. Further, in 2018, 3GPP adopted quasi-cyclic LDPC codes for 5G NR data channels in Release 15, enabling flexible rates and block lengths up to 8448 bits while maintaining low latency and high reliability for enhanced mobile broadband. The 2010s saw the introduction of polar codes by Erdal Arikan in 2009, which exploit channel polarization to achieve symmetric capacity on binary-input discrete memoryless channels through a recursive construction that transforms unreliable channels into reliable ones. These codes were selected by 3GPP in 2016 for 5G NR control channels due to their short-blocklength efficiency and low decoding complexity via successive cancellation, supporting block error rates below 10^{-3} for URLLC scenarios. In quantum error correction, stabilizer codes, formalized by Daniel Gottesman in 1997, and surface codes, proposed by Alexei Kitaev in 1997, provide the theoretical foundation for fault-tolerant quantum computing by encoding logical qubits into entangled states that detect and correct errors without collapsing superposition. Practical advances accelerated in the 2020s, with Google demonstrating in 2024 a distance-7 surface code logical qubit on their Willow processor using 101 physical qubits, achieving a logical error rate of 0.143% ± 0.003% per cycle below the surface code threshold, with a suppression factor of 2.14 ± 0.02 when increasing code distance by 2. In November 2025, IBM announced new quantum processors with increased qubit connectivity, enabling execution of error-corrected circuits with 30% more complexity than previous generations, advancing toward fault-tolerant systems by 2029. Emerging trends in the 2020s integrate artificial intelligence into error control, particularly for 6G networks, where machine learning is explored to enhance decoding of LDPC and polar codes. These developments, including AI-driven adaptive schemes, are being investigated in 3GPP Release 18 and 19 standardization efforts for beyond-5G reliability as of 2025.

Core Principles

Hamming Distance

In coding theory, the Hamming distance between two equal-length strings $u$ and $v$ over an alphabet, denoted $d(u, v)$, is defined as the number of positions at which the corresponding symbols differ. This metric quantifies the dissimilarity between codewords and serves as a fundamental measure for assessing the robustness of error-correcting codes against transmission errors. For instance, consider the binary strings 000 and 111; their Hamming distance is $d(000, 111) = 3$, as all three positions differ. Similarly, for 101 and 110, the distance is $d(101, 110) = 2$, reflecting differences in the second and third positions. These examples illustrate how the distance captures the minimal changes needed to transform one string into another, which is crucial for identifying erroneous receptions.

The minimum Hamming distance $d$ of a code—the smallest distance between any two distinct codewords—plays a pivotal role in distinguishing valid codewords from those corrupted by errors. In a channel prone to errors, such as the binary symmetric channel, an erroneous version of a transmitted codeword lies within a Hamming ball of radius equal to the number of errors around the original; a sufficiently large minimum distance ensures that these balls around different codewords do not overlap for small error counts, enabling unique identification and recovery of the intended codeword. This separation property underpins the design of codes that can reliably detect or correct errors by mapping received words back to the nearest valid codeword.

A key performance limit derived from the Hamming distance is the Hamming bound, also called the sphere-packing bound, which establishes an upper bound on the maximum number of codewords $A(n, d)$ in a binary code of length $n$ and minimum distance $d$. The bound is given by
$$A(n, d) \leq \frac{2^n}{\sum_{k=0}^{t} \binom{n}{k}},$$
where $t = \left\lfloor \frac{d-1}{2} \right\rfloor$ represents the error-correction capability. This inequality arises from the non-overlapping packing of Hamming spheres of radius $t$ around each codeword within the entire space of $2^n$ possible strings, providing a theoretical ceiling on code efficiency.

The Hamming distance concept generalizes naturally to $q$-ary codes, where the alphabet has $q$ symbols (e.g., $q = 4$ for quaternary codes). In this setting, $d(u, v)$ remains the count of positions where $u$ and $v$ differ, but the total space size becomes $q^n$, and the sphere-packing bound adjusts accordingly to $A(n, d) \leq q^n / \sum_{k=0}^{t} \binom{n}{k} (q-1)^k$. This extension is essential for non-binary applications, such as multilevel signaling in communications, where larger alphabets enhance data rates while maintaining error resilience through distance-based separation.
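
A short Python sketch of these definitions, computing the Hamming distance of the worked examples and the sphere-packing (Hamming) bound:

```python
from math import comb

def hamming_distance(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    assert len(u) == len(v)
    return sum(a != b for a, b in zip(u, v))

def hamming_bound(n: int, d: int, q: int = 2) -> int:
    """Sphere-packing upper bound on the number of codewords A(n, d) over a q-ary alphabet."""
    t = (d - 1) // 2
    sphere = sum(comb(n, k) * (q - 1) ** k for k in range(t + 1))
    return q ** n // sphere

print(hamming_distance("000", "111"))   # 3
print(hamming_distance("101", "110"))   # 2
print(hamming_bound(7, 3))              # 16 -> met with equality by the (7,4) Hamming code
```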

Detection and Correction Capabilities

The error detection and correction capabilities of a code are fundamentally determined by its minimum Hamming distance $d$, which is the smallest number of positions in which any two distinct codewords differ. A code with minimum distance $d$ can detect up to $d-1$ errors, as any change of fewer than $d$ symbols in a codeword cannot result in another valid codeword, allowing the receiver to identify that an error has occurred. For error correction, such a code can reliably correct up to $t = \left\lfloor \frac{d-1}{2} \right\rfloor$ errors. This follows from the sphere-packing argument: around each codeword, consider a Hamming sphere of radius $t$; these spheres are disjoint because the distance between any two codewords exceeds $2t$, ensuring that any received word within distance $t$ of a codeword belongs uniquely to that codeword's sphere and can be decoded to it via nearest-neighbor decoding.

An important upper bound on the minimum distance is given by the Singleton bound, which states that for a code of length $n$ and dimension $k$ (where $k$ is the number of information symbols), $d \leq n - k + 1$. Codes achieving equality in this bound are known as maximum distance separable (MDS) codes, such as Reed-Solomon codes, which maximize error correction capability for a given rate.

These capabilities come with inherent trade-offs: increasing $d$ to enhance detection and correction reduces the code rate $R = k/n$, as more redundancy (parity symbols) is required to separate codewords farther apart, limiting the information throughput. In erasure channels, where error positions are known to the receiver, a code with minimum distance $d$ can correct up to $d-1$ erasures, since the known positions allow solving for the missing symbols using the remaining reliable ones without ambiguity.
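
A minimal Python helper summarizing these relationships for a given (n, k, d); the two example codes are standard illustrations, not an exhaustive list:

```python
def capabilities(n: int, k: int, d: int) -> dict:
    """Detection/correction limits implied by minimum distance d, plus the Singleton bound check."""
    return {
        "detectable_errors": d - 1,
        "correctable_errors": (d - 1) // 2,
        "correctable_erasures": d - 1,
        "is_mds": d == n - k + 1,   # maximum distance separable (Singleton bound met with equality)
    }

print(capabilities(7, 4, 3))       # Hamming(7,4): detect 2, correct 1, not MDS
print(capabilities(255, 223, 33))  # RS(255,223): detect 32, correct 16, MDS
```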

Error Detection Methods

Parity Checks

Parity checks represent one of the simplest and earliest methods for error detection in digital systems, involving the addition of a redundant bit to a block of data to verify its integrity. This technique relies on the concept of parity, which refers to whether the number of 1s in a binary string is even or odd. By appending a parity bit, the sender ensures that the total number of 1s in the transmitted data meets a predefined parity rule, allowing the receiver to detect discrepancies indicative of errors. In even parity, the parity bit is set to 0 if the data already has an even number of 1s, or to 1 if it has an odd number, resulting in an even total count of 1s across the data and parity bit. Conversely, odd parity sets the bit to achieve an odd total number of 1s. The parity bit is typically computed as the exclusive-OR (XOR) of all data bits, where even parity uses the direct XOR result and odd parity inverts it. This method detects any odd number of bit flips in the transmitted data, as an even parity scheme will yield an odd count (or vice versa) upon reception if an odd number of errors occurred. However, it fails to detect an even number of errors, such as two bit flips, which would preserve the overall parity. For example, consider the 4-bit data word 1011, which has three 1s (odd count). To apply even parity, the sender computes the XOR of the bits (1 XOR 0 XOR 1 XOR 1 = 1) and sets the parity bit to 1 to make the total even, yielding the 5-bit transmission 10111. At the receiver, recounting the 1s (four) confirms even parity, indicating no error if transmission was intended to be even. If a single bit flips, say the last data bit to 10101 (with parity 1, total three 1s), the receiver detects odd parity and identifies an error. This single-bit parity is commonly applied to bytes in memory systems for basic single-error detection. To enhance detection capabilities, particularly for burst errors—consecutive bit flips—block parity, also known as two-dimensional (2D) parity, extends the concept by organizing data into a matrix and adding parity bits for both rows and columns. In this scheme, each row of data bits receives a row parity bit (computed via XOR as in single-bit parity), and an additional column of parity bits is calculated for the rows, including a parity bit for the row parity bits themselves. This structure can detect all single-bit errors, double-bit errors, and burst errors up to length 2, as errors in a row or column will mismatch the corresponding parity. For instance, in a 4x4 data matrix, the resulting 5x5 block (with parities) allows identification of error locations through mismatched row and column parities, though it provides only detection, not correction. Despite its simplicity and low overhead (typically 12.5% for byte-level parity), the parity check method has notable limitations, including its inability to detect even-numbered errors and vulnerability to undetected bursts longer than the scheme's capability. It has been widely used in early computer memory for single-bit error detection and in foundational RAID configurations to identify disk failures through parity inconsistencies, though modern systems often supplement or replace it with more robust techniques like checksums for multi-byte validation.
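
A minimal Python sketch of the single even-parity bit described above, including the undetected double-error case:

```python
from functools import reduce

def parity_bit(bits, even=True):
    """XOR of the data bits; inverted for odd parity."""
    p = reduce(lambda a, b: a ^ b, bits)
    return p if even else p ^ 1

def check_even_parity(frame):
    """Frame = data bits + parity bit; valid if the total number of 1s is even."""
    return sum(frame) % 2 == 0

data = [1, 0, 1, 1]
frame = data + [parity_bit(data)]   # -> [1, 0, 1, 1, 1]
print(check_even_parity(frame))     # True:  no error detected
frame[3] ^= 1                       # flip the last data bit
print(check_even_parity(frame))     # False: single-bit error detected
frame[0] ^= 1                       # flip a second bit
print(check_even_parity(frame))     # True:  the double error goes undetected
```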

Checksums

Checksums are simple arithmetic methods used to detect errors in blocks of data by appending a computed value that allows the receiver to verify integrity. These techniques compute a fixed-size value, typically 8 or 16 bits, based on the sum or exclusive-OR of data words, providing a lightweight alternative to more complex codes for applications where correction is not required.

The Internet Checksum, specified in RFC 1071, is a widely adopted 16-bit checksum used in protocols like IP and TCP. It operates by dividing the data into 16-bit words and computing their one's complement sum, handling any odd-length data by padding with a zero byte. The process involves adding the words with end-around carry, where any carry out of the 16th bit is added back to the least significant bit. The checksum is then the one's complement (bitwise NOT) of this sum, ensuring that the sum of the data words and the checksum equals all 1s (0xFFFF) in one's complement arithmetic. Mathematically, if $s$ is the one's complement sum of the 16-bit words, the checksum $c$ is given by $c = \sim(s \bmod 2^{16})$, where $\sim$ denotes bitwise complement. This method detects all single-bit errors and all odd-numbered bit errors, as any such change alters the sum modulo $2^{16}$.

Another common checksum variant is the Longitudinal Redundancy Check (LRC), employed in protocols like IBM's Binary Synchronous Communications (BISYNC). LRC computes an 8-bit value by taking the bitwise XOR of all bytes in the data block, effectively providing a vertical parity across the entire message. This results in a single check byte appended to the block, which the receiver recomputes and compares to detect discrepancies. LRC is particularly suited for character-oriented transmissions, complementing per-character parity checks.

Despite their efficiency, checksums have notable limitations in error detection. They perform poorly against burst errors longer than the checksum size, as error patterns can cancel out during summation or XOR, allowing some bursts of 16 bits or more to go undetected in the Internet Checksum. Additionally, they are vulnerable to systematic errors, such as those from hardware faults that consistently flip specific bits, where the probability of undetected errors approaches $1/2^{16}$ for random 2-bit errors in balanced data patterns. For random independent bit errors in long messages, the Internet Checksum detects over 99.99% of errors, but this drops in adversarial or correlated scenarios. As a more robust alternative for burst-prone channels, polynomial-based methods like CRC offer superior detection.

In practice, the Internet Checksum appears in IPv4 headers and TCP segments, where it verifies the header integrity against transmission errors without requiring retransmission for minor issues. LRC finds use in legacy synchronous protocols for block-level validation. These checksums prioritize computational simplicity, making them ideal for real-time networking despite their constraints.
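
A sketch of the RFC 1071 style one's-complement checksum in Python; the example message is arbitrary, and the verification step assumes the checksum is appended on a 16-bit word boundary, as it is in real protocol headers:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 style 16-bit one's-complement checksum (pads odd-length data with a zero byte)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]        # add 16-bit words
        total = (total & 0xFFFF) + (total >> 16)     # end-around carry
    return ~total & 0xFFFF                           # one's complement of the sum

def verify(data: bytes, checksum: int) -> bool:
    """Recomputing over data plus its checksum should yield zero."""
    return internet_checksum(data + checksum.to_bytes(2, "big")) == 0

message = b"error detection!"                # 16 bytes, so the checksum stays word-aligned
c = internet_checksum(message)
print(verify(message, c))                    # True
print(verify(b"error detektion!", c))        # False: the change is detected
```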

Cyclic Redundancy Checks

Cyclic redundancy checks (CRCs) are a family of error-detecting codes derived from cyclic codes, utilizing polynomial division over the finite field GF(2) to append a fixed-length checksum to data blocks for detecting transmission errors. Invented by W. Wesley Peterson and D. T. Brown, CRCs treat the binary message as coefficients of a polynomial $M(x)$ of degree less than $k$, where $k$ is the message length in bits. To compute the CRC, the message polynomial is shifted left by $m$ bits, equivalent to multiplying by $x^m$, where $m$ is the degree of the generator polynomial $g(x)$. This augmented polynomial is then divided by $g(x)$ using modulo-2 arithmetic, and the remainder $R(x)$, a polynomial of degree less than $m$, serves as the CRC bits appended to the original message. The resulting codeword polynomial $C(x) = M(x) \cdot x^m + R(x)$ is exactly divisible by $g(x)$, ensuring that any received codeword divisible by $g(x)$ is assumed error-free.

A common example is the CRC-16 scheme, which uses the generator polynomial $g(x) = x^{16} + x^{15} + x^2 + 1$. This polynomial enables detection of all burst errors of length less than or equal to 16 bits, as any such error pattern corresponds to a nonzero polynomial of degree at most 15, which cannot be divisible by $g(x)$ unless the error is zero. CRCs exhibit strong detection properties: they detect all single-bit errors due to the nonzero constant term in $g(x)$; all double-bit errors if $g(x)$ has at least two terms; and all odd numbers of bit errors if $x+1$ divides $g(x)$, as is the case for the CRC-16 polynomial since $g(1) = 0 \bmod 2$. Many standard CRC polynomials, including CRC-16 variants, achieve a minimum Hamming distance of 4 for typical message lengths up to several thousand bits, allowing detection of all error patterns with up to three bit flips.

In hardware, CRC computation is efficiently realized using a linear feedback shift register (LFSR) of length $m$, where XOR gates are positioned at taps corresponding to the nonzero coefficients of $g(x)$ (excluding the leading term). The message bits are serially fed into the register, which performs the equivalent of polynomial division through successive shifts and conditional XOR operations, producing the CRC at the end. This shift-register architecture enables high-speed processing in digital circuits, such as those in network interfaces and storage controllers. CRCs are extensively applied in data storage systems and communication protocols to ensure integrity against bursty errors common in physical media.
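
A bit-serial Python sketch of the division described above, using the CRC-16 generator polynomial $x^{16} + x^{15} + x^2 + 1$ (register form 0x8005) with a zero initial register and no bit reflection; real protocol variants often differ in those conventions:

```python
def crc16(data: bytes, poly: int = 0x8005, width: int = 16) -> int:
    """Remainder of the message, shifted left by `width` bits, divided by g(x) over GF(2)."""
    mask = (1 << width) - 1
    reg = 0
    for byte in data:
        for i in range(7, -1, -1):
            msb = (reg >> (width - 1)) & 1       # coefficient about to reach degree `width`
            reg = ((reg << 1) | ((byte >> i) & 1)) & mask
            if msb:
                reg ^= poly                      # subtract g(x) (XOR of its low-order terms)
    for _ in range(width):                       # flush zeros: multiplies the message by x^width
        msb = (reg >> (width - 1)) & 1
        reg = (reg << 1) & mask
        if msb:
            reg ^= poly
    return reg

message = b"123456789"
r = crc16(message)
augmented = message + r.to_bytes(2, "big")       # appending the remainder makes C(x) divisible by g(x)
print(hex(r), crc16(augmented) == 0)             # remainder, True
```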

Hash-Based Methods

Hash-based methods employ cryptographic hash functions to verify data integrity by generating a fixed-size digest from variable-length input, enabling the detection of any alterations during transmission or storage. These functions are designed to be one-way, meaning it is computationally infeasible to reverse the digest to recover the original data, and they exhibit strong collision resistance to prevent finding two different inputs that produce the same output. A prominent example is SHA-256, part of the Secure Hash Algorithm family standardized by NIST, which produces a 256-bit digest and is widely used to detect changes in messages by comparing the computed digest of received data against the expected value. If the digests differ, an error or tampering is indicated. The function's avalanche effect ensures that even a single-bit change in the input typically alters approximately half of the output bits, enhancing its sensitivity to errors. For authenticated error detection, digital signatures combine hashing with public-key cryptography, such as RSA. The sender hashes the message and encrypts the digest using their private key to create the signature, which the receiver verifies by decrypting with the sender's public key and comparing it to a freshly computed hash of the received message, confirming both integrity and origin. This approach, detailed in PKCS#1 specifications, provides non-repudiation alongside detection. Despite their robustness, cryptographic hashes like SHA-256 are often overkill for pure accidental error detection, as their emphasis on collision resistance against adversarial attacks incurs higher computational overhead compared to lightweight precursors like checksums, which suffice for non-malicious scenarios. They excel at detecting any change but do not locate or correct errors. In practice, the Hash-based Message Authentication Code (HMAC), which applies a hash function to the message concatenated with a secret key, is used for integrity verification in secure protocols. For instance, HMAC-SHA256 in TLS 1.2 computes authentication tags on record payloads to flag modifications or replays, ensuring reliable data transmission over networks.
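
A short Python sketch using the standard hashlib and hmac modules to illustrate digest comparison and a keyed HMAC-SHA256 tag; the message and key are placeholders:

```python
import hashlib
import hmac

message = b"transfer 100 units to account 42"

# Plain digest: comparing digests detects any accidental or deliberate change to the message.
digest = hashlib.sha256(message).hexdigest()
print(hashlib.sha256(message).hexdigest() == digest)                               # True
print(hashlib.sha256(b"transfer 900 units to account 42").hexdigest() == digest)   # False

# Keyed hash (HMAC-SHA256): only holders of the secret key can produce a valid tag.
key = b"shared-secret-key"
tag = hmac.new(key, message, hashlib.sha256).digest()
received_tag = hmac.new(key, message, hashlib.sha256).digest()
print(hmac.compare_digest(tag, received_tag))                                      # True
```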

Error Correction Methods

Automatic Repeat Request

Automatic Repeat Request (ARQ) is an error control technique employed in data communication systems to ensure reliable transmission over potentially noisy channels by detecting errors and prompting the retransmission of affected data units through a feedback mechanism. This method relies on acknowledgments (ACKs) from the receiver to confirm successful reception or negative acknowledgments (NACKs) to signal errors, enabling the sender to resend only the problematic frames. ARQ protocols integrate error detection schemes, such as cyclic redundancy checks (CRC) or checksums, to identify corrupted frames at the receiver, which then triggers a NACK or the absence of an ACK, initiating retransmission. These detection methods append redundant bits to the data frame, allowing the receiver to verify integrity without decoding the full content.

The primary variants of ARQ include stop-and-wait, Go-Back-N, and Selective Repeat protocols, each balancing simplicity, efficiency, and complexity differently. In stop-and-wait ARQ, the sender transmits one frame and halts until receiving an ACK or detecting a timeout, ensuring basic reliability but limiting throughput due to idle periods. The throughput efficiency for this protocol is expressed as
$$\eta = \frac{1 - P_e}{1 + 2a},$$
where $P_e$ represents the frame error probability and $a$ is the propagation delay factor, defined as the ratio of the one-way propagation time to the frame transmission time; this formula highlights how errors and delays degrade performance.

Go-Back-N ARQ extends efficiency by allowing the sender to transmit a window of up to N unacknowledged frames continuously. Upon error detection via CRC or checksum, the receiver issues a NACK for the faulty frame, prompting the sender to retransmit that frame and all subsequent ones in the window, discarding correctly received frames after the error to simplify processing. Selective Repeat ARQ optimizes bandwidth usage by permitting the receiver to buffer out-of-sequence frames and request retransmission solely for the erroneous or lost ones, using sequence numbers and ACKs to manage reordering. This approach requires more sophisticated buffering at both ends but minimizes unnecessary retransmissions compared to Go-Back-N.

Notable implementations of ARQ appear in the High-Level Data Link Control (HDLC) protocol, standardized by ISO, which supports both Go-Back-N and Selective Repeat modes for point-to-point and multipoint links, employing CRC-16 or CRC-32 for error detection. Similarly, ARQ mechanisms are integral to IEEE 802.11 Wi-Fi standards, where optional block ACKs enable selective retransmissions of failed MAC protocol data units (MPDUs) to enhance reliability in wireless environments.

Despite their effectiveness, ARQ protocols introduce drawbacks, particularly increased latency on links with substantial propagation delays, as the sender must await feedback before proceeding, amplifying round-trip times in satellite or long-haul networks. Additionally, ARQ mandates a bidirectional feedback channel, which must itself be robust against errors to avoid compounding reliability issues.
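
The stop-and-wait efficiency formula above, evaluated in a small Python sketch with assumed example values of $P_e$ and $a$:

```python
def stop_and_wait_efficiency(p_e: float, a: float) -> float:
    """Throughput efficiency of stop-and-wait ARQ:
    p_e = frame error probability, a = one-way propagation time / frame transmission time."""
    return (1 - p_e) / (1 + 2 * a)

# A long link (large a) hurts stop-and-wait badly even when errors are rare:
print(stop_and_wait_efficiency(p_e=0.01, a=0.1))   # ~0.83
print(stop_and_wait_efficiency(p_e=0.01, a=10.0))  # ~0.05
```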

Forward Error Correction Basics

Forward error correction (FEC) is a technique that enables the receiver to detect and correct errors in transmitted data without requiring retransmission from the sender. The sender encodes the original message by adding redundant information, creating a longer codeword that contains both the original data and parity bits. The receiver then uses this redundancy to identify and repair errors introduced by noisy channels. In FEC, an (n, k) block code is commonly employed, in which k information bits are encoded into an n-bit codeword, with n > k to provide redundancy. The code rate is defined as R = \frac{k}{n} < 1, representing the fraction of useful data in the transmitted stream; lower rates provide more redundancy and thus greater error-correcting capability, at the cost of reduced throughput. Such a code can correct up to t errors per block, where t is determined by the minimum Hamming distance d_{\min} of the code, specifically t = \left\lfloor \frac{d_{\min} - 1}{2} \right\rfloor. Decoding in FEC can be performed using hard-decision or soft-decision methods. Hard-decision decoding treats each received symbol as a discrete decision (e.g., 0 or 1 based on a threshold), which simplifies the decoder but discards reliability information from the channel. In contrast, soft-decision decoding uses probabilistic metrics, such as log-likelihood ratios (LLRs), which quantify the confidence in each received symbol's value, enabling more accurate error correction at the expense of increased computational complexity. The theoretical foundation of FEC is rooted in information theory, particularly the Shannon limit for the binary symmetric channel (BSC). For a BSC with crossover probability p, the channel capacity is C = 1 - H(p), where H(p) = -p \log_2 p - (1-p) \log_2 (1-p) is the binary entropy function. FEC codes can approach this limit, achieving arbitrarily low error rates as long as the code rate R is below the capacity C. A simple illustrative example of FEC is the repetition code, in which each information bit is repeated several times (e.g., 0 encoded as 000, 1 as 111); the receiver decodes by majority vote, correcting errors whenever fewer than half the repetitions are flipped. FEC is particularly advantageous on unidirectional communication links, such as satellite broadcasts, where feedback for retransmission is impractical or impossible.
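A minimal sketch of two ideas from this section: the 3-repetition code with majority-vote decoding, and the BSC capacity C = 1 - H(p). Purely illustrative; the crossover probability used in the example is an arbitrary assumption.

```python
import math

def encode_repetition(bits, r=3):
    """Repeat every information bit r times (e.g. 1 -> 111); rate R = 1/r."""
    return [b for b in bits for _ in range(r)]

def decode_repetition(coded, r=3):
    """Majority vote over each group of r received bits; corrects fewer than r/2 flips per group."""
    return [1 if sum(coded[i:i + r]) * 2 > r else 0 for i in range(0, len(coded), r)]

def bsc_capacity(p):
    """C = 1 - H(p), with H the binary entropy function."""
    if p in (0.0, 1.0):
        return 1.0
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return 1.0 - h

if __name__ == "__main__":
    word = [1, 0, 1, 1]
    tx = encode_repetition(word)          # rate R = 1/3
    tx[1] ^= 1                            # flip one bit inside the first group
    print(decode_repetition(tx) == word)  # True: the single error is corrected
    print(f"BSC capacity at p=0.11: {bsc_capacity(0.11):.3f} bits/use")
```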

Hybrid Schemes

Hybrid schemes integrate automatic repeat request (ARQ) mechanisms with forward error correction (FEC) to enhance reliability and efficiency on error-prone channels, correcting errors where possible and requesting retransmissions only when necessary. This combination mitigates the limitations of pure ARQ, which relies solely on retransmissions, and of pure FEC, which lacks feedback for adaptation. The primary implementation is hybrid ARQ (HARQ), which applies FEC encoding to transmitted packets and uses feedback to trigger retransmissions when decoding fails. HARQ is classified into three types based on the retransmission strategy. Type I applies FEC to the entire packet and, upon error detection via a cyclic redundancy check (CRC), discards the packet and requests a full retransmission of the same encoded version, without combining prior receptions. Type II uses incremental redundancy, in which the initial transmission sends a high-rate code and retransmissions provide additional parity bits that, combined with previous receptions, form a lower-rate code with better decoding prospects. Type III uses self-decodable retransmissions, each of which can be decoded on its own; in the common chase-combining variant the same coded packet is retransmitted and the receiver soft-combines the multiple copies to improve the effective signal-to-noise ratio. These schemes reduce the number of retransmissions compared to pure ARQ by leveraging FEC for initial corrections, thereby improving throughput in time-varying channels such as wireless environments. In LTE, HARQ operates on turbo-coded transport blocks and supports up to four retransmissions per transport block (totaling five transmissions). In 5G New Radio, LDPC codes are employed with configurable HARQ processes (typically up to 16), both code families approaching Shannon capacity limits for enhanced throughput. The added decoding complexity from combining receptions increases computational load but yields lower latency than pure ARQ, since fewer full retransmissions are needed. Performance evaluations show that HARQ can reduce the bit error rate (BER) by factors of up to 10^3 relative to pure ARQ in mobile fading channels, owing to the synergy between error correction and feedback.
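A minimal sketch of the chase-combining idea: identical BPSK retransmissions of a packet are soft-combined by summing the per-bit channel observations before a hard decision. The Gaussian channel model, noise level, and packet length are illustrative assumptions, not drawn from any standard.

```python
import random

def transmit_bpsk(bits, noise_std):
    """Map 0 -> +1, 1 -> -1 and add Gaussian noise: one noisy copy of the packet."""
    return [(1.0 if b == 0 else -1.0) + random.gauss(0.0, noise_std) for b in bits]

def hard_decide(observations):
    """Decide each bit from the sign of the (possibly combined) observation."""
    return [0 if y >= 0 else 1 for y in observations]

def chase_combine(copies):
    """Sum the soft observations of identical retransmissions, bit by bit."""
    return [sum(ys) for ys in zip(*copies)]

if __name__ == "__main__":
    random.seed(1)
    packet = [random.randint(0, 1) for _ in range(2000)]
    copies = [transmit_bpsk(packet, noise_std=1.5) for _ in range(3)]  # 1st tx + 2 retransmissions
    errors_single = sum(a != b for a, b in zip(packet, hard_decide(copies[0])))
    errors_combined = sum(a != b for a, b in zip(packet, hard_decide(chase_combine(copies))))
    print("errors, single copy   :", errors_single)
    print("errors, chase combined:", errors_combined)  # combining typically reduces errors
```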

Notable Error-Correcting Codes

Linear Block Codes

Linear block codes form a fundamental class of error-correcting codes in coding theory, defined as linear subspaces of the vector space \mathbb{F}_q^n over a finite field \mathbb{F}_q with q elements, where n denotes the code length. A linear [n, k, d]_q code consists of q^k codewords, each of length n, and has minimum distance d, enabling correction of up to t = \lfloor (d-1)/2 \rfloor errors. These codes are particularly useful because their linearity allows efficient encoding and decoding via matrix algebra. In systematic form, a linear block code is represented by a generator matrix G of size k \times n whose rows form a basis for the subspace, and a parity-check matrix H of size (n-k) \times n such that every codeword c satisfies c H^T = 0. Encoding a message vector m \in \mathbb{F}_q^k produces the codeword c = mG. For binary codes (q = 2), G often places the k message symbols in the first k positions, with the remaining n-k parity symbols computed to satisfy the parity-check equations. This structure gives a rate of k/n, balancing redundancy against information density. Decoding relies on syndrome computation for error detection and correction. Given a received vector r = c + e, where e is the error vector, the syndrome is s = r H^T. If s = 0, no error is assumed and r is accepted as a codeword. Otherwise, s identifies a coset leader (a low-weight error pattern) via lookup in a syndrome table, allowing correction by subtracting the estimated e from r. This method's efficiency stems from linearity, since the syndrome depends only on the error pattern, not on the transmitted codeword. A prominent example is the Hamming(7,4) code, a binary [7,4,3] linear code that corrects single errors (t = 1). Its parity-check matrix H has columns consisting of all nonzero binary vectors of length 3, ensuring d = 3. Introduced by Richard Hamming to address errors in early computing machinery, this code adds three parity bits to four message bits, with positions numbered in binary for parity grouping. BCH codes extend this framework to multiple-error correction: they are cyclic linear codes constructed using the extension field \mathbb{F}_{2^m}, capable of correcting up to t errors in blocks of length n = 2^m - 1. Built from a primitive polynomial of degree m that defines the field and a generator polynomial whose roots are consecutive powers of a primitive element, BCH codes achieve a designed distance of at least 2t + 1. Independently developed by Hocquenghem and by Bose and Ray-Chaudhuri, these codes are widely used for their flexible block lengths and guaranteed error-correcting capability. Reed-Solomon (RS) codes are another important class of linear block codes, defined over a finite field \mathbb{F}_q with q \geq n, typically q = 2^m, as evaluation codes whose codewords are the evaluations of polynomials of degree less than k at n distinct points. An RS [n, k, d = n - k + 1]_q code can correct up to t = \lfloor (n - k)/2 \rfloor symbol errors, or up to n - k erasures, and is particularly effective against burst errors because each error is confined to a symbol. Encoding involves interpolating a polynomial through the k message symbols and evaluating it at additional points to produce the parity symbols. Decoding uses algorithms such as Berlekamp-Massey for error location and the Forney algorithm for error-magnitude estimation. Introduced by Irving S. Reed and Gustave Solomon in 1960, RS codes are extensively applied in digital storage (e.g., the cross-interleaved Reed-Solomon coding of CDs, built from shortened (32,28) and (28,24) codes over GF(2^8)) and in communications (e.g., QR codes and satellite links).
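A minimal sketch of the binary [7,4,3] Hamming code just described: systematic encoding with a generator matrix and single-error correction by syndrome lookup. The particular G/H pair below is one standard systematic choice; other equivalent forms exist.

```python
import numpy as np

# Systematic generator matrix G = [I_4 | P] and parity-check matrix H = [P^T | I_3] over GF(2)
P = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
G = np.hstack([np.eye(4, dtype=int), P])
H = np.hstack([P.T, np.eye(3, dtype=int)])

def encode(m):
    """Codeword c = m G (mod 2)."""
    return (np.array(m) @ G) % 2

def decode(r):
    """Compute the syndrome s = r H^T; a nonzero syndrome identifies the single flipped position."""
    r = np.array(r).copy()
    s = (r @ H.T) % 2
    if s.any():
        # The syndrome equals the column of H at the error position.
        for i in range(7):
            if np.array_equal(H[:, i], s):
                r[i] ^= 1
                break
    return r[:4]  # the message sits in the first four (systematic) positions

if __name__ == "__main__":
    m = [1, 0, 1, 1]
    c = encode(m)
    c[5] ^= 1                       # introduce a single-bit error
    print(decode(c).tolist() == m)  # True
```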

Convolutional Codes

Convolutional codes constitute a class of error-correcting codes tailored to sequential data streams, where encoding occurs continuously rather than in fixed blocks. Introduced by Peter Elias in 1955 as a method of approaching channel capacity with structured codes, they generate redundant bits through a convolutional process that mixes current and past input symbols. Unlike block codes, convolutional codes process information in a streaming fashion, making them well suited to real-time applications such as wireless communications and satellite links. The structure of a convolutional encoder is built around a shift register whose connections to modulo-2 adders are defined by generator polynomials over GF(2). The key parameters are the constraint length K, the number of stages in the shift register (including the current input bit), which determines the memory depth and the number of possible states 2^{K-1}, and the code rate k/n, indicating that k input bits yield n output bits per cycle, typically with k = 1 for simplicity. Encoding proceeds by shifting input bits through the register and computing each output bit as the modulo-2 sum of selected register contents, as specified by the generators; for instance, a rate 1/2 code with K = 3 might use the generators g_0 = 111_2 and g_1 = 101_2, producing two output bits per input. This process forms a semi-infinite codeword, often terminated by flushing the register with zeros to return it to the zero state. The trellis diagram depicts the encoding as a time-indexed graph of states, with branches labeled by input-output pairs, enabling visualization of all possible code sequences. Decoding convolutional codes employs the Viterbi algorithm, developed by Andrew Viterbi in 1967, which performs maximum-likelihood sequence estimation via dynamic programming on the trellis. At each time step, the algorithm computes branch metrics (Hamming distance for hard-decision decoding, Euclidean distance for soft-decision variants) between received symbols and the possible branch outputs, then retains the lowest-metric survivor path into each state, pruning suboptimal paths. The complexity scales as O(2^K) operations per input bit, dominated by the exponential growth in states, so practical implementations limit K to around 7-9. Soft decoding, which incorporates bit reliability from the channel, can improve performance by 2-3 dB and extends naturally into more advanced concatenated schemes. To achieve higher rates without redesigning the encoder, puncturing deletes specific output bits according to a predefined puncturing pattern, effectively raising the rate of a low-rate mother code while preserving decoding compatibility via the full trellis. This technique, formalized by Cain, Clark, and Geist in 1979, allows rates such as 2/3 or 7/8 to be derived from a rate 1/2 base code (for example, deleting one output bit in every four yields rate 2/3) and maintains near-optimal performance for moderate puncturing densities. Prominent applications include the rate 1/2, constraint length K = 7 code adopted for NASA's Voyager program in 1977, concatenated with Reed-Solomon codes for deep-space telemetry and decoded with the Viterbi algorithm to yield about 3.5 dB of coding gain at a bit error rate of 10^{-5}. In cellular networks, the Global System for Mobile Communications (GSM) uses a rate 1/2, K = 5 convolutional code with generators G_0 = 1 + D^3 + D^4 and G_1 = 1 + D + D^3 + D^4 for full-rate speech channel coding: the 189 protected bits (182 Class 1 bits, 3 parity bits, and 4 tail bits) are encoded into 378 bits, to which the 78 unprotected Class 2 bits are appended for a total of 456 bits.
These examples underscore the codes' robustness in bandwidth-constrained, error-prone environments.
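A minimal sketch of the rate 1/2, K = 3 code with generators g_0 = 111 and g_1 = 101 mentioned above, together with hard-decision Viterbi decoding over its 4-state trellis. Illustrative only; practical decoders add soft metrics, traceback windows, and puncturing support.

```python
G0, G1 = 0b111, 0b101        # generator taps over the 3-bit register [current | memory]
K = 3                        # constraint length -> 2**(K-1) = 4 trellis states

def parity(x: int) -> int:
    return bin(x).count("1") % 2

def conv_encode(bits):
    """Shift each bit through the register and emit two parity outputs per input bit."""
    state, out = 0, []
    for b in bits + [0] * (K - 1):           # flush with zeros to end in state 0
        reg = (b << (K - 1)) | state
        out += [parity(reg & G0), parity(reg & G1)]
        state = reg >> 1                     # new memory = [current, previous]
    return out

def viterbi_decode(received):
    """Maximum-likelihood sequence estimation with Hamming-distance branch metrics."""
    n_states = 1 << (K - 1)
    INF = float("inf")
    metrics = [0.0] + [INF] * (n_states - 1)     # start in the all-zero state
    paths = [[] for _ in range(n_states)]
    for t in range(0, len(received), 2):
        r0, r1 = received[t], received[t + 1]
        new_metrics = [INF] * n_states
        new_paths = [[] for _ in range(n_states)]
        for state in range(n_states):
            if metrics[state] == INF:
                continue
            for b in (0, 1):                     # hypothesised input bit
                reg = (b << (K - 1)) | state
                branch = (parity(reg & G0) != r0) + (parity(reg & G1) != r1)
                nxt = reg >> 1
                if metrics[state] + branch < new_metrics[nxt]:
                    new_metrics[nxt] = metrics[state] + branch
                    new_paths[nxt] = paths[state] + [b]
        metrics, paths = new_metrics, new_paths
    return paths[0][:-(K - 1)]                   # terminated in state 0; drop the flush bits

if __name__ == "__main__":
    msg = [1, 0, 1, 1, 0, 0, 1]
    coded = conv_encode(msg)
    coded[3] ^= 1                                # single channel bit error
    print(viterbi_decode(coded) == msg)          # True: the error is corrected
```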

Advanced Codes

Advanced codes represent a class of error-correcting codes designed to approach the theoretical limits of channel capacity, particularly the Shannon limit, through sophisticated constructions and decoding algorithms. These codes, including turbo, low-density parity-check (LDPC), and polar codes, leverage iterative processing and graphical representations to achieve near-optimal performance in practical systems. Unlike earlier linear block or convolutional codes, advanced codes emphasize sparsity, concatenation, and polarization to minimize decoding complexity while maximizing error-correction capability. Turbo codes consist of parallel concatenated convolutional codes, in which two or more systematic convolutional encoders process the input data, separated by an interleaver that randomizes the sequence and decorrelates errors between the encoders. This structure produces parity bits from both encoders, enabling iterative decoding that exchanges soft information between component decoders to refine estimates of the transmitted bits. The interleaver plays a critical role in preventing burst errors from propagating across iterations, allowing the decoder to converge on highly reliable decisions after several passes. Turbo codes were introduced by Berrou, Glavieux, and Thitimajshima in 1993 as a breakthrough in achieving performance close to the Shannon limit with feasible computational resources. LDPC codes are defined by a sparse parity-check matrix whose low density of nonzero entries (typically three to six per column) forms a Tanner graph that models the code constraints. Decoding relies on iterative message passing, specifically the sum-product (belief propagation) algorithm, which passes messages along the graph edges to compute marginal probabilities for each bit, updating beliefs until convergence or a maximum iteration count is reached. This graphical approach exploits the sparsity to keep decoding complexity roughly linear in block length, making LDPC codes scalable to high-rate applications. The codes originated in Gallager's early work on probabilistic decoding in the 1960s but gained prominence with optimized irregular degree distributions that further improve performance. Polar codes exploit the phenomenon of channel polarization, in which recursive application of a basic kernel (such as the 2x2 Arikan kernel) transforms a set of identical noisy channels into a subset of nearly perfect channels and a complementary subset of nearly useless ones. Information bits are placed on the reliable (polarized) channels, while frozen bits (known constants) occupy the unreliable ones, enabling capacity-achieving performance in the asymptotic limit. The standard successive cancellation (SC) decoder processes bits sequentially, using decisions on earlier bits to estimate likelihoods for later ones; its performance at finite lengths can be improved by SC list decoding, in which the decoder maintains a list of candidate paths during the successive process and selects the most likely one at the end. Polar codes were proven by Erdal Arikan in 2009 to attain the symmetric capacity of binary-input discrete memoryless channels. These advanced codes demonstrate remarkable performance, operating within 0.5 dB of the Shannon limit at a bit error rate (BER) of 10^{-5} for typical rates around 1/2 to 8/9, as validated through simulations and theoretical bounds. This proximity to the theoretical limit underscores their efficiency in bandwidth-constrained environments.
In practical deployments, LDPC codes form a mandatory component of the IEEE 802.11ax (Wi-Fi 6) standard for high-throughput wireless local area networks, providing robust error correction in multipath fading channels. Similarly, polar codes are specified by 3GPP for the control channels of 5G New Radio (NR), handling short payloads with low latency requirements.
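A minimal sketch of polar-code encoding: the transform x = u F^{\otimes n} (mod 2), where F is the 2x2 Arikan kernel. Frozen-bit selection tables and successive-cancellation decoding are omitted; the frozen set used in the toy example is a placeholder assumption, not a standardized reliability ordering.

```python
import numpy as np

F = np.array([[1, 0],
              [1, 1]])          # 2x2 Arikan kernel

def kernel_power(n):
    """Kronecker power F^{(x)n}, giving the N x N polar transform for N = 2^n."""
    G = np.array([[1]])
    for _ in range(n):
        G = np.kron(G, F)
    return G

def polar_encode(info_bits, frozen_positions, n):
    """Place information bits on the non-frozen inputs, zeros elsewhere, then apply the transform."""
    N = 1 << n
    u = np.zeros(N, dtype=int)
    data_positions = [i for i in range(N) if i not in frozen_positions]
    u[data_positions] = info_bits
    return (u @ kernel_power(n)) % 2

if __name__ == "__main__":
    # Toy (8,4) polar code: freeze four inputs assumed to be the least reliable (placeholder choice).
    frozen = {0, 1, 2, 4}
    x = polar_encode(np.array([1, 0, 1, 1]), frozen, n=3)
    print(x.tolist())   # the resulting 8-bit codeword
```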

Applications

Telecommunications

In telecommunications, error detection and correction techniques are essential for maintaining reliable data delivery across diverse transmission media, including copper, radio, and optical channels, where noise, interference, fading, and attenuation can degrade signal quality. These methods employ both detection mechanisms, such as checksums and cyclic redundancy checks, and correction strategies, including forward error correction (FEC) codes and automatic repeat request (ARQ) protocols, to achieve the low bit error rates (BER) required by high-reliability voice, video, and data services. The choice of technique depends on channel characteristics, latency constraints, and bandwidth availability, with hybrid approaches often combining detection and correction for optimal performance. In deep-space communications, concatenated coding schemes using Reed-Solomon (RS) outer codes and convolutional inner codes form the basis of the Consultative Committee for Space Data Systems (CCSDS) standard for telemetry. This standard specifies a constraint length-7, rate-1/2 convolutional code paired with a (255,223) RS code over GF(2^8), which corrects up to 16 symbol errors per block to combat the burst errors and low signal-to-noise ratios inherent in long-distance space links. The Voyager spacecraft missions exemplified this approach, employing the (255,223) RS code concatenated with a rate-1/2 convolutional code to achieve a post-decoding BER as low as 10^{-5}, enabling reliable transmission of scientific data from billions of kilometers away despite weak signals. Satellite and broadcast television systems leverage low-density parity-check (LDPC) codes for efficient FEC in standards such as DVB-S2, which supports code rates from 1/4 to 9/10 to balance error-correction capability against spectral efficiency. In DVB-S2, LDPC inner codes are concatenated with BCH outer codes, including for modulation schemes such as QPSK, allowing robust performance in noisy satellite channels subject to fading and interference. This configuration enables high-throughput video distribution while maintaining low BER, with code rates adjusted to link conditions to support services from direct-to-home broadcasting to mobile satellite communications. Internet protocols integrate error control at the transport layer, where TCP employs ARQ with mandatory checksums to detect and retransmit corrupted segments, ensuring reliable end-to-end delivery over potentially lossy networks. In contrast, UDP provides only an optional checksum for basic error detection but supports FEC extensions, such as parity packets in RTP-based schemes, for real-time applications like streaming where low latency precludes full ARQ. These mechanisms complement lower-layer error handling in IP networks, adapting to varying channel qualities on wired and wireless links. Modern cellular networks, including 5G New Radio (NR), use polar codes for control channels and LDPC codes for data channels to provide scalable FEC tailored to enhanced mobile broadband requirements. Polar codes offer low-complexity decoding for the short blocks used in control signaling, while LDPC codes support high-throughput data transmission with flexible rates and block lengths. 5G NR further enhances reliability through hybrid ARQ (HARQ) with up to 16 processes per UE, enabling incremental-redundancy retransmissions that adaptively correct errors without excessive latency. Optical fiber systems for long-haul and transatlantic links rely on FEC to mitigate accumulated noise and dispersion over thousands of kilometers, typically achieving a post-FEC BER of 10^{-15} with approximately 7% overhead.
Standards such as the RS(255,239) code used in optical transport networks (ITU-T G.709) add redundancy that corrects pre-FEC bit error rates on the order of 10^{-4}, extending reach and capacity for high-speed transport such as 100G links over dense wavelength-division multiplexing (DWDM). This modest overhead enables effectively error-free performance across global undersea cables, supporting the backbone of international connectivity.
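A minimal sketch of the code-parameter arithmetic behind the Reed-Solomon codes cited above: an (n, k) RS code corrects t = (n - k)/2 symbol errors, adds (n - k)/k overhead, and has rate k/n. The labels are taken from the text; the numbers follow directly from the parameters.

```python
def rs_parameters(n: int, k: int):
    t = (n - k) // 2         # correctable symbol errors per block
    overhead = (n - k) / k   # redundancy relative to the payload
    rate = k / n
    return t, overhead, rate

for n, k, label in [(255, 223, "CCSDS deep-space outer code"),
                    (255, 239, "optical transport (OTN) FEC")]:
    t, overhead, rate = rs_parameters(n, k)
    print(f"RS({n},{k}) [{label}]: t = {t} symbols, "
          f"overhead = {overhead:.1%}, rate = {rate:.3f}")
# RS(255,223): t = 16, overhead = 14.3%, rate = 0.875
# RS(255,239): t = 8,  overhead = 6.7%,  rate = 0.937
```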

Data Storage

Error detection and correction are essential in data storage systems to protect data integrity against physical imperfections, environmental factors, and wear over time. In magnetic, optical, and solid-state devices, specialized coding schemes mitigate bit errors that arise during read/write operations or from media degradation. These techniques typically combine detection mechanisms, such as cyclic redundancy checks (CRC) in sector headers for initial error flagging, with correction codes that recover data without user intervention. In hard disk drives (HDDs), run-length limited (RLL) codes constrain the encoding of data to optimize magnetic recording density while aiding error detection by limiting consecutive zero transitions, and are often paired with low-density parity-check (LDPC) codes for robust correction. Error-correcting codes (ECC) in HDDs are designed to detect and correct sector errors, typically handling up to 100 bits per 512-byte sector through on-the-fly processing by the drive controller. This capability addresses random bit flips from magnetic interference or head misalignment, ensuring reliable retrieval even as areal densities exceed 1 Tb/in². Optical storage media, such as compact discs (CDs) and digital versatile discs (DVDs), employ cross-interleaved Reed-Solomon coding (CIRC) to combat burst errors caused by scratches or fingerprints on the disc surface. The CIRC scheme uses two layers of shortened Reed-Solomon codes with deliberate interleaving to spread errors across frames, enabling correction of bursts up to approximately 3.5 mm (about 3,500 bits) while detecting longer ones for interpolation or concealment during audio and video playback. This design, standardized for CD-DA and extended in DVD formats, reduces the raw bit error rate to below 10^{-12} after correction, prioritizing playability over perfect fidelity. NAND flash memory in solid-state drives (SSDs) relies on Bose-Chaudhuri-Hocquenghem (BCH) or LDPC codes for page-level error correction, addressing cell-to-cell interference and charge leakage that increase with scaling to multi-level cells (MLC, TLC, QLC). BCH codes offer guaranteed correction of up to 40-60 bits per 1-4 KB page in enterprise TLC NAND, while LDPC provides superior performance in high-noise scenarios through iterative soft-decision decoding. Wear-leveling algorithms integrate error tracking by monitoring program/erase (P/E) cycle counts and raw bit error rates (RBER) per block, dynamically remapping data to underused regions to balance wear and preempt uncorrectable errors. Redundant array of independent disks (RAID) configurations enhance reliability at the system level by computing parity across multiple disks, enabling detection and correction of entire drive failures without data loss. In RAID-5, for instance, a single parity stripe per block group allows reconstruction of the data on a failed disk using XOR operations over the surviving drives, tolerating one failure while distributing the parity overhead evenly. RAID-6 variants extend this to two failures via dual parity, which is crucial for large-scale arrays where annual drive failure rates approach 1-2%. In the 2020s, NVMe SSDs have adopted soft-decision LDPC decoding to achieve very high reliability, targeting an uncorrectable bit error rate (UBER) of 10^{-16} or lower in enterprise environments. This trend supports exabyte-scale deployments in data centers, where LDPC's ability to leverage multiple read-voltage thresholds corrects over 100 bits per sector amid 3D NAND's rising error floors, often combined with host-side end-to-end data protection.
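A minimal sketch of the RAID-5 parity idea described above: the parity block is the XOR of the data blocks in a stripe, so any single missing block can be rebuilt by XOR-ing the survivors. The block contents are illustrative byte strings; real arrays rotate parity across drives and operate on large sectors.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def make_parity(data_blocks):
    """Parity block for one stripe."""
    return xor_blocks(data_blocks)

def rebuild(surviving_blocks):
    """Recover the lost block (data or parity) from all remaining blocks of the stripe."""
    return xor_blocks(surviving_blocks)

if __name__ == "__main__":
    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"   # one stripe across three data disks
    p = make_parity([d0, d1, d2])            # parity disk
    lost = d1                                 # simulate a failed drive
    recovered = rebuild([d0, d2, p])          # XOR of the survivors
    print(recovered == lost)                  # True
```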

Computing Hardware

In computing hardware, error detection and correction mechanisms are essential for maintaining data integrity in components such as RAM and processor caches, where transient faults known as soft errors can occur due to environmental factors including cosmic rays. These soft errors, which flip individual bits in memory cells, arise primarily from high-energy particles striking the silicon, with an estimated rate of approximately 10^{-12} errors per bit per hour (1000-5000 FIT per Mbit) in DRAM under terrestrial conditions. Early motivations for such techniques trace back to the unreliability of relay- and vacuum-tube-based computers of the late 1940s, which prompted Richard Hamming to develop error-correcting codes that automate error handling during routine operations. Error-correcting code (ECC) memory, commonly used in server-grade systems, employs single-error-correction double-error-detection (SECDED) schemes derived from Hamming codes to protect each 64-bit data word with 8 additional check bits, enabling correction of single-bit errors and detection of double-bit errors within the word. For instance, DDR4 ECC modules integrate these codes at the module level to safeguard against corruption during high-speed data transfers. In contrast, consumer-grade RAM typically lacks full ECC; some configurations incorporate basic parity checking for error detection, though modern non-ECC modules often forgo even this to prioritize cost and density. Within CPUs, cache hierarchies use simpler parity bits on tag fields to detect errors in address mapping, ensuring that faulty cache lookups trigger misses rather than silent data propagation. Graphics processing units (GPUs) and specialized accelerators for AI training demand higher-throughput error correction to handle massive parallel computations, and low-density parity-check (LDPC) codes are increasingly applied for their efficiency in correcting multiple bit errors at scale. NVIDIA's H100 GPU, for example, incorporates ECC on its HBM3 memory to mitigate soft errors during intensive AI workloads, supporting reliable training of large models without frequent interruptions. Server environments benefit significantly from ECC, as studies indicate that memory errors account for a large share (around half) of hardware-induced failures; ECC implementation reduces such incidents by automatically correcting transient faults, thereby improving overall system availability. As of 2025, advancements in chiplet-based architectures on 3 nm processes, such as AMD's CDNA 4 for AI accelerators, integrate on-die ECC directly into compute and I/O chiplets to address the increased error susceptibility of denser transistors and multi-die interconnects, ensuring robust error handling across heterogeneous designs.
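A minimal sketch of the SECDED idea, using an extended Hamming (8,4) code (a Hamming(7,4) codeword plus one overall parity bit) rather than the (72,64) code of real ECC DIMMs: single-bit errors are corrected, double-bit errors are detected but flagged as uncorrectable.

```python
import numpy as np

P = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]])
G = np.hstack([np.eye(4, dtype=int), P])      # systematic [7,4] generator matrix
H = np.hstack([P.T, np.eye(3, dtype=int)])    # [7,4] parity-check matrix

def encode_secded(m):
    """Hamming(7,4) codeword plus an overall parity bit -> 8 bits with even total parity."""
    c = (np.array(m) @ G) % 2
    return np.append(c, c.sum() % 2)

def decode_secded(r):
    """Return (data, status) where status is 'ok', 'corrected', or 'double-error'."""
    r = np.array(r).copy()
    s = (r[:7] @ H.T) % 2
    overall = r.sum() % 2                      # 0 if total parity is still even
    if not s.any():
        return r[:4], ("ok" if overall == 0 else "corrected")  # error only in the parity bit
    if overall == 1:                           # nonzero syndrome + odd parity: single error
        for i in range(7):
            if np.array_equal(H[:, i], s):
                r[i] ^= 1
                return r[:4], "corrected"
    return r[:4], "double-error"               # nonzero syndrome + even parity: two errors

if __name__ == "__main__":
    word = [1, 0, 1, 1]
    c = encode_secded(word)
    c1 = c.copy(); c1[2] ^= 1                  # single error  -> corrected
    c2 = c.copy(); c2[1] ^= 1; c2[6] ^= 1      # double error  -> detected, not corrected
    print(decode_secded(c1))
    print(decode_secded(c2))
```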
