Information theory

from Wikipedia

Information theory is the mathematical study of the quantification, storage, and communication of information. The field was established and formalized by Claude Shannon in the 1940s,[1] though early contributions were made in the 1920s through the works of Harry Nyquist and Ralph Hartley. It lies at the intersection of electrical engineering, mathematics, statistics, computer science, neurobiology, and physics.[2][3]

A key measure in information theory is entropy. Entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process. For example, identifying the outcome of a fair coin flip (which has two equally likely outcomes) provides less information (lower entropy, less uncertainty) than identifying the outcome from a roll of a die (which has six equally likely outcomes). Some other important measures in information theory are mutual information, channel capacity, error exponents, and relative entropy. Important sub-fields of information theory include source coding, algorithmic complexity theory, algorithmic information theory and information-theoretic security.

Applications of fundamental topics of information theory include source coding/data compression (e.g. for ZIP files), and channel coding/error detection and correction (e.g. for DSL). Its impact has been crucial to the success of the Voyager missions to deep space,[4] the invention of the compact disc, the feasibility of mobile phones and the development of the Internet and artificial intelligence.[5][6][3] The theory has also found applications in other areas, including statistical inference,[7] cryptography, neurobiology,[8] perception,[9] signal processing,[2] linguistics, the evolution[10] and function[11] of molecular codes (bioinformatics), thermal physics,[12] molecular dynamics,[13] black holes, quantum computing, information retrieval, intelligence gathering, plagiarism detection,[14] pattern recognition, anomaly detection,[15] the analysis of music,[16][17] art creation,[18] imaging system design,[19] study of outer space,[20] the dimensionality of space,[21] and epistemology.[22]

Overview


Information theory studies the transmission, processing, extraction, and utilization of information. Abstractly, information can be thought of as the resolution of uncertainty. In the case of communication of information over a noisy channel, this abstract concept was formalized in 1948 by Claude Shannon in a paper entitled A Mathematical Theory of Communication, in which information is thought of as a set of possible messages, and the goal is to send these messages over a noisy channel, and to have the receiver reconstruct the message with low probability of error, in spite of the channel noise. Shannon's main result, the noisy-channel coding theorem, showed that, in the limit of many channel uses, the rate of information that is asymptotically achievable is equal to the channel capacity, a quantity dependent merely on the statistics of the channel over which the messages are sent.[8]

Coding theory is concerned with finding explicit methods, called codes, for increasing the efficiency and reducing the error rate of data communication over noisy channels to near the channel capacity. These codes can be roughly subdivided into data compression (source coding) and error-correction (channel coding) techniques. In the latter case, it took many years to find the methods Shannon's work proved were possible.[23][24]

A third class of information theory codes are cryptographic algorithms (both codes and ciphers). Concepts, methods and results from coding theory and information theory are widely used in cryptography and cryptanalysis,[25] such as the unit ban.

Historical background


The landmark event establishing the discipline of information theory and bringing it to immediate worldwide attention was the publication of Claude E. Shannon's classic paper "A Mathematical Theory of Communication" in the Bell System Technical Journal in July and October 1948. Historian James Gleick rated the paper as the most important development of 1948, noting that it was "even more profound and more fundamental" than the transistor.[26] Shannon came to be known as the "father of information theory".[27][28][29] He outlined some of his initial ideas of information theory as early as 1939 in a letter to Vannevar Bush.[29]

Prior to this paper, limited information-theoretic ideas had been developed at Bell Labs, all implicitly assuming events of equal probability. Harry Nyquist's 1924 paper, Certain Factors Affecting Telegraph Speed, contains a theoretical section quantifying "intelligence" and the "line speed" at which it can be transmitted by a communication system, giving the relation W = K log m (recalling the Boltzmann constant), where W is the speed of transmission of intelligence, m is the number of different voltage levels to choose from at each time step, and K is a constant. Ralph Hartley's 1928 paper, Transmission of Information, uses the word information as a measurable quantity, reflecting the receiver's ability to distinguish one sequence of symbols from any other, thus quantifying information as H = log S^n = n log S, where S was the number of possible symbols, and n the number of symbols in a transmission. The unit of information was therefore the decimal digit, which has since sometimes been called the hartley in his honor as a unit of information. Alan Turing in 1940 used similar ideas as part of the statistical analysis of the breaking of the German Enigma ciphers during the Second World War.[citation needed]

Much of the mathematics behind information theory with events of different probabilities was developed for the field of thermodynamics by Ludwig Boltzmann and J. Willard Gibbs. Connections between information-theoretic entropy and thermodynamic entropy, including the important contributions by Rolf Landauer in the 1960s, are explored in Entropy in thermodynamics and information theory.[citation needed]

In his revolutionary paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with the assertion:[30]

"The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point."

With it came the ideas of the information entropy and redundancy of a source (and their relevance through the source coding theorem), the mutual information and capacity of a noisy channel (including the promise of perfect loss-free communication given by the noisy-channel coding theorem), the practical result of the Shannon–Hartley law for the channel capacity of a Gaussian channel, and the bit as a new way of seeing the most fundamental unit of information.

Quantities of information


Information theory is based on probability theory and statistics, where quantified information is usually described in terms of bits. Information theory often concerns itself with measures of information of the distributions associated with random variables. One of the most important measures is called entropy, which forms the building block of many other measures. Entropy quantifies the amount of information contained in a single random variable.[31]

Another useful concept is mutual information, defined on two random variables, which describes the amount of information the variables have in common and can be used to describe their correlation. The former quantity (entropy) is a property of the probability distribution of a random variable and gives a limit on the rate at which data generated by independent samples with the given distribution can be reliably compressed. The latter (mutual information) is a property of the joint distribution of two random variables, and is the maximum rate of reliable communication across a noisy channel in the limit of long block lengths, when the channel statistics are determined by the joint distribution.

The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. A common unit of information is the bit or shannon, based on the binary logarithm. Other units include the nat, which is based on the natural logarithm, and the decimal digit, which is based on the common logarithm.

In what follows, an expression of the form p log p is considered by convention to be equal to zero whenever p = 0. This is justified because lim_{p→0+} p log p = 0 for any logarithmic base.

Entropy of an information source


Based on the probability mass function of a source, the Shannon entropy H, in units of bits per symbol, is defined as the expected value of the information content of the symbols.[32][33]

The amount of information conveyed by an individual source symbol with probability p_i is known as its self-information or surprisal, I(p_i). This quantity is defined as:[34][35]

I(p_i) = −log2 p_i = log2(1/p_i)

A less probable symbol has a larger surprisal, meaning its occurrence provides more information.[34] The entropy is the weighted average of the surprisal of all possible symbols from the source's probability distribution:[36][37]

H = −Σ_i p_i log2 p_i

Intuitively, the entropy of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X when only its distribution is known.[32] A high entropy indicates the outcomes are more evenly distributed, making the result harder to predict.[38]

For example, if one transmits 1000 bits (0s and 1s), and the value of each of these bits is known to the receiver (has a specific value with certainty) ahead of transmission, no information is transmitted. If, however, each bit is independently and equally likely to be 0 or 1, 1000 shannons of information (more often called bits) have been transmitted.[39]
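The definitions above can be sketched in a few lines of Python; the function names are illustrative, not from any particular library:

```python
import math

def surprisal(p: float) -> float:
    """Self-information -log2(p), in bits, of an outcome with probability p."""
    return -math.log2(p)

def entropy(probs) -> float:
    """Shannon entropy: the probability-weighted average surprisal.

    Terms with p == 0 are skipped, following the convention 0 log 0 = 0.
    """
    return sum(p * surprisal(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])  # 1 bit per flip
fair_die = entropy([1/6] * 6)    # log2(6) ≈ 2.585 bits per roll
```

A certain outcome (p = 1) has zero surprisal, so 1000 bits whose values the receiver already knows carry no information, matching the transmission example above.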

The entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function, Hb(p). The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss.

Properties


A key property of entropy is that it is maximized when all the messages in the message space are equiprobable. For a source with n possible symbols, where p_i = 1/n for all i, the entropy is given by:[40]

H = log2 n

This maximum value represents the most unpredictable state.[36]

For a source that emits a sequence of N symbols that are independent and identically distributed (i.i.d.), the total entropy of the message is N·H bits. If the source data symbols are identically distributed but not independent, the entropy of a message of length N will be less than N·H.[41][42]

Units


The choice of the logarithmic base in the entropy formula determines the unit of entropy used:[34][36]

  • A base-2 logarithm (as shown in the main formula) measures entropy in bits per symbol. This unit is also sometimes called the shannon in honor of Claude Shannon.[32]
  • A natural logarithm (base e) measures entropy in nats per symbol. This is often used in theoretical analysis as it avoids the need for scaling constants (like ln 2) in derivations.[43]
  • Other bases are also possible. A base-10 logarithm measures entropy in decimal digits, or hartleys, per symbol.[33] A base-256 logarithm measures entropy in bytes per symbol, since 2^8 = 256.[44]
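Because log_b(x) = log2(x) / log2(b), the units above differ only by constant factors. A small sketch (the function name is illustrative, standard library only):

```python
import math

def entropy_base(probs, base: float) -> float:
    """Entropy of a distribution, using logarithms of the given base."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

probs = [0.5, 0.25, 0.25]
bits = entropy_base(probs, 2)        # 1.5 shannons (bits)
nats = entropy_base(probs, math.e)   # = bits * ln 2, in nats
harts = entropy_base(probs, 10)      # = bits * log10(2), in hartleys
```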

Binary Entropy Function


The special case of information entropy for a random variable with two outcomes (a Bernoulli trial) is the binary entropy function. This is typically calculated using a base-2 logarithm, and its unit is the shannon.[45] If one outcome has probability p, the other has probability 1 − p. The entropy is given by:[46]

Hb(p) = −p log2 p − (1 − p) log2(1 − p)

This function is depicted in the plot shown above, reaching its maximum of 1 bit when p = 0.5, corresponding to the highest uncertainty.
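A minimal sketch of this function, with the 0 log 0 = 0 convention handled explicitly at the endpoints:

```python
import math

def binary_entropy(p: float) -> float:
    """Hb(p) = -p log2 p - (1 - p) log2 (1 - p), with 0 log 0 taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

The maximum Hb(0.5) = 1 bit and the symmetry Hb(p) = Hb(1 − p) match the shape of the plot described above.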

Joint entropy


The joint entropy of two discrete random variables X and Y is merely the entropy of their pairing: (X, Y). This implies that if X and Y are independent, then their joint entropy is the sum of their individual entropies.

For example, if (X, Y) represents the position of a chess piece—X the row and Y the column, then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece.

Despite similar notation, joint entropy should not be confused with cross-entropy.
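The chess example can be checked numerically: since the row and column of a uniformly random square are independent and uniform over 8 values each, the joint entropy is 3 + 3 = 6 bits (an illustrative sketch):

```python
import math
from itertools import product

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Row and column of a uniformly random chess square, 3 bits apiece:
row = [1 / 8] * 8
col = [1 / 8] * 8
joint = [r * c for r, c in product(row, col)]  # uniform over the 64 squares

# H(X, Y) = H(X) + H(Y) for independent X, Y: here 3 + 3 = 6 bits.
```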

Conditional entropy (equivocation)


The conditional entropy or conditional uncertainty of X given random variable Y (also called the equivocation of X about Y) is the average conditional entropy over Y:[47]

H(X|Y) = Σ_y p(y) H(X|Y = y) = −Σ_{x,y} p(x, y) log2 p(x|y)

Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that:

H(X|Y) = H(X, Y) − H(Y)
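The chain-rule property H(X|Y) = H(X, Y) − H(Y) can be checked on a small joint distribution; the numbers below are hypothetical, chosen only to make the arithmetic visible:

```python
import math

def H(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) over x, y in {0, 1}:
pxy = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}

# Marginal p(y):
py = {}
for (x, y), p in pxy.items():
    py[y] = py.get(y, 0.0) + p

h_x_given_y = H(pxy.values()) - H(py.values())  # H(X, Y) - H(Y)
```

Here H(X, Y) = 1.5 bits and H(Y) = 1 bit, so H(X|Y) = 0.5 bits: once Y is known, X is either certain (Y = 0) or a fair coin (Y = 1).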

Mutual information (transinformation)


Mutual information measures the amount of information that can be obtained about one random variable by observing another. It is important in communication, where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of X relative to Y is given by:

I(X; Y) = Σ_{x,y} p(x, y) log2 [p(x, y) / (p(x) p(y))]

where SI (Specific mutual Information) is the pointwise mutual information.

A basic property of the mutual information is that:

I(X; Y) = H(X) − H(X|Y)

That is, knowing Y, we can save an average of I(X; Y) bits in encoding X compared to not knowing Y.

Mutual information is symmetric:

I(X; Y) = I(Y; X) = H(X) + H(Y) − H(X, Y)

Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the posterior probability distribution of X given the value of Y and the prior distribution on X:

I(X; Y) = E_{p(y)} [D_KL(p(X|Y = y) ‖ p(X))]

In other words, this is a measure of how much, on average, the probability distribution on X will change if we are given the value of Y. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:

I(X; Y) = D_KL(p(X, Y) ‖ p(X) p(Y))

Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution and to Pearson's χ2 test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.
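A sketch of the definition I(X; Y) = Σ p(x, y) log2[p(x, y) / (p(x) p(y))] on a small joint distribution; the helper name and the example numbers are illustrative:

```python
import math

def mutual_information(pxy):
    """I(X; Y) in bits, from a dict mapping (x, y) pairs to joint probabilities."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

# Hypothetical channel: X is a fair bit, and Y agrees with X 75% of the time.
pxy = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}
mi = mutual_information(pxy)  # about 0.189 bits shared per symbol
```

Symmetry holds by construction: swapping the roles of x and y leaves the sum unchanged.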

Kullback–Leibler divergence (information gain)


The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions: a "true" probability distribution p(X), and an arbitrary probability distribution q(X). If we compress data in a manner that assumes q(X) is the distribution underlying some data, when, in reality, p(X) is the correct distribution, the Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression. It is thus defined:

D_KL(p(X) ‖ q(X)) = Σ_x p(x) log2 [p(x) / q(x)]

Although it is sometimes used as a 'distance metric', KL divergence is not a true metric since it is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric).

Another interpretation of the KL divergence is the "unnecessary surprise" introduced by a prior from the truth: suppose a number X is about to be drawn randomly from a discrete set with probability distribution p(x). If Alice knows the true distribution p(x), while Bob believes (has a prior) that the distribution is q(x), then Bob will be more surprised than Alice, on average, upon seeing the value of X. The KL divergence is the (objective) expected value of Bob's (subjective) surprisal minus Alice's surprisal, measured in bits if the log is in base 2. In this way, the extent to which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him.
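The "extra bits" interpretation can be sketched directly from the definition D_KL(p ‖ q) = Σ p(x) log2(p(x)/q(x)); the distributions here are illustrative:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.5, 0.5]  # Alice's (true) distribution
prior = [0.9, 0.1]      # Bob's mistaken prior

extra_bits = kl_divergence(true_dist, prior)  # expected extra bits per datum
```

Note the asymmetry: kl_divergence(true_dist, prior) and kl_divergence(prior, true_dist) differ in general, which is one reason KL divergence is not a true metric.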

Directed Information


Directed information, I(X^n → Y^n), is an information theory measure that quantifies the information flow from the random process X^n = (X_1, X_2, …, X_n) to the random process Y^n = (Y_1, Y_2, …, Y_n). The term directed information was coined by James Massey and is defined as:

I(X^n → Y^n) = Σ_{i=1}^{n} I(X^i; Y_i | Y^{i−1}),

where I(X^i; Y_i | Y^{i−1}) is the conditional mutual information of X^i = (X_1, …, X_i) and Y_i, given Y^{i−1} = (Y_1, …, Y_{i−1}).

In contrast to mutual information, directed information is not symmetric. It measures the information bits that are transmitted causally[clarification needed] from X^n to Y^n. Directed information has many applications in problems where causality plays an important role, such as the capacity of channels with feedback,[48][49] the capacity of discrete memoryless networks with feedback,[50] gambling with causal side information,[51] compression with causal side information,[52] real-time control communication settings,[53][54] and statistical physics.[55]

Other quantities


Other important information theoretic quantities include the Rényi entropy and the Tsallis entropy (generalizations of the concept of entropy), differential entropy (a generalization of quantities of information to continuous distributions), and the conditional mutual information. Also, pragmatic information has been proposed as a measure of how much information has been used in making a decision.

Coding theory

A picture showing scratches on the readable surface of a CD-R. Music and data CDs are coded using error-correcting codes and thus can still be read even if they have minor scratches, thanks to error detection and correction.

Coding theory is one of the most important and direct applications of information theory. It can be subdivided into source coding theory and channel coding theory. Using a statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source.

  • Data compression (source coding): There are two formulations for the compression problem: lossless data compression, in which the data must be reconstructed exactly, and lossy data compression, which allocates the bits needed to reconstruct the data within a specified fidelity level, measured by a distortion function (the subject of rate–distortion theory).
  • Error-correcting codes (channel coding): While data compression removes as much redundancy as possible, an error-correcting code adds just the right kind of redundancy (i.e., error correction) needed to transmit the data efficiently and faithfully across a noisy channel.

This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems that justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel) or intermediary "helpers" (the relay channel), or more general networks, compression followed by transmission may no longer be optimal.

Source theory


Any process that generates successive messages can be considered a source of information. A memoryless source is one in which each message is an independent identically distributed random variable, whereas the properties of ergodicity and stationarity impose less restrictive constraints. All such sources are stochastic. These terms are well studied in their own right outside information theory.

Rate


Information rate is the average entropy per symbol. For memoryless sources, this is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is:

r = lim_{n→∞} H(X_n | X_{n−1}, X_{n−2}, …, X_1);

that is, the conditional entropy of a symbol given all the previous symbols generated. For the more general case of a process that is not necessarily stationary, the average rate is:

r = lim_{n→∞} (1/n) H(X_1, X_2, …, X_n);

that is, the limit of the joint entropy per symbol. For stationary sources, these two expressions give the same result.[56]

It is common in information theory to speak of the "rate" or "entropy" of a language. This is appropriate, for example, when the source of information is English prose. The rate of a source of information is related to its redundancy and how well it can be compressed, the subject of source coding.
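For a source with memory, conditioning on the past lowers the rate below the per-symbol entropy. A two-state Markov source makes this concrete; the transition probabilities below are hypothetical:

```python
import math

def binary_entropy(p):
    """Hb(p) in bits, with 0 log 0 taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Hypothetical binary Markov source: from state 0 it flips with probability a,
# from state 1 it flips with probability b.
a, b = 0.1, 0.3
pi0 = b / (a + b)  # stationary probability of state 0
pi1 = a / (a + b)  # stationary probability of state 1

# Entropy rate = stationary average of the next-symbol conditional entropy:
rate = pi0 * binary_entropy(a) + pi1 * binary_entropy(b)  # < 1 bit/symbol
```

Even though each emitted symbol is binary, the predictability from the previous symbol pushes the rate well below 1 bit per symbol.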

Channel capacity


Communication over a channel is the primary motivation of information theory. However, channels often fail to produce exact reconstruction of a signal; noise, periods of silence, and other forms of signal corruption often degrade quality.

Consider the communications process over a discrete channel. A simple model of the process is a transmitter that sends a message X through a noisy channel to a receiver, which observes Y.

Here X represents the space of messages transmitted, and Y the space of messages received during a unit time over our channel. Let p(y|x) be the conditional probability distribution function of Y given X. We will consider p(y|x) to be an inherent fixed property of our communications channel (representing the nature of the noise of our channel). Then the joint distribution of X and Y is completely determined by our channel and by our choice of f(x), the marginal distribution of messages we choose to send over the channel. Under these constraints, we would like to maximize the rate of information, or the signal, we can communicate over the channel. The appropriate measure for this is the mutual information, and this maximum mutual information is called the channel capacity and is given by:

C = sup_f I(X; Y)

This capacity has the following property related to communicating at information rate R (where R is usually bits per symbol). For any information rate R < C and coding error ε > 0, for large enough N, there exists a code of length N and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε; that is, it is always possible to transmit with arbitrarily small block error. In addition, for any rate R > C, it is impossible to transmit with arbitrarily small block error.

Channel coding is concerned with finding such nearly optimal codes that can be used to transmit data over a noisy channel with a small coding error at a rate near the channel capacity.

Capacity of particular channel models

  • A continuous-time analog communications channel subject to Gaussian noise—see Shannon–Hartley theorem.
  • A binary symmetric channel (BSC) with crossover probability p is a binary input, binary output channel that flips the input bit with probability p. The BSC has a capacity of 1 − Hb(p) bits per channel use, where Hb is the binary entropy function taken to the base-2 logarithm: Hb(p) = −p log2 p − (1 − p) log2(1 − p).
  • A binary erasure channel (BEC) with erasure probability p is a binary input, ternary output channel. The possible channel outputs are 0, 1, and a third symbol 'e' called an erasure. The erasure represents complete loss of information about an input bit. The capacity of the BEC is 1 − p bits per channel use.
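The closed-form capacities 1 − Hb(p) and 1 − p above translate directly into code (a sketch; the helper names are illustrative):

```python
import math

def binary_entropy(p):
    """Hb(p) in bits, with 0 log 0 taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Binary symmetric channel: 1 - Hb(p) bits per channel use."""
    return 1.0 - binary_entropy(p)

def bec_capacity(p):
    """Binary erasure channel: 1 - p bits per channel use."""
    return 1.0 - p
```

A noiseless BSC (p = 0) has capacity 1 bit per use, a fully random one (p = 0.5) has capacity 0, and a channel that flips every bit (p = 1) is as good as one that flips none.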

Channels with memory and directed information


In practice many channels have memory. Namely, at time i the channel is given by the conditional probability P(y_i | x_i, x_{i−1}, …, x_1, y_{i−1}, …, y_1). It is often more convenient to use the notation x^i = (x_i, x_{i−1}, …, x_1), in which case the channel becomes P(y_i | x^i, y^{i−1}). In such a case the capacity is given by the mutual information rate when there is no feedback available, and by the directed information rate whether or not feedback is available[48][57] (if there is no feedback the directed information equals the mutual information).

Fungible information


Fungible information is information for which the means of encoding is not important.[58] Classical information theorists and computer scientists are mainly concerned with information of this sort. It is sometimes referred to as speakable information.[59]

Applications to other fields


Intelligence uses and secrecy applications


Information theoretic concepts apply to cryptography and cryptanalysis. Turing's information unit, the ban, was used in the Ultra project, breaking the German Enigma machine code and hastening the end of World War II in Europe. Shannon himself defined an important concept now called the unicity distance. Based on the redundancy of the plaintext, it attempts to give a minimum amount of ciphertext necessary to ensure unique decipherability.

Information theory leads us to believe it is much more difficult to keep secrets than it might first appear. A brute force attack can break systems based on asymmetric key algorithms or on most commonly used methods of symmetric key algorithms (sometimes called secret key algorithms), such as block ciphers. The security of all such methods comes from the assumption that no known attack can break them in a practical amount of time.

Information theoretic security refers to methods such as the one-time pad that are not vulnerable to such brute force attacks. In such cases, the positive conditional mutual information between the plaintext and ciphertext (conditioned on the key) can ensure proper transmission, while the unconditional mutual information between the plaintext and ciphertext remains zero, resulting in absolutely secure communications. In other words, an eavesdropper would not be able to improve his or her guess of the plaintext by gaining knowledge of the ciphertext but not of the key. However, as in any other cryptographic system, care must be used to correctly apply even information-theoretically secure methods; the Venona project was able to crack the one-time pads of the Soviet Union due to their improper reuse of key material.

Pseudorandom number generation


Pseudorandom number generators are widely available in computer language libraries and application programs. They are, almost universally, unsuited to cryptographic use as they do not evade the deterministic nature of modern computer equipment and software. A class of improved random number generators is termed cryptographically secure pseudorandom number generators, but even they require random seeds external to the software to work as intended. These can be obtained via extractors, if done carefully. The measure of sufficient randomness in extractors is min-entropy, a value related to Shannon entropy through Rényi entropy; Rényi entropy is also used in evaluating randomness in cryptographic systems. Although related, the distinctions among these measures mean that a random variable with high Shannon entropy is not necessarily satisfactory for use in an extractor and so for cryptography uses.
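The gap between Shannon entropy and min-entropy is easy to exhibit: a distribution can have many bits of Shannon entropy while an adversary still guesses its most likely value half the time (an illustrative sketch):

```python
import math

def shannon_entropy(probs):
    """H = -sum p log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def min_entropy(probs):
    """H_min = -log2(max p): set entirely by the single most likely outcome."""
    return -math.log2(max(probs))

# Illustrative skewed distribution over 1024 outcomes: one outcome has
# probability 1/2, the rest share the remaining mass equally.
skewed = [0.5] + [0.5 / 1023] * 1023

h = shannon_entropy(skewed)   # roughly 6 bits
h_min = min_entropy(skewed)   # exactly 1 bit: a guesser is right half the time
```

This is why extractors are rated by min-entropy rather than Shannon entropy: the high average says nothing about the worst-case guessing probability.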

Seismic exploration


One early commercial application of information theory was in the field of seismic oil exploration. Work in this field made it possible to strip off and separate the unwanted noise from the desired seismic signal. Information theory and digital signal processing offer a major improvement of resolution and image clarity over previous analog methods.[60]

Semiotics


Semioticians Doede Nauta and Winfried Nöth both considered Charles Sanders Peirce as having created a theory of information in his works on semiotics.[61]: 171 [62]: 137  Nauta defined semiotic information theory as the study of "the internal processes of coding, filtering, and information processing."[61]: 91 

Concepts from information theory such as redundancy and code control have been used by semioticians such as Umberto Eco and Ferruccio Rossi-Landi to explain ideology as a form of message transmission whereby a dominant social class emits its message by using signs that exhibit a high degree of redundancy such that only one message is decoded among a selection of competing ones.[63]

Integrated process organization of neural information


Quantitative information-theoretic methods have been applied in cognitive science to analyze the integrated process organization of neural information in the context of the binding problem in cognitive neuroscience.[64] In this context, an information-theoretic measure is defined on the basis of a reentrant process organization, i.e. the synchronization of neurophysiological activity between groups of neuronal populations; examples include functional clusters (Gerald Edelman and Giulio Tononi's functional clustering model and dynamic core hypothesis (DCH)[65]) and effective information (Tononi's integrated information theory (IIT) of consciousness[66][67][68]). Alternatively, the measure is the minimization of free energy on the basis of statistical methods (Karl J. Friston's free energy principle (FEP), an information-theoretic measure which states that every adaptive change in a self-organized system leads to a minimization of free energy, and the Bayesian brain hypothesis[69][70][71][72][73]).

Miscellaneous applications


Information theory also has applications in the search for extraterrestrial intelligence,[74] black holes,[75] bioinformatics,[76] and gambling.[77][78]

from Grokipedia
Information theory is a mathematical discipline that quantifies, stores, and communicates information, focusing on the fundamental limits of data transmission and processing in the presence of noise. Founded by Claude E. Shannon through his seminal 1948 paper "A Mathematical Theory of Communication", it treats information as a probabilistic entity, independent of its semantic meaning, and emphasizes engineering aspects of reliable message reproduction across communication systems. At its core, information theory introduces entropy as the measure of uncertainty or average information content in a random source, defined for a discrete probability distribution as H = −Σ p_i log2 p_i bits, where p_i is the probability of each outcome. This concept underpins source coding theorems, which establish that data can be compressed to a length approaching the source entropy without loss, enabling efficient storage and transmission. Shannon's work also defines channel capacity as the supremum of mutual information over all input distributions, representing the highest rate at which information can be sent reliably over a noisy channel, as formalized in his noisy-channel coding theorem. Key extensions include mutual information, which quantifies the shared information between two variables as I(X; Y) = H(X) + H(Y) − H(X, Y), essential for analyzing dependencies in communication and data processing. The theory's impact spans telecommunications, where it optimizes signal encoding and error correction; computer science, influencing algorithms for compression and error detection; and broader fields like biology, physics, and neuroscience, where entropy models uncertainty in genetic codes, thermodynamic systems, and neural networks. Shannon's binary digit (bit) as the unit of information laid the groundwork for the digital age, transforming telecommunications and enabling modern electronics.

Introduction and History

Overview

Information theory is a branch of applied mathematics and electrical engineering primarily concerned with the quantification, storage, and communication of information. It was pioneered by Claude E. Shannon in his seminal 1948 paper, which established a rigorous mathematical framework for analyzing information processes. The field focuses on understanding how information can be measured in probabilistic terms, independent of its semantic meaning, to enable efficient handling in technical systems. The core problems addressed by information theory include quantifying the uncertainty inherent in messages or random events, devising efficient encoding techniques to minimize redundancy for storage and transmission, and designing methods to achieve reliable communication despite noise or interference in channels. These challenges are foundational to modern digital systems, where resources like bandwidth and storage are limited. For instance, information is conceptually viewed as the reduction of uncertainty: observing the outcome of a fair coin flip eliminates one bit of uncertainty, whereas specifying a result from a fair six-sided die requires log2(6) ≈ 2.58 bits on average. Beyond its origins in engineering, information theory has exerted significant influence across interdisciplinary domains, including computer science for algorithm design, biology for modeling genetic information flows, and physics for connections to thermodynamics. Central to the field is entropy as a measure of uncertainty, with channel capacity defining the maximum rate of reliable transmission.

Historical Development

The origins of information theory trace back to early 20th-century efforts to quantify signal transmission in . In 1924, published "Certain Factors Affecting Telegraph Speed" in the Technical Journal, where he derived a fundamental result establishing that a signal's maximum transmission rate is limited by twice the bandwidth, providing a foundational measure for the amount of information that could be sent over a channel without , later contributing to the . Building on this, Ralph Hartley introduced a quantitative measure of information in his 1928 paper "Transmission of Information" in the same journal, defining information as the logarithm of the number of possible distinguishable symbols, which laid the groundwork for logarithmic measures of uncertainty in communication systems. The field crystallized with Claude Shannon's seminal 1948 paper "," published in the Technical Journal, which formalized information theory by introducing concepts like as a measure of and as the maximum reliable transmission rate over noisy channels, fundamentally shaping modern digital communication. This work was popularized and extended beyond engineering to broader scientific audiences through Warren Weaver's collaboration in the 1949 book The Mathematical Theory of Communication, which emphasized the theory's implications for semantics and . In the , information theory expanded into computational foundations with the development of . Ray Solomonoff's 1960 technical report "A Preliminary Report on a General of Inductive " proposed measuring an object's by the length of the shortest program that generates it, formalizing inductive inference through . 
Andrey Kolmogorov independently advanced this idea in his 1965 paper "Three Approaches to the Quantitative Definition of Information" in Problems of Information Transmission, as did Gregory Chaitin in his 1966 paper "On the Length of Programs for Computing Finite Binary Sequences" in the Journal of the ACM; both defined complexity as the minimal program length for computation, providing a non-probabilistic, algorithmic basis for information that influenced randomness and computability studies. Quantum extensions emerged in the 1970s, with Alexander Holevo's 1973 paper "Bounds for the Quantity of Information Transmittable by a Quantum Communication Channel" in Problems of Information Transmission establishing the Holevo bound, which limits the classical information transmittable through quantum channels and founded quantum information theory. By 2000, Michael Nielsen and Isaac Chuang's textbook Quantum Computation and Quantum Information systematized quantum entropy and related measures, integrating classical information theory with quantum mechanics for applications in quantum computing. In the late 1990s and 2000s, information theory intersected with machine learning, notably through Naftali Tishby's 1999 introduction of the information bottleneck method in the Proceedings of the 37th Allerton Conference, which optimizes data compression for relevant predictions by balancing information retention and compression, later applied to understand representation learning in neural networks. This integration has continued into the 2020s, with information-theoretic tools analyzing training dynamics, such as information flows in neural architectures, as explored in recent works unifying algorithmic complexity and Shannon entropy in deep learning frameworks.

Fundamental Quantities

Entropy

In information theory, entropy quantifies the average uncertainty or information content associated with a discrete random variable. For a discrete random variable X taking values in an alphabet \mathcal{X} with probability mass function p(x) = \Pr(X = x) for x \in \mathcal{X}, the entropy H(X) is defined as H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x), where the logarithm is base 2, yielding a measure in bits. This formula, introduced by Claude Shannon, represents the expected value of the self-information of a single outcome, where the self-information of an event with probability p(x) is -\log_2 p(x), the number of bits needed to specify the outcome on average. Entropy serves as the fundamental limit on the average number of bits required to encode the outcomes of X without loss of information, as established in lossless source coding theorems. For instance, consider a fair coin flip, where X is Bernoulli with p(0) = p(1) = 1/2; then H(X) = 1 bit, indicating one bit suffices to describe the outcome on average. For a biased coin with p(1) = 0.9 and p(0) = 0.1, H(X) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.469 bits, reflecting lower uncertainty due to the bias. The binary entropy function h(p), a special case for Bernoulli random variables with success probability p, is given by h(p) = -p \log_2 p - (1-p) \log_2 (1-p), which reaches its maximum of 1 bit at p = 0.5 and approaches 0 as p nears 0 or 1. More generally, entropy can be expressed in other units: nats using the natural logarithm (one bit equals \ln 2 nats) or hartleys using the base-10 logarithm, though bits are standard in digital contexts for aligning with binary encoding. Key properties of entropy include non-negativity, H(X) \geq 0, with equality if and only if X is deterministic (one outcome has probability 1); this follows because each term -\log_2 p(x) is non-negative for p(x) \leq 1. Entropy is maximized for the uniform distribution over \mathcal{X}, where H(X) = \log_2 |\mathcal{X}|, providing an upper bound on entropy for a given alphabet size.
For independent random variables X and Y, entropy is additive: H(X,Y) = H(X) + H(Y). Shannon derived the entropy function from a set of axioms in his seminal 1948 paper: the measure must be continuous in the probabilities, monotonically increasing in the number of equally likely outcomes, and additive for independent events. Solving these yields H(X) = -K \sum p(x) \log p(x) for some positive constant K, which fixes the unit of measure (with K chosen so the logarithm is effectively base 2 for bits); the proof shows that the logarithmic form satisfies the axioms and is unique up to the constant multiplier.
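The entropy values quoted above can be checked numerically. The following Python sketch is illustrative (the function name `entropy` is my own, not from the text) and computes H(X) directly from a probability vector:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum p(x) log p(x), skipping zero-probability terms."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Fair coin: 1 bit of uncertainty.
print(entropy([0.5, 0.5]))             # 1.0
# Biased coin with p(1) = 0.9: about 0.469 bits.
print(round(entropy([0.9, 0.1]), 3))   # 0.469
# Uniform die: log2(6) ≈ 2.585 bits, the maximum for a six-symbol alphabet.
print(round(entropy([1/6] * 6), 3))    # 2.585
```

Passing `base=math.e` would return the same quantity in nats, matching the unit conversion noted above.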

Joint and Conditional Entropy

Joint entropy extends the concept of entropy to two or more random variables, quantifying the total uncertainty in their joint occurrence. For discrete random variables X and Y with joint probability mass function p(x,y), the joint entropy H(X,Y) is defined as H(X,Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y), which represents the average number of bits needed to specify both X and Y together. This measure averages the self-information over the joint distribution, capturing dependencies between the variables. A key property of joint entropy is its subadditivity: H(X,Y) \leq H(X) + H(Y), with equality holding if and only if X and Y are independent. Additionally, H(X,Y) \geq \max(H(X), H(Y)), reflecting that the combined uncertainty cannot be less than the uncertainty of either variable alone. The chain rule provides a decomposition: H(X,Y) = H(X) + H(Y|X), linking joint entropy to conditional entropy and enabling recursive extensions to more variables, such as H(X_1, \dots, X_n) = \sum_{i=1}^n H(X_i | X_1, \dots, X_{i-1}). Conditional entropy H(Y|X) measures the average remaining uncertainty about Y after observing X. It is formally defined as H(Y|X) = -\sum_{x,y} p(x,y) \log_2 p(y|x) = H(X,Y) - H(X), where p(y|x) = p(x,y)/p(x) is the conditional probability mass function. This quantity can be interpreted as the expected entropy of Y given each possible value of X, weighted by the probability of that value. Properties of conditional entropy include non-negativity: H(Y|X) \geq 0, with equality if Y is a deterministic function of X. Furthermore, conditioning cannot increase entropy: H(Y|X) \leq H(Y), with equality if X and Y are independent, indicating that knowledge of X either reduces or leaves unchanged the uncertainty in Y. As an example, consider two fair coin flips X and Y. If independent, the joint entropy H(X,Y) = 2 bits, equal to H(X) + H(Y) = 1 + 1.
If Y = X (perfect dependence), then H(X,Y) = 1 bit, which equals \max(H(X), H(Y)). For conditional entropy, suppose Y is the next English letter after observing the previous letter X; H(Y|X) is about 1.3 bits, much less than the unconditional H(Y) \approx 4.1 bits, due to contextual dependencies.
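The coin-flip examples above can be reproduced numerically. This illustrative Python sketch (function names are my own) computes joint and conditional entropy from an explicit joint distribution given as a dictionary:

```python
import math
from collections import defaultdict

def joint_entropy(pxy):
    """H(X,Y) from a dict {(x, y): probability}."""
    return -sum(p * math.log2(p) for p in pxy.values() if p > 0)

def conditional_entropy(pxy):
    """H(Y|X) = H(X,Y) - H(X), via the chain rule."""
    px = defaultdict(float)
    for (x, _), p in pxy.items():
        px[x] += p
    hx = -sum(p * math.log2(p) for p in px.values() if p > 0)
    return joint_entropy(pxy) - hx

# Two independent fair coins: H(X,Y) = 2 bits, H(Y|X) = 1 bit.
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(joint_entropy(indep), conditional_entropy(indep))  # 2.0 1.0

# Perfect dependence Y = X: H(X,Y) = 1 bit, H(Y|X) = 0 bits.
dep = {(0, 0): 0.5, (1, 1): 0.5}
print(joint_entropy(dep), conditional_entropy(dep))      # 1.0 0.0
```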

Mutual Information

Mutual information quantifies the amount of information that one random variable contains about another, serving as a measure of their statistical dependence. Introduced by Claude Shannon in his foundational work on communication theory, it captures the shared information between two variables without assuming a specific form of relationship, such as linearity. Formally, for discrete random variables X and Y with joint distribution p(x,y), the mutual information I(X;Y) is defined as the difference between the entropy of X and the conditional entropy of X given Y: I(X;Y) = H(X) - H(X|Y). This equals the Kullback-Leibler divergence between the joint distribution and the product of the marginals: I(X;Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}, where the logarithm is base 2 for units in bits. The measure is symmetric, such that I(X;Y) = I(Y;X), reflecting that the information Y provides about X is the same as the information X provides about Y. Mutual information possesses several key properties. It is non-negative, I(X;Y) \geq 0, with equality holding if and only if X and Y are independent, meaning knowledge of one provides no information about the other. Additionally, it is bounded above by the minimum of the individual entropies: I(X;Y) \leq \min(H(X), H(Y)), ensuring it cannot exceed the inherent uncertainty in either variable. In interpretive terms, mutual information represents the average reduction in uncertainty about X upon observing Y, or equivalently, the number of bits of information shared between the two variables. This makes it a fundamental tool for assessing dependence in probabilistic systems, distinct from joint and conditional entropies, which measure absolute or context-dependent uncertainties.
A chain rule extends mutual information to multiple variables: for random variables X_1, X_2, \dots, X_n and Y, I(X_1, X_2, \dots, X_n; Y) = I(X_1; Y) + I(X_2; Y | X_1) + \cdots + I(X_n; Y | X_1, \dots, X_{n-1}), allowing decomposition of total dependence into conditional contributions. Examples illustrate these concepts clearly. If X and Y are independent, then p(x,y) = p(x)p(y), so I(X;Y) = 0. Conversely, if Y = X (identical variables), then H(X|Y) = 0, yielding I(X;Y) = H(X), the full entropy of X.
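The two boundary cases just described can be verified directly from the KL-divergence form of the definition. This illustrative Python sketch (function name my own) sums p(x,y) log₂[p(x,y)/(p(x)p(y))] over a joint distribution:

```python
import math
from collections import defaultdict

def mutual_information(pxy):
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ), from {(x, y): prob}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in pxy.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

# Independent fair coins: I(X;Y) = 0 bits.
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(indep))  # 0.0

# Identical variables Y = X: I(X;Y) = H(X) = 1 bit.
dep = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(dep))    # 1.0
```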

Divergence Measures

Divergence measures in information theory quantify the difference between two probability distributions, with the Kullback-Leibler (KL) divergence serving as a foundational example. For discrete probability distributions P and Q over the same alphabet, the KL divergence is defined as \mathrm{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}, where the logarithm is typically base-2 for bits or natural for nats. This measure is always non-negative, \mathrm{KL}(P \parallel Q) \geq 0, and equals zero if and only if P = Q, a property established via Jensen's inequality. For continuous distributions, the sum is replaced by an integral: \mathrm{KL}(P \parallel Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx. The KL divergence exhibits key properties that distinguish it from metric distances. It is asymmetric, meaning \mathrm{KL}(P \parallel Q) \neq \mathrm{KL}(Q \parallel P) in general, reflecting its directed nature from P to Q. Additionally, it does not satisfy the triangle inequality, preventing it from forming a true metric on the space of probability distributions. Introduced by Solomon Kullback and Richard Leibler in 1951, the KL divergence originated in the context of statistical hypothesis testing and information sufficiency. Interpretationally, \mathrm{KL}(P \parallel Q) represents the expected number of extra bits required to encode samples drawn from P using a code optimized for Q, rather than one optimal for P. Equivalently, it measures the information gain obtained when using the true distribution P instead of the approximation Q. This connects to mutual information, where I(X;Y) = \mathbb{E}_{Y} \left[ \mathrm{KL}(P_{X|Y} \parallel P_X) \right], quantifying the average divergence between the conditional and marginal distributions of X. A concrete example arises with Bernoulli distributions, where P \sim \mathrm{Bern}(p) and Q \sim \mathrm{Bern}(q).
The KL divergence simplifies to \mathrm{KL}(P \parallel Q) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}, illustrating how it penalizes mismatches in success probabilities; for instance, with p = 0.5 and q = 0.6, \mathrm{KL}(P \parallel Q) \approx 0.029 bits. In applications like model selection, the KL divergence underpins criteria such as the Akaike information criterion (AIC), which estimates the expected relative KL divergence between the true data-generating distribution and a fitted model to balance goodness-of-fit and complexity.
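The Bernoulli formula above is simple enough to evaluate directly; this illustrative Python sketch (function name my own) reproduces the worked example and demonstrates the asymmetry property:

```python
import math

def kl_bernoulli(p, q):
    """KL(Bern(p) || Bern(q)) in bits, using the convention 0 log 0 = 0."""
    def term(a, b):
        return a * math.log2(a / b) if a > 0 else 0.0
    return term(p, q) + term(1 - p, 1 - q)

print(round(kl_bernoulli(0.5, 0.6), 3))  # 0.029, matching the example in the text
print(kl_bernoulli(0.5, 0.5))            # 0.0, since divergence vanishes iff P = Q
# Asymmetry: swapping the arguments changes the value.
print(round(kl_bernoulli(0.9, 0.5), 3), round(kl_bernoulli(0.5, 0.9), 3))  # 0.531 0.737
```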

Source Coding

Compression Fundamentals

Source coding theory addresses the compression of data from information sources without loss of information, establishing fundamental limits on achievable compression rates. The source coding theorem, also known as the noiseless coding theorem, states that for a discrete memoryless source producing symbols from alphabet \mathcal{X} with probability distribution p(x), the optimal average codeword length L for any uniquely decodable code satisfies L \geq H(X), where H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) is the entropy of the source. This bound is achievable in the limit as the block length increases, meaning sequences of source symbols can be compressed to rates arbitrarily close to the entropy using block codes that are uniquely decodable. The theorem implies that the minimum compression rate R required for reliable lossless encoding of the source is R \geq H(X) bits per symbol, ensuring that the encoded representation captures the inherent uncertainty of the source without redundancy beyond this limit. For sources with entropy lower than the logarithm of the alphabet size, significant compression is possible by exploiting statistical dependencies, as entropy quantifies the average surprise or unpredictability per symbol. Huffman coding provides an optimal prefix-free code for discrete memoryless sources with known symbol probabilities, achieving an average code length within 1 bit of the entropy bound. The algorithm proceeds by first sorting the symbols in decreasing order of their probabilities. It then iteratively builds a binary tree: the two symbols (or subtrees) with the lowest probabilities are combined into a new node with probability equal to their sum, assigning 0 and 1 to the branches, and this process repeats until a single root remains. The codewords are the paths from root to leaves, ensuring no codeword is a prefix of another and minimizing the weighted path length, which equals the average code length.
Arithmetic coding offers superior efficiency by encoding entire messages as fractions of the unit interval rather than assigning a separate codeword to each symbol, approaching the entropy bound more closely without the integer-length overhead of Huffman coding. In this method, the unit interval [0,1) is subdivided according to symbol probabilities; each symbol narrows the current interval proportionally, and the final interval's binary representation (or a tag within it) serves as the code, with the decoder reversing the process to recover the symbols. This fractional encoding allows the code length to be approximately -\log_2 P(m) bits for a message m with probability P(m), yielding near-entropy performance even for short messages. A practical example is the compression of English text, where the entropy is approximately 1 to 1.5 bits per character due to letter frequencies and contextual redundancies in natural language, far below the 4.7 bits per character (log₂ 26 for lowercase letters) of uniform encoding. This redundancy, arising from predictable patterns like common digrams and grammatical structure, enables compression ratios of about 3:1 to 5:1 using entropy-based methods, as demonstrated in early human prediction experiments that estimated the per-character information content.
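The Huffman tree-building procedure described above can be sketched compactly with a priority queue. The following Python sketch is illustrative (a minimal implementation, not a production codec); the counter in each heap entry only breaks probability ties so symbols are never compared directly:

```python
import heapq

def huffman_code(probs):
    """Build a prefix-free Huffman code from {symbol: probability}.

    Returns {symbol: bitstring}. Repeatedly merges the two least
    probable subtrees, prepending 0/1 to their codewords.
    """
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
# For dyadic probabilities the average length meets the entropy bound exactly:
print(avg_len)  # 1.75, equal to H(X) = 1.75 bits
```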

Practical Coding Techniques

Variable-length codes assign shorter codewords to more probable symbols and longer ones to less probable symbols, enabling efficient compression of sources with uneven symbol probabilities. One early method is Shannon-Fano coding, which constructs codes by recursively splitting the symbol set into two subsets of approximately equal total probability, assigning binary digits accordingly. This top-down approach ensures the code lengths are close to the negative logarithm of the probabilities, though it may not always yield optimal lengths. In contrast, Huffman coding achieves optimality for prefix-free codes over a fixed alphabet by building a code tree bottom-up, merging the two least probable symbols iteratively to form codewords whose lengths satisfy the Kraft inequality with equality. While Shannon-Fano is simpler to implement without requiring full probability sorting, Huffman codes generally produce shorter average lengths, making them preferable for static sources where the full probability distribution is known in advance. Both methods approach the entropy bound for large alphabets but incur redundancy for small ones due to integer code lengths. For universal compression of unknown or streaming data, the Lempel-Ziv-Welch (LZW) algorithm builds a dictionary adaptively by replacing repeated substrings with codes referencing prior occurrences. Starting from an initial dictionary of single symbols, LZW parses the input into phrases, outputs the code for the longest matching prefix, and adds the extension to the dictionary, enabling real-time learning without prior statistics. This variant of the LZ78 scheme excels on repetitive data, achieving compression ratios near the entropy rate asymptotically, and is widely used in formats such as GIF images for its simplicity and effectiveness on text and graphics. Lossy compression techniques sacrifice fidelity for higher rates, guided by rate-distortion theory, which quantifies the minimum rate R(D) needed to represent a source at average distortion no greater than D.
For a source X and reproduction \hat{X} with distortion E[d(X, \hat{X})] \leq D, R(D) = \min_{p(\hat{x}|x): E[d(X,\hat{X})] \leq D} I(X; \hat{X}), where the mutual information I(X; \hat{X}) captures the information preserved in the approximation. Practical source-side methods focus on transforming the source signal to concentrate its energy in a few coefficients, quantizing those coefficients according to the allowed distortion, and entropy-coding the result, balancing rate against perceptual quality without channel considerations. Adaptive coding addresses non-stationary sources by updating models on the fly, avoiding assumptions of fixed statistics. Prediction by partial matching (PPM) exemplifies this by estimating symbol probabilities based on increasingly longer contexts from recent history, escaping to lower-order models when unseen patterns arise to prevent zero-probability estimates. This hierarchical approach achieves superior compression on text and other non-stationary data by capturing evolving dependencies, often outperforming static Huffman coding when paired with adaptive arithmetic coders. A representative application is the JPEG standard for still-image compression, which applies the discrete cosine transform (DCT) to 8x8 blocks to decorrelate pixels based on spatial statistics, yielding coefficients that are quantized and Huffman-coded. By exploiting human visual sensitivity, retaining low-frequency details while discarding high-frequency detail, JPEG trades minor artifacts for ratios of 10:1 to 20:1 on typical photos, though computational costs in DCT and quantization rise with image resolution. These techniques highlight the tension between achieving near-theoretical rates and practical constraints like decoder simplicity.
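The LZW parsing loop described above (output the longest matching phrase, then learn its one-symbol extension) can be sketched in a few lines. This is an illustrative minimal encoder, not a full codec with bit-packing or a decoder:

```python
def lzw_compress(data):
    """Minimal LZW sketch: returns a list of dictionary codes for a byte string."""
    dictionary = {bytes([i]): i for i in range(256)}
    phrase, out = b"", []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate            # keep extending the current match
        else:
            out.append(dictionary[phrase])
            dictionary[candidate] = len(dictionary)  # learn the new phrase
            phrase = bytes([byte])
    if phrase:
        out.append(dictionary[phrase])
    return out

# Repetitive input compresses well: 12 input bytes become 6 codes.
codes = lzw_compress(b"abababababab")
print(codes)  # [97, 98, 256, 258, 257, 260]
```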

Channel Coding

Capacity and Noisy Channels

In information theory, a discrete memoryless channel (DMC) is modeled as a probabilistic mapping from input symbols X drawn from a finite alphabet \mathcal{X} to output symbols Y from a finite alphabet \mathcal{Y}, characterized by conditional transition probabilities p(y|x) that specify the probability of receiving y given that x was transmitted, with outputs independent across time slots. The capacity C of such a channel represents the supremum of rates at which information can be reliably transmitted and is defined as the maximum mutual information between input and output over all possible input distributions: C = \max_{p(x)} I(X; Y), where I(X; Y) = H(Y) - H(Y|X) quantifies the reduction in uncertainty about Y provided by knowledge of X. Shannon's noisy-channel coding theorem establishes the fundamental limits of reliable communication over such channels: for any rate R < C, there exists a coding scheme achieving arbitrarily low probability of error as the block length n \to \infty; conversely, for R > C, the error probability is bounded away from zero. A sketch of the achievability proof relies on random coding and typical set decoding. In random coding, 2^{nR} codewords are generated independently, one for each message, with each symbol drawn i.i.d. from the capacity-achieving input distribution p(x); the average error probability over this ensemble vanishes as n \to \infty for R < I(X; Y), implying the existence of at least one good code via the probabilistic method. For decoding, the receiver selects the unique codeword jointly typical with the received sequence y^n, where jointly typical pairs (x^n, y^n) are those whose empirical statistics are close to the true joint distribution p(x,y); with high probability the correct codeword is decoded, because the probability mass concentrates on the jointly typical set of size approximately 2^{n H(X,Y)}.
For the binary symmetric channel (BSC), where \mathcal{X} = \mathcal{Y} = \{0,1\} and bits flip with probability p < 1/2, the capacity-achieving input distribution is uniform, yielding C = 1 - h_2(p), with h_2(p) = -p \log_2 p - (1-p) \log_2 (1-p) the binary entropy function, so C = 1 bit per use when p = 0 (noiseless) and decreases to 0 as p \to 1/2. In the continuous-time setting with additive white Gaussian noise (AWGN), the channel adds independent Gaussian noise to the transmitted signal under a power constraint; discretizing into 2W dimensions per second (the Nyquist rate) for bandwidth W, the capacity simplifies to \frac{1}{2} \log_2 (1 + \mathrm{SNR}) bits per dimension, where SNR is the signal-to-noise ratio P/N with transmit power P and noise power spectral density N/2. This capacity C is interpreted as the supremum of reliable transmission rates in bits per channel use, beyond which noise fundamentally limits error-free communication, even with optimal encoding and decoding.
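The BSC capacity formula C = 1 − h₂(p) is easy to evaluate; this illustrative Python sketch (function names my own) traces how capacity shrinks from 1 bit to 0 as the crossover probability grows:

```python
import math

def binary_entropy(p):
    """h2(p) in bits, with h2(0) = h2(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of the binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)

print(bsc_capacity(0.0))             # 1.0, the noiseless channel
print(round(bsc_capacity(0.11), 3))  # 0.5, roughly half a bit per use
print(bsc_capacity(0.5))             # 0.0, output independent of input
```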

Specific Channel Models

The binary erasure channel (BEC) models a communication channel where each transmitted bit is received correctly with probability 1 - \alpha or results in an erasure symbol with probability \alpha, allowing the receiver to detect but not recover erased bits. Introduced by Peter Elias in his foundational work on coding for noisy channels, the BEC simplifies analysis by separating errors from erasures. The capacity of the BEC is C = 1 - \alpha bits per channel use, derived as the maximum mutual information I(X;Y) over input distributions, where the erasure symbol provides partial information. This capacity is achievable with low-complexity codes such as low-density parity-check (LDPC) codes that handle erasures reliably, as demonstrated by code sequences approaching the capacity limit under maximum a posteriori decoding. Asymmetric binary channels, such as the Z-channel, extend the BEC by introducing biased error patterns in which one input (typically 0) is received error-free, while the other (1) flips to 0 with probability \alpha and stays 1 otherwise. The Z-channel capacity requires numerical optimization over the input distribution p = P(X=1), given by C = \max_p [h(\beta) - p h(\alpha)], where \beta = p(1 - \alpha) is the probability of output 1 and h(\cdot) denotes the binary entropy function; this formula arises from maximizing I(X;Y) = H(Y) - H(Y|X), with H(Y|X) = p h(\alpha). Coding strategies for the Z-channel often involve asymmetric error-correcting codes, such as repeat-accumulate ensembles, which approach capacity through iterative decoding optimized for the channel's asymmetry. Similar optimization techniques apply to other asymmetric binary channels, such as the binary asymmetric channel, where error probabilities differ for each input, yielding capacities via numerical optimization of the input distribution.
For continuous-time channels, the additive white Gaussian noise (AWGN) channel under a power constraint P represents a fundamental model, with capacity C = W \log_2 \left(1 + \frac{P}{N_0 W}\right) bits per second for bandwidth W Hz, where N_0 is the noise power spectral density; this seminal result, established by Shannon, assumes a Gaussian input distribution to maximize the output entropy. In frequency-selective fading, parallel Gaussian channels with varying noise levels \sigma_i^2 and total power constraint \sum P_i \leq P achieve capacity via the water-filling solution, allocating power P_i = \max(\mu - \sigma_i^2, 0) such that \sum P_i = P, where \mu is chosen to satisfy the constraint, leading to C = \sum \frac{1}{2} \log_2 \left(1 + \frac{P_i}{\sigma_i^2}\right). This allocation prioritizes lower-noise subchannels, enhancing spectral efficiency in multicarrier systems like OFDM. Multiple-input multiple-output (MIMO) channels generalize the Gaussian model to systems with n_t transmit and n_r receive antennas, where the received signal is \mathbf{Y} = \mathbf{H} \mathbf{X} + \mathbf{Z}, with channel matrix \mathbf{H}, input covariance matrix \boldsymbol{\Sigma} under trace constraint \operatorname{tr}(\boldsymbol{\Sigma}) \leq P, and \mathbf{Z} \sim \mathcal{N}(0, N_0 \mathbf{I}). The ergodic capacity is C = \max_{\boldsymbol{\Sigma}} \mathbb{E} \left[ \log_2 \det \left( \mathbf{I} + \frac{\mathbf{H} \boldsymbol{\Sigma} \mathbf{H}^H}{N_0} \right) \right] bits per channel use, achieved with Gaussian inputs; for fixed \mathbf{H}, it simplifies to \log_2 \det \left( \mathbf{I} + \frac{\mathbf{H} \boldsymbol{\Sigma} \mathbf{H}^H}{N_0} \right). This formula, derived by Telatar and independently by Foschini, highlights multiplexing gains scaling with \min(n_t, n_r), enabling high-throughput transmission in wireless systems.
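The water-filling allocation P_i = max(μ − σ_i², 0) can be found by a simple search for the water level μ. This illustrative Python sketch (a minimal bisection, not an optimized solver) allocates power across three hypothetical subchannels and evaluates the resulting capacity:

```python
import math

def water_filling(noise_powers, total_power):
    """Allocate P_i = max(mu - sigma_i^2, 0) so that sum(P_i) = total_power.

    The water level mu is found by bisection on the monotone map
    mu -> sum(max(mu - n, 0)).
    """
    lo, hi = 0.0, max(noise_powers) + total_power
    for _ in range(100):
        mu = (lo + hi) / 2
        used = sum(max(mu - n, 0.0) for n in noise_powers)
        if used > total_power:
            hi = mu
        else:
            lo = mu
    return [max(mu - n, 0.0) for n in noise_powers]

noise = [1.0, 2.0, 4.0]          # hypothetical subchannel noise powers
alloc = water_filling(noise, 3.0)
print([round(p, 3) for p in alloc])  # [2.0, 1.0, 0.0]: quieter subchannels get more power
capacity = sum(0.5 * math.log2(1 + p / n) for p, n in zip(alloc, noise))
print(round(capacity, 3))            # 1.085 bits per dimension across the three subchannels
```

Note that the noisiest subchannel receives no power at all, the characteristic behavior of water-filling.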
Erasure and deletion channels pose greater challenges than the BEC, as deletions remove symbols without explicit indication, complicating synchronization and decoding. For the binary deletion channel with deletion probability \delta, the capacity remains an open problem but is bounded above by 1 - h(\delta) and below by constructive codes achieving rates approaching 1 - \delta for small \delta; exact capacity expressions are unknown except in limiting cases. The Varshamov-Tenengolts (VT) codes, introduced for single asymmetric errors but adaptable to single deletions, provide near-optimal constructions with rate \approx 1 - \frac{\log n}{n} for block length n, using a parity-check sum \sum i x_i \equiv 0 \pmod{n+1} to locate the deletion. For multiple deletions, bounds rely on combinatorial arguments, with capacity for small \delta approaching 1 - H_2(\delta) \approx 1 - \delta \log_2 (1/\delta). These channel models inform practical system design: BEC-like erasure handling with erasure-correcting codes improves reliability over lossy links, and water-filling optimizes power allocation in DSL modems to combat frequency-dependent attenuation. In 5G New Radio (NR) standards, channel models from TR 38.901 incorporate clustered delay line (CDL) and tapped delay line (TDL) profiles for frequencies up to 100 GHz, enabling capacity computations under spatial consistency and blockage modeling, with implications for enhanced mobile broadband achieving up to 20 Gbps peak rates.

Advanced Concepts

Directed Information

Directed information generalizes mutual information to processes exhibiting temporal dependencies, providing a measure of causal information flow from one process to another in a directed manner. Formalized by James L. Massey in 1990, building on ideas from Hans Marko's bidirectional communication theory, it addresses scenarios where the influence flows asymmetrically over time, such as in communication channels with memory or feedback, where traditional mutual information fails to capture the direction of causation. This measure is particularly valuable in analyzing systems where past outputs influence future inputs, enabling the quantification of causal influence in non-stationary or sequential settings. The directed information from an input process X^n = (X_1, \dots, X_n) to an output process Y^n = (Y_1, \dots, Y_n) is formally defined as I(X^n \to Y^n) = \sum_{t=1}^n I(X^t; Y_t \mid Y^{t-1}), where I(\cdot; \cdot \mid \cdot) denotes conditional mutual information, X^t = (X_1, \dots, X_t), Y^{t-1} = (Y_1, \dots, Y_{t-1}) (with Y^0 empty), and the sum aggregates the incremental information contributed by each input prefix given the prior output history. This formulation emphasizes causality by conditioning only on past outputs, avoiding non-causal dependencies on future values. The directed information rate is then \bar{I}(X \to Y) = \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n), assuming the limit exists. For channels with memory, where the output at each time depends on the entire input history and possibly past outputs, the capacity is characterized using the directed information rate. Specifically, the capacity C is given by C = \sup \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n), where the supremum is taken over all causal input distributions P_{X^n \| Y^{n-1}} that respect the channel's structure. This multi-letter expression accounts for the temporal correlations, making it suitable for channels such as those with inter-symbol interference or additive noise with memory.
Key properties of directed information include its non-negativity and a data processing inequality: if Z^n is obtained from Y^n through a further channel, then I(X^n \to Z^n) \leq I(X^n \to Y^n). It reduces to the standard mutual information I(X^n; Y^n) for channels used without feedback, since I(X^n \to Y^n) = I(X^n; Y^n) holds when the inputs do not depend on past outputs. Unlike mutual information, directed information is asymmetric, I(X^n \to Y^n) \neq I(Y^n \to X^n) in general, which explicitly encodes the direction of information flow and distinguishes causal influences. Each conditional mutual information term in the definition measures the reduction in uncertainty about Y_t given the output history Y^{t-1}. In applications to feedback channels, where the input X_t may depend on previous outputs Y^{t-1}, directed information provides the appropriate metric for capacity, as mutual information assumes non-causal access. The feedback capacity satisfies C_{\text{feedback}} \geq C_{\text{no feedback}}, with equality for discrete memoryless channels but strict increase possible for channels with memory. A notable example is the Schalkwijk-Kailath scheme for the AWGN channel with feedback, which achieves the feedback capacity through iterative linear encoding that refines message estimates, demonstrating practical gains in error exponents despite no increase in asymptotic capacity for this memoryless case. This scheme highlights how directed information guides optimal coding strategies in feedback scenarios by focusing on causal contributions. Examples of directed information computation arise in Markovian settings, where sources or channels follow a Markov chain, simplifying the conditional terms in the sum. For a Markov source driving a Markov channel, the directed information rate can be evaluated using the stationary distribution and transition probabilities, yielding closed-form expressions for the causal flow.
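The reduction of directed information to ordinary mutual information in the absence of feedback can be checked by brute force on a tiny example. This illustrative Python sketch (the setup, a memoryless BSC with i.i.d. uniform inputs and block length n = 2, is my own choice) enumerates the joint distribution and compares both quantities:

```python
import math
from itertools import product

# Illustrative setup: memoryless BSC with crossover 0.1, i.i.d. uniform
# inputs, block length n = 2, no feedback. Directed information should then
# coincide with the ordinary mutual information I(X^n; Y^n).
eps, n = 0.1, 2

def prob(xs, ys):
    """P(x^n, y^n) for i.i.d. uniform inputs through the BSC."""
    p = 1.0
    for x, y in zip(xs, ys):
        p *= 0.5 * ((1 - eps) if x == y else eps)
    return p

seqs = list(product([0, 1], repeat=n))
joint = {(xs, ys): prob(xs, ys) for xs in seqs for ys in seqs}

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(keyfun):
    d = {}
    for (xs, ys), p in joint.items():
        k = keyfun(xs, ys)
        d[k] = d.get(k, 0.0) + p
    return d

# Ordinary mutual information I(X^n; Y^n) = H(X^n) + H(Y^n) - H(X^n, Y^n).
mi = H(marginal(lambda xs, ys: xs)) + H(marginal(lambda xs, ys: ys)) - H(joint)

# Directed information I(X^n -> Y^n) = sum_t I(X^t; Y_t | Y^{t-1}), expanded
# via I(A; B | C) = H(B, C) - H(C) - H(A, B, C) + H(A, C).
di = 0.0
for t in range(n):
    di += (H(marginal(lambda xs, ys: ys[:t + 1]))
           - H(marginal(lambda xs, ys: ys[:t]))
           - H(marginal(lambda xs, ys: xs[:t + 1] + ys[:t + 1]))
           + H(marginal(lambda xs, ys: xs[:t + 1] + ys[:t])))

print(round(mi, 6) == round(di, 6))  # True: they agree without feedback
```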

Information in Non-Standard Settings

Information theory extends beyond classical discrete and continuous domains to encompass quantum systems, computational paradigms, and complex stochastic processes, where traditional Shannon entropy must be generalized to capture phenomena like superposition, undecidability, and long-range correlations. In quantum settings, information is quantified using density operators rather than probability distributions, leading to measures that account for non-commutativity and entanglement. Algorithmic variants shift focus from probabilistic uncertainty to the intrinsic descriptive complexity of individual objects via program length. These extensions provide foundational tools for fields like quantum computing and theoretical computer science, revealing limits unattainable in classical frameworks. In quantum information theory, the von Neumann entropy serves as the primary measure of uncertainty for a quantum state described by a density matrix \rho, defined as S(\rho) = -\operatorname{Tr}(\rho \log \rho), where \operatorname{Tr} denotes the trace operation and the logarithm is base-2 for bits. This entropy generalizes Shannon's entropy to quantum mechanics, quantifying the mixedness of \rho while vanishing for pure states. For an ensemble of states \{p_i, \rho_i\}, the Holevo bound \chi upper-bounds the accessible classical information, given by \chi = S\left(\sum_i p_i \rho_i\right) - \sum_i p_i S(\rho_i), establishing a fundamental limit on how much classical data can be reliably transmitted through quantum channels without measurement collapse. The quantum mutual information between subsystems A and B of a bipartite state \rho_{AB} extends this to correlations, defined as I(A:B) = S(\rho_A) + S(\rho_B) - S(\rho_{AB}), where \rho_A = \operatorname{Tr}_B(\rho_{AB}) and similarly for \rho_B; this measure is non-negative and equals twice the entanglement entropy for pure states.
Quantum channels, modeled as completely positive trace-preserving maps, have capacities constrained by these quantities; for the qubit depolarizing channel \mathcal{N}_p(\rho) = (1-p)\rho + p \frac{I}{2}, where p is the depolarizing probability and I is the identity, the classical capacity equals the Holevo information \chi(\mathcal{N}_p), achieved additively even when tensored with arbitrary channels, while the quantum capacity remains open but is bounded above by approximate degradability for low p. Algorithmic information theory redefines information content through the lens of computation, independent of probability. The Kolmogorov complexity K(x) of a binary string x is the length of the shortest program on a fixed universal Turing machine that outputs x and halts, providing an absolute, observer-independent measure of complexity. This notion, introduced by Andrey Kolmogorov, captures the minimal description length required to specify x, rendering random those strings with K(x) \approx |x| (incompressible strings), but K(x) is uncomputable due to the halting problem, limiting its practical use to approximations like compression algorithms. Mutual-information analogs, such as I(x:y) = K(x) + K(y) - K(x,y), quantify shared complexity between objects, foundational for understanding randomness and inductive inference despite theoretical intractability. For stochastic processes, the entropy rate quantifies average uncertainty per symbol in the long run. For a stationary discrete-time process \{X_i\}, the entropy rate is defined as H = \lim_{n \to \infty} \frac{1}{n} H(X_1, \dots, X_n), where the limit exists under stationarity and equals \lim_{n \to \infty} H(X_n | X_1, \dots, X_{n-1}) by the chain rule. This rate, originally explored by Claude Shannon, governs the fundamental limit for lossless compression of process outputs, with finite values for ergodic processes like Markov chains, while the limit may fail to exist for non-stationary ones.
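For a stationary Markov chain the entropy-rate limit has the closed form H = -\sum_i \pi_i \sum_j P_{ij} \log_2 P_{ij}, where \pi is the stationary distribution. A minimal sketch (illustrative, with an arbitrarily chosen two-state transition matrix) finds \pi by power iteration and then evaluates the rate:

```python
import math

def markov_entropy_rate(P, iters=1000):
    """Entropy rate in bits/symbol of a stationary Markov chain with
    transition matrix P (rows sum to 1); returns (rate, stationary dist)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):  # power iteration: pi <- pi P
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    rate = -sum(pi[i] * P[i][j] * math.log2(P[i][j])
                for i in range(n) for j in range(n) if P[i][j] > 0)
    return rate, pi

# Two-state chain: stationary distribution is (2/3, 1/3), so the rate is
# (2/3) * H2(0.1) + (1/3) * H2(0.2), where H2 is the binary entropy function.
P = [[0.9, 0.1],
     [0.2, 0.8]]
rate, pi = markov_entropy_rate(P)
```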
It bridges classical information theory to dynamical systems, enabling analysis of correlation decay in sources like language models or physical systems. These concepts find applications in quantum gravity, such as the AdS/CFT correspondence, where the holographic principle posits that bulk gravitational degrees of freedom in anti-de Sitter (AdS) space encode boundary conformal field theory (CFT) information, with the entanglement entropy of a boundary region given by the Ryu-Takayanagi formula S = \frac{\operatorname{Area}(\gamma)}{4G_N}, where \gamma is the extremal surface homologous to the boundary region; this duality suggests spacetime emerging from entanglement, as explored in generalizations to excited states and deformed geometries.

Applications

Communication and Secrecy

Information theory provides foundational principles for secure communication, ensuring that messages remain confidential even against adversaries with unlimited computational power. A cornerstone is the concept of perfect secrecy, introduced by Claude Shannon, which requires that the mutual information between the plaintext and the ciphertext is zero, meaning the adversary gains no information about the message from the ciphertext alone. This condition holds if the key used for encryption is at least as long as the message and uniformly random, independent of the message, as demonstrated in the one-time pad scheme, where the ciphertext is the modular sum of the message and key. In terms of entropy, perfect secrecy demands that the key entropy H(K) satisfies H(K) \geq H(M), where H(M) is the entropy of the message M, ensuring the key cannot be shorter than the message's uncertainty without compromising security. In noisy communication environments, the wiretap channel model extends these ideas to scenarios where an eavesdropper receives a degraded version of the signal. Aaron Wyner defined the secrecy capacity as the maximum rate at which a message can be reliably transmitted to the legitimate receiver while keeping it perfectly secret from the eavesdropper, given by C_s = \max_{p(x)} [I(X;Y) - I(X;Z)], where X is the input, Y the legitimate output, and Z the eavesdropper's output. For degraded channels, where the eavesdropper's channel is a degraded version of the main channel, this simplifies to C_s = C_{\text{main}} - C_{\text{wiretap}}, with C_{\text{main}} and C_{\text{wiretap}} being the capacities of the respective channels. Achieving this capacity involves stochastic encoding, where the transmitter adds random noise to confuse the eavesdropper without affecting the legitimate receiver's decoding, leveraging the difference in mutual informations.
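The one-time pad's perfect secrecy can be seen by enumeration: with a uniformly random key, every ciphertext value is equally likely regardless of which message was sent, so the ciphertext reveals nothing about the message. A toy sketch over 4-bit blocks (block length chosen arbitrarily for illustration):

```python
from collections import Counter

BITS = 4  # toy block length

def otp_encrypt(message: int, key: int) -> int:
    """One-time pad on bit blocks: bitwise XOR (addition mod 2 per bit)."""
    return message ^ key

def ciphertext_distribution(message: int) -> Counter:
    """Ciphertext distribution induced by a uniformly random key."""
    return Counter(otp_encrypt(message, k) for k in range(2 ** BITS))

# The induced distribution is uniform and identical for every message,
# which is exactly the condition I(M; C) = 0 for perfect secrecy.
dist_a = ciphertext_distribution(0b1010)
dist_b = ciphertext_distribution(0b0001)
```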
Authentication and key distribution in information-theoretic settings rely on mutual information measures to bound security, particularly in protocols that generate shared secrets over public channels. Ueli Maurer showed that parties observing correlated random variables can distill a secret key through public discussion, with the achievable key rate bounded by the mutual information between their observations minus the leakage to the adversary. In classical randomness extraction, information-theoretic bounds use the asymptotic equipartition property (AEP) to analyze extractors, such as universal hash functions, which compress high-entropy sources into uniform keys while preserving secrecy; the AEP ensures that typical sequences have empirical entropy close to the source's, enabling efficient extraction. These extractors guarantee that the output key's entropy is nearly as large as the input's min-entropy, providing strong security guarantees independent of computational assumptions. Steganography applies information theory to hide messages within cover media such that the presence of the hidden information is undetectable. Christian Cachin formalized this using the Kullback-Leibler (KL) divergence, defining perfect security when the divergence between the distributions of cover and stego (hidden) objects is zero, ensuring the adversary cannot distinguish them better than random guessing. The embedding rate is limited by the KL divergence, as higher rates increase detectability; secure schemes minimize D(P_{\text{cover}} \| P_{\text{stego}}) through careful modification of the cover while preserving its statistical properties. Practical examples illustrate these principles. In quantum key distribution, the BB84 protocol by Charles Bennett and Gilles Brassard enables information-theoretically secure key exchange using polarized photons, where error correction and privacy amplification (via hash-based extractors) distill a secret key from the shared quantum measurements, secure against any eavesdropper due to the no-cloning theorem and entropy extraction.
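Cachin's criterion can be made concrete with a toy cover model. The sketch below uses hypothetical numbers, assuming a single binary cover feature such as a least-significant bit, and computes D(P_cover || P_stego) for a light and a heavy embedding rate to show detectability growing with the distortion of the cover statistics:

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits for discrete distributions given as lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical cover statistics: a balanced least-significant bit.
cover = [0.5, 0.5]
# Light embedding barely perturbs the distribution; heavy embedding skews it.
stego_light = [0.51, 0.49]
stego_heavy = [0.70, 0.30]

d_light = kl_divergence(cover, stego_light)
d_heavy = kl_divergence(cover, stego_heavy)
```

A perfectly secure scheme keeps the divergence at exactly zero; in practice, lower embedding rates trade capacity for a smaller divergence and therefore weaker detectors.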

Computing and Machine Learning

Information theory plays a pivotal role in the design and analysis of pseudorandom number generators (PRNGs), where information-theoretic tests evaluate the unpredictability and uniformity of generated sequences to ensure they mimic true randomness for cryptographic and simulation purposes. These tests, grounded in entropy measures, assess whether a PRNG's output distribution is indistinguishable from uniform by quantifying deviations in entropy or correlations between consecutive outputs. For instance, tests based on source coding theory compress PRNG outputs and compare the achieved compression rates against those expected from truly random sources, revealing biases if the empirical entropy rate falls short of the theoretical maximum. A key application in cryptography involves randomness extraction from weak random sources, where the leftover hash lemma provides a foundational result for converting sources with partial entropy into nearly uniform keys. The lemma states that, for a source with min-entropy k and a family of universal hash functions, hashing the source yields an output whose statistical distance from uniform is at most 2^{-\frac{k - \ell}{2}}, where \ell is the output length, enabling secure key derivation even from imperfect sources like hardware noise. This principle underpins privacy amplification in quantum key distribution, ensuring that adversaries gain negligible information about the extracted bits. In machine learning, the information bottleneck (IB) principle formalizes the trade-off between data compression and predictive utility by seeking representations that minimize mutual information with inputs while maximizing it with outputs. Introduced as a method to extract relevant features through a compressed bottleneck variable, IB optimizes the objective I(X; T) - \beta I(T; Y), where X is the input, T the compressed representation, Y the target, and \beta balances compression against relevance. Variational bounds approximate this non-convex optimization, enabling scalable training via neural networks that approximate the posterior p(t|x). This framework has influenced deep learning by promoting parsimonious models that discard irrelevant noise.
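The compression-based test described above can be sketched in a few lines (illustrative only; `zlib` stands in for an ideal source coder, and the sample sizes are arbitrary): near-uniform bytes should be essentially incompressible, while a source that emits only two byte values carries roughly one bit of entropy per byte and compresses heavily.

```python
import random
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size over original size; ~1.0 means incompressible."""
    return len(zlib.compress(data, 9)) / len(data)

rng = random.Random(0)  # seeded PRNG standing in for the source under test
uniform_bytes = bytes(rng.randrange(256) for _ in range(4096))
biased_bytes = bytes(rng.choice((0, 1)) for _ in range(4096))  # ~1 bit/byte

r_uniform = compression_ratio(uniform_bytes)
r_biased = compression_ratio(biased_bytes)
```

The gap between the two ratios mirrors the gap between the sources' entropy rates, which is exactly what a source-coding randomness test measures.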
Neural networks leverage information theory for representation learning, as seen in InfoGAN, which extends generative adversarial networks by maximizing the mutual information between latent codes and generated images to yield interpretable, disentangled factors. In InfoGAN, the objective augments the GAN loss with I(c; G(z, c)), where c is a subset of informative latent codes, estimated via a variational lower bound to encourage structured latent spaces without supervision. Similarly, rate-distortion theory informs autoencoders by framing encoding as a compression problem, minimizing distortion d(x, \hat{x}) subject to a rate constraint on the mutual information I(x; z), where z is the latent code. This approach yields efficient embeddings for tasks like dimensionality reduction, with the rate-distortion function R(D) = \min I(X; Z) such that E[d(X, \hat{X})] \leq D. In data analysis, information-theoretic measures enhance clustering by quantifying the uncertainty reduction achieved by partitioning data into groups. An information-theoretic perspective determines the optimal number of clusters by minimizing the total entropy across clusters while penalizing complexity, treating clustering as a balance between description length and fit, akin to the minimum description length principle. For example, the entropy of a clustering is H = -\sum_k p_k \log p_k + \sum_k p_k H_k, where p_k is the probability of cluster k and H_k its internal entropy, guiding algorithms to avoid over- or under-clustering. Feature selection in high-dimensional settings employs conditional mutual information to identify variables that provide unique predictive power given others, selecting subsets that maximize I(f; y | S), where f is a candidate feature, y the target, and S the selected set, mitigating redundancy and improving model efficiency. Recent advances as of 2025 integrate information theory with generative models, particularly diffusion models, where the forward and reverse processes are analyzed through entropy flows to bound sample quality and training efficiency.
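The feature-selection criterion can be illustrated on a tiny discrete dataset (constructed purely for this example): empirical mutual information flags the informative features and scores pure noise at zero, while a redundant recoding shows why conditioning on the already-selected set S matters.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) in bits estimated from paired samples via empirical counts."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Toy labels and features: f1 determines y, f2 is a redundant recoding of f1,
# and f3 is independent noise.
y  = [0, 0, 1, 1, 0, 1, 0, 1]
f1 = [0, 0, 1, 1, 0, 1, 0, 1]
f2 = [1, 1, 0, 0, 1, 0, 1, 0]
f3 = [0, 0, 1, 1, 1, 0, 1, 0]

scores = {name: mutual_information(f, y)
          for name, f in [("f1", f1), ("f2", f2), ("f3", f3)]}
# f1 and f2 each carry the full 1 bit about y, but f2 adds nothing once f1 is
# selected (I(f2; y | f1) = 0); f3 scores zero and would never be chosen.
```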
In information-theoretic frameworks, the score function aligns with minimum mean-square error estimation under entropy constraints, revealing that diffusion paths degrade and then restore information in a manner akin to traversing rate-distortion curves during reverse denoising. For large language models (LLMs), compression bounds limit training efficacy: the "entropy law" links first-epoch loss to compressibility, suggesting that LLM performance scales with the negative log of the compression ratio the model achieves on its training corpora. This implies fundamental ceilings on generalization from finite datasets, where loss cannot drop below the entropy rate of the source.
