from Wikipedia

In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.

The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome. As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable.

The Shannon information is closely related to entropy, which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average". This is the average amount of self-information an observer would expect to gain about a random variable when measuring it.[1]

The information content can be expressed in various units of information, of which the most common is the "bit" (more formally called the shannon), as explained below.

The term 'perplexity' has been used in language modelling to quantify the uncertainty inherent in a set of prospective events.[citation needed]

Definition

Claude Shannon's definition of self-information was chosen to meet several axioms:

  • An event with probability 100% is perfectly unsurprising and yields no information.
  • The less probable an event is, the more surprising it is and the more information it yields.
  • If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.

The detailed derivation is below, but it can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given a real number b > 1 and an event x with probability P, the information content is defined as the negative log probability:

I(x) := -\log_b(P) = \log_b(1/P).

The base b corresponds to the scaling factor above. Different choices of b correspond to different units of information: when b = 2, the unit is the shannon (symbol Sh), often called a 'bit'; when b = e, the unit is the natural unit of information (symbol nat); and when b = 10, the unit is the hartley (symbol Hart).
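As an illustrative sketch (not part of the original article), the following Python snippet converts a single probability into information content in each of the three units; the function name and the example probability are chosen only for the demonstration.

```python
import math

def information_content(p: float, base: float = 2.0) -> float:
    """Self-information -log_base(p) of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log(p, base)

p = 0.25
print(information_content(p, 2))        # 2.0 shannons (bits)
print(information_content(p, math.e))   # ~1.386 nats
print(information_content(p, 10))       # ~0.602 hartleys
```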

Formally, given a discrete random variable X with probability mass function p_X(x), the self-information of measuring X as outcome x is defined as:[2]

I_X(x) := -\log[p_X(x)] = \log(1/p_X(x)).

The use of the notation I_X(x) for self-information above is not universal. Since the notation I(X;Y) is also often used for the related quantity of mutual information, many authors use a lowercase h_X(x) for self-entropy instead, mirroring the use of the capital H(X) for the entropy.

Properties

Monotonically decreasing function of probability

For a given probability space, measurements of rarer events are intuitively more "surprising" and yield more information content than more "common" events. Thus, self-information is a strictly decreasing monotonic function of probability, sometimes called an "antitonic" function.[3]

While standard probabilities are represented by real numbers in the interval [0, 1], self-information values are non-negative extended real numbers in the interval [0, ∞]. Specifically:

  • An event with probability 1 (a certain event) has an information content of 0. Its occurrence is perfectly unsurprising and reveals no new information.
  • An event with probability 0 (an impossible event) has an information content of −log(0), which is undefined but is taken to be +∞ by convention. This reflects that observing an event believed to be impossible would be infinitely surprising.[4]

This monotonic relationship is fundamental to the use of information content as a measure of uncertainty. For example, learning that a one-in-a-million lottery ticket won provides far more information than learning it lost. (See also Lottery mathematics.) This also establishes an intuitive connection to concepts like statistical dispersion; events that are far from the mean or typical outcome (and thus have low probability in many common distributions) have high self-information.

Relationship to log-odds

The Shannon information is closely related to the log-odds. The log-odds of an event x, with probability p, is defined as the logarithm of the odds, log(p / (1 − p)). This can be expressed as a difference of two information content values:

log-odds(x) = I(¬x) − I(x),

where ¬x denotes the event that x does not occur.

This expression can be interpreted as the amount of information gained (or surprise) from learning the event did not occur, minus the information gained from learning it did occur. This connection is particularly relevant in statistical modeling where log-odds are the core of the logit function and logistic regression.[5]

Additivity of independent events

The information content of two independent events is the sum of each event's information content. This property is known as additivity in mathematics. Consider two independent random variables X and Y with probability mass functions p_X(x) and p_Y(y). The joint probability of observing the outcome (x, y) is given by the product of the individual probabilities due to independence:

p_{X,Y}(x, y) = p_X(x) p_Y(y).

The information content of this joint event is:

I_{X,Y}(x, y) = -\log[p_X(x) p_Y(y)] = -\log p_X(x) - \log p_Y(y) = I_X(x) + I_Y(y).

This additivity makes information content a more mathematically convenient measure than probability in many applications, such as in coding theory, where the amount of information needed to describe a sequence of independent symbols is the sum of the information needed for each symbol.[3]
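A minimal numerical check of this additivity, sketched here with two hypothetical independent fair-die outcomes (the probabilities and variable names are only for illustration):

```python
import math

def info(p: float) -> float:
    """Self-information in shannons (bits)."""
    return -math.log2(p)

p_x = 1 / 6   # probability the first die shows a particular face
p_y = 1 / 6   # probability the second die shows a particular face

joint = info(p_x * p_y)          # information of the joint outcome
summed = info(p_x) + info(p_y)   # sum of the individual informations

print(joint, summed)             # both ~5.1699 Sh
assert math.isclose(joint, summed)
```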

The corresponding property for likelihoods is that the log-likelihood of independent events is the sum of the log-likelihoods of each event. Interpreting log-likelihood as "support" or negative surprisal (the degree to which an event supports a given model: a model is supported by an event to the extent that the event is unsurprising, given the model), this states that independent events add support: the information that the two events together provide for statistical inference is the sum of their independent information.

Relationship to entropy

The Shannon entropy of the random variable X above is defined as:

H(X) = \sum_x -p_X(x) \log p_X(x) = \sum_x p_X(x) I_X(x) = \mathbb{E}[I_X(X)],

by definition equal to the expected information content of measurement of X.[6]: 11 [7]: 19–20

The expectation is taken over the discrete values of X over its support.
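As an illustrative sketch (not from the original article), the entropy can be computed directly as the probability-weighted average of self-information over a pmf supplied as a list:

```python
import math

def self_information(p: float) -> float:
    return -math.log2(p)

def entropy(pmf: list[float]) -> float:
    """Shannon entropy in shannons: expected self-information over the support."""
    return sum(p * self_information(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))      # 1.0 Sh  (fair coin)
print(entropy([1 / 6] * 6))     # ~2.585 Sh (fair die)
print(entropy([0.9, 0.1]))      # ~0.469 Sh (biased coin)
```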

Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies H(X) = I(X; X), where I(X; X) is the mutual information of X with itself.[8]

For continuous random variables the corresponding concept is differential entropy.

Examples

Fair coin toss

Consider the Bernoulli trial of tossing a fair coin X. The probabilities of the events of the coin landing as heads H and tails T (see fair coin and obverse and reverse) are one half each, p_X(H) = p_X(T) = 1/2. Upon measuring the variable as heads, the associated information gain is

I_X(H) = -\log_2 p_X(H) = -\log_2 (1/2) = 1,

so the information gain of a fair coin landing as heads is 1 shannon.[2] Likewise, the information gain of measuring tails T is

I_X(T) = -\log_2 p_X(T) = -\log_2 (1/2) = 1 Sh.

Fair die roll

Suppose we have a fair six-sided die. The value of a die roll is a discrete uniform random variable X with probability mass function p_X(k) = 1/6 for k ∈ {1, 2, 3, 4, 5, 6}. The probability of rolling a 4 is 1/6, as for any other valid roll. The information content of rolling a 4 is thus

I_X(4) = -\log_2 (1/6) = \log_2 6 ≈ 2.585 Sh

of information.

Two independent, identically distributed dice

Suppose we have two independent, identically distributed random variables X, Y each corresponding to an independent fair 6-sided die roll. The joint distribution of X and Y is

p_{X,Y}(x, y) = p_X(x) p_Y(y) = 1/36 for x, y ∈ {1, 2, 3, 4, 5, 6}, and 0 otherwise.

The information content of any particular random variate (X, Y) = (x, y) is

I_{X,Y}(x, y) = -\log_2 [p_{X,Y}(x, y)] = \log_2 36 ≈ 5.170 Sh,

and can also be calculated by additivity of events: I_X(x) + I_Y(y) = 2 \log_2 6 ≈ 5.170 Sh.

Information from frequency of rolls

If we receive information about the value of the dice without knowledge of which die had which value, we can formalize the approach with so-called counting variables

C_k := \delta_k(X) + \delta_k(Y)

for k ∈ {1, 2, 3, 4, 5, 6}, where \delta_k is the indicator of rolling k; then \sum_k C_k = 2 and the counts have the multinomial distribution

f(c_1, …, c_6) = Pr(C_1 = c_1, …, C_6 = c_6) = 1/18 when two of the c_k equal 1, 1/36 when exactly one c_k equals 2, and 0 otherwise.

To verify this, the 6 outcomes (X, Y) ∈ {(k, k)}, k = 1, …, 6, correspond to the event C_k = 2 and a total probability of 1/6. These are the only events that are faithfully preserved with identity of which die rolled which outcome because the outcomes are the same. Without knowledge to distinguish the dice rolling the other numbers, the other 15 combinations correspond to one die rolling one number and the other die rolling a different number, each having probability 1/18. Indeed, 6 · (1/36) + 15 · (1/18) = 1, as required.

Unsurprisingly, the information content of learning that both dice were rolled as the same particular number is more than the information content of learning that one die was one number and the other was a different number. Take for examples the events A_k = {(X, Y) = (k, k)} and B_{j,k} = {(X, Y) ∈ {(j, k), (k, j)}} for j ≠ k, 1 ≤ j, k ≤ 6. For example, A_2 = {X = 2 and Y = 2} and B_{3,4} = {(X, Y) ∈ {(3, 4), (4, 3)}}.

The information contents are

I(A_k) = -\log_2 (1/36) = \log_2 36 ≈ 5.170 Sh,
I(B_{j,k}) = -\log_2 (1/18) = \log_2 18 ≈ 4.170 Sh.

Let Same be the event that both dice rolled the same value and Diff be the event that the dice differed. Then Pr(Same) = 1/6 and Pr(Diff) = 5/6. The information contents of the events are

I(Same) = -\log_2 (1/6) = \log_2 6 ≈ 2.585 Sh,
I(Diff) = -\log_2 (5/6) = \log_2 (6/5) ≈ 0.263 Sh.
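A brute-force enumeration over the 36 equally likely outcome pairs (an illustrative sketch, not part of the original article) confirms these probabilities and information contents:

```python
import math
from itertools import product

def info(p: float) -> float:
    return -math.log2(p)

outcomes = list(product(range(1, 7), repeat=2))   # all 36 (x, y) pairs
p_same = sum(x == y for x, y in outcomes) / 36    # 6/36 = 1/6
p_diff = 1 - p_same                               # 5/6

print(info(1 / 36))   # ~5.170 Sh: a specific double, e.g. (2, 2)
print(info(1 / 18))   # ~4.170 Sh: an unordered pair of distinct values
print(info(p_same))   # ~2.585 Sh: any double
print(info(p_diff))   # ~0.263 Sh: the dice differ
```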

Information from sum of dice

The probability mass or density function (collectively probability measure) of the sum of two independent random variables is the convolution of each probability measure. In the case of independent fair 6-sided dice rolls, the random variable Z = X + Y has probability mass function p_Z(z) = p_X(x) * p_Y(y), where * represents the discrete convolution. The outcome Z = 5 has probability p_Z(5) = 4/36 = 1/9. Therefore, the information asserted is

I_Z(5) = -\log_2 (1/9) = \log_2 9 ≈ 3.170 Sh.
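For completeness, a short sketch (illustrative, under the same fair-dice assumption) builds the pmf of the sum by discrete convolution and reads off the information content of one outcome:

```python
import math
from collections import Counter
from itertools import product

die = {k: 1 / 6 for k in range(1, 7)}

# Discrete convolution of the two single-die pmfs.
pmf_sum = Counter()
for (x, px), (y, py) in product(die.items(), die.items()):
    pmf_sum[x + y] += px * py

p5 = pmf_sum[5]                  # 4/36 = 1/9
print(p5, -math.log2(p5))        # ~0.1111, ~3.170 Sh
```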

General discrete uniform distribution

Generalizing the § Fair die roll example above, consider a general discrete uniform random variable (DURV) X ~ DU[a, b] for integers a ≤ b. For convenience, define N := b − a + 1. The probability mass function is p_X(k) = 1/N for k ∈ [a, b], and 0 otherwise. In general, the values of the DURV need not be integers, or for the purposes of information theory even uniformly spaced; they need only be equiprobable.[2] The information gain of any observation X = k is

I_X(k) = -\log_2 (1/N) = \log_2 N Sh.

Special case: constant random variable

If b = a above, X degenerates to a constant random variable with probability distribution deterministically given by X = b and probability measure the Dirac measure δ_b. The only value X can take is deterministically b, so the information content of any measurement of X is

I_X(b) = -\log_2 1 = 0.

In general, there is no information gained from measuring a known value.[2]

Categorical distribution

Generalizing all of the above cases, consider a categorical discrete random variable X with support S = {s_1, …, s_N} and probability mass function given by

p_X(s_i) = p_i, with \sum_{i=1}^{N} p_i = 1.

For the purposes of information theory, the values do not have to be numbers; they can be any mutually exclusive events on a measure space of finite measure that has been normalized to a probability measure. Without loss of generality, we can assume the categorical distribution is supported on the set {1, 2, …, N}; the mathematical structure is isomorphic in terms of probability theory and therefore information theory as well.

The information of the outcome X = s_i is given by

I_X(s_i) = -\log_2 p_i.

From these examples, it is possible to calculate the information of any set of independent DRVs with known distributions by additivity.
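As an illustrative sketch (the category names and probabilities below are made up for the example), the per-outcome information of a categorical distribution follows directly from its pmf, and additivity extends it to independent draws:

```python
import math

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # hypothetical categories

info = {s: -math.log2(p) for s, p in pmf.items()}
print(info)   # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0} shannons

# Additivity: information of two independent draws is the sum.
print(info["a"] + info["c"])              # 4.0 Sh
print(-math.log2(pmf["a"] * pmf["c"]))    # 4.0 Sh
```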

Derivation

By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero. Only when the advance knowledge of the content of the message by the receiver is less than 100% certain does the message actually convey information.

For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin:

Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning.[13]

Assuming that one does not reside near the polar regions, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.

Accordingly, the amount of self-information contained in a message conveying an occurrence of event ω depends only on the probability of that event:

I(ω) = f(P(ω))

for some function f to be determined. If P(ω) = 1, then I(ω) = 0. If P(ω) < 1, then I(ω) > 0.

Further, by definition, the measure of self-information is nonnegative and additive. If an event C is the intersection of two independent events A and B, then the information of event C occurring is the sum of the amounts of information of the individual events A and B:

I(C) = I(A ∩ B) = I(A) + I(B).

Because of the independence of events A and B, the probability of event C is:

P(C) = P(A ∩ B) = P(A) · P(B).

Relating the probabilities to the function f:

f(P(A) · P(B)) = f(P(A)) + f(P(B)).

This is a functional equation. The only continuous functions with this property are the logarithm functions. Therefore, f must be of the form

f(p) = K \log p

for some base of logarithm and constant K. Since a low-probability event must correspond to high information content, the constant K must be negative. We can take K = −1 and absorb any scaling into the base b of the logarithm. This gives the final form:

I(ω) = -\log_b P(ω) = \log_b (1/P(ω)).

The smaller the probability of event ω, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of I(ω) is the shannon. This is the most common practice. When using the natural logarithm of base e, the unit will be the nat. For the base 10 logarithm, the unit of information is the hartley.

As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a coin would be 4 shannons (probability 1/16), and the information content associated with getting a result other than the one specified would be −log_2(15/16) ≈ 0.09 shannons (probability 15/16). See above for detailed examples.
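A two-line numerical check of this illustration (a sketch, not part of the article):

```python
import math

p_all_heads = (1 / 2) ** 4             # probability 1/16 of a specified 4-toss outcome
print(-math.log2(p_all_heads))         # 4.0 shannons
print(-math.log2(1 - p_all_heads))     # ~0.093 shannons for "any other result"
```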

from Grokipedia
Information content, also known as self-information or surprisal, is a fundamental concept in information theory that quantifies the amount of uncertainty or surprise associated with a specific outcome of a random event, measured in bits. It is mathematically defined as the negative base-2 logarithm of the probability p of the event occurring, given by the formula I = -\log_2 p, where rarer events (lower p) yield higher information content. For instance, the outcome of a fair coin flip has an information content of 1 bit, since p = 0.5 and -\log_2 0.5 = 1, while a highly probable event like a biased coin landing on its favored side (e.g., p = 0.9) carries only about 0.15 bits. This measure was introduced by Claude Shannon in his seminal 1948 paper "A Mathematical Theory of Communication," where it serves as the building block for entropy, the expected value of information content over all possible outcomes of a random variable. Entropy H is thus calculated as H = \sum_i p_i (-\log_2 p_i), representing the average information content and providing a limit on the efficiency of data compression and transmission. Key properties include its additivity for independent events (the total information content is the sum of the individual ones) and its role in distinguishing meaningful signals from noise in communication systems.

Beyond its origins in electrical engineering, information content has broad applications in fields such as statistics, computer science, and neuroscience, where it models uncertainty in patterns, algorithmic complexity, and neural signaling. For example, in data compression, sources with high entropy require more bits to encode without loss, while low-entropy, predictable sources can be compressed efficiently. The concept underscores that information is not just volume but the resolution of uncertainty, influencing modern technologies like error-correcting codes and models that optimize for informational efficiency.
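A quick sketch (not part of the original text) reproducing the two numbers quoted above:

```python
import math

print(-math.log2(0.5))   # 1.0 bit for a fair coin flip
print(-math.log2(0.9))   # ~0.152 bits for a heavily favored outcome
```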

Basic Concepts

Formal Definition

In information theory, the information content, also known as self-information, quantifies the amount of information associated with the occurrence of a specific outcome in a discrete probability space. For a discrete random variable X taking values in a finite or countably infinite set \mathcal{X}, the information content of the event \{X = x\}, where x \in \mathcal{X}, is defined as

I(X = x) = -\log_b P(X = x),

with P(X = x) denoting the probability of the event and b > 1 the base of the logarithm. This formulation applies specifically to discrete probability distributions, where P(X = x) is given by the probability mass function of X. A common notation convention uses the lowercase function i(x) = -\log_b p(x), with p(x) = P(X = x). The choice of base b determines the units: b = 2 yields bits, while b = e yields nats. A special case occurs when P(X = x) = 1, corresponding to a certain event, in which case I(X = x) = 0. This reflects that no additional information is conveyed by an outcome that is guaranteed to happen.

Interpretation as Surprisal

The information content of an event, often interpreted as surprisal, quantifies the degree of surprise or unexpectedness associated with its occurrence. Low-probability events are highly surprising and therefore carry substantial information content, as their realization resolves a greater degree of prior doubt. In contrast, high-probability events are anticipated and contribute only minimal information, reflecting their lack of novelty. This perspective emphasizes that information arises from the resolution of uncertainty rather than mere occurrence.

The foundational concept of information as a measure of surprise originated in Claude Shannon's 1948 paper "A Mathematical Theory of Communication," which described the information provided by an outcome in probabilistic terms to model efficient communication systems. The term "surprisal" itself, denoting this specific quantity, was coined by Myron Tribus in his 1961 book Thermostatics and Thermodynamics, where he applied information-theoretic ideas to physical systems and engineering contexts. This terminology has since become standard in discussions of self-information.

Surprisal differs from broader measures of uncertainty, such as entropy, in that it evaluates the surprise of a single, specific outcome rather than the overall unpredictability across an entire distribution. While the latter captures average expected surprise, surprisal focuses on the instantaneous information yield from one realization. This distinction highlights surprisal's role in pinpointing event-specific informativeness. Conceptually, surprisal represents the reduction in uncertainty achieved upon observing the event, transforming prior probabilistic expectations into a definite state. This aligns with the formal definition of information content as the negative logarithm of the event's probability, underscoring its interpretive value in both communication and inference.

Mathematical Properties

Monotonicity with Respect to Probability

The information content of an event with probability p, denoted I(p) = -\log p, is a monotonically decreasing function of p for 0 < p ≤ 1. Specifically, I(1) = 0, reflecting that a certain event carries no information, and \lim_{p \to 0^+} I(p) = \infty, indicating that an impossible event would provide infinite information if observed. This monotonicity holds regardless of the logarithmic base used, as long as it is greater than 1.

To see why I(p) is monotonically decreasing, consider the negative logarithm function f(x) = -\log x, which is strictly decreasing on the interval (0, 1] because the logarithm itself is strictly increasing. Evaluating it at the probability p, which increases over that interval, preserves the decreasing behavior. Thus, as the probability p increases, the information content I(p) decreases.

This property has key implications for understanding information: rarer events, characterized by smaller p, yield higher information content upon realization, as they are more surprising. In contrast, highly probable events provide minimal or no information, aligning with intuitive notions of surprise in communication. Qualitatively, the curve of I(p) versus p (using the natural logarithm, for instance) resembles a hyperbola, starting asymptotically from infinity near p = 0 and curving downward to reach zero at p = 1. This shape underscores the unbounded growth of information for vanishing probabilities while ensuring finite values for all practical p > 0.

Additivity for Independent Events

One key mathematical property of information content is its additivity when applied to independent events. For two independent random variables X and Y, the information content of the joint outcome (X = x, Y = y) equals the sum of the individual information contents:

I(X = x, Y = y) = I(X = x) + I(Y = y).

This holds because independence ensures that the joint probability factors multiplicatively, preserving the logarithmic structure of the measure.

The derivation follows directly from the definition of information content. Given independence, the joint probability satisfies P(X = x, Y = y) = P(X = x) · P(Y = y). Substituting into the formula yields

I(X = x, Y = y) = -\log P(X = x, Y = y) = -\log \bigl( P(X = x) \cdot P(Y = y) \bigr) = -\log P(X = x) - \log P(Y = y) = I(X = x) + I(Y = y),

where the logarithm's additive property over products is key. This result underscores the measure's compatibility with probabilistic independence.

This additivity enables the decomposition of joint information into independent marginal components, a principle central to information theory's applications in coding and compression. For instance, it justifies assigning code lengths in source coding that sum across independent symbols, optimizing average code length to approach the entropy bound. The property generalizes to any finite collection of independent random variables X_1, X_2, \dots, X_n, where the joint information content is the linear sum: I(X_1 = x_1, \dots, X_n = x_n) = \sum_{i=1}^n I(X_i = x_i). This extension supports scalable analyses in multi-source communication systems.

Relationship to Log-Odds

In the context of binary events, where an outcome occurs with probability p and fails to occur with probability 1 − p, the log-odds is defined as the logarithm of the odds, given by \log \left( \frac{p}{1-p} \right). This measure quantifies the relative likelihood of the event versus its complement on a logarithmic scale, providing a transformation that maps probabilities in (0, 1) to the entire real line. The information content, or surprisal, for a binary event with probability p is I(p) = -\log p, representing the surprise associated with observing the event. Similarly, the surprisal for the complementary event is I(1-p) = -\log (1-p). The log-odds can then be expressed directly in terms of these surprisals as

\log \left( \frac{p}{1-p} \right) = I(1-p) - I(p).

This relation highlights the log-odds as the difference between the surprisals of the two possible outcomes, underscoring an inherent asymmetry in the information content unless p = 0.5, where I(p) = I(1-p) and the log-odds is zero. Additionally, the sum of the surprisals for the binary outcomes yields I(p) + I(1-p) = \log \left( \frac{1}{p(1-p)} \right), which captures the total information scale for the partition into success and failure. This tie emphasizes how information content structures the uncertainty across the binary possibilities.

This connection finds practical application in logistic regression, where model coefficients are interpreted as changes in the log-odds of the outcome given predictors, effectively modeling shifts in the relative surprisals between classes. Such odds-based measures extend to information criteria in model selection, where deviations in surprisal differences inform predictive asymmetry. The framework highlights the non-symmetric nature of surprisal in binary scenarios, influencing how improbable events contribute disproportionately to log-odds variations. While log-odds is inherently tied to binary partitions, the underlying concept of contrasting surprisals informs broader information measures for non-binary cases, such as in partition-based decompositions where outcomes are grouped into mutually exclusive categories.
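A short sketch (illustrative, not from the text) confirming that the log-odds equals the difference of the two surprisals; the probability p = 0.8 is chosen arbitrarily:

```python
import math

def surprisal(p: float) -> float:
    return -math.log2(p)

p = 0.8
log_odds = math.log2(p / (1 - p))
diff = surprisal(1 - p) - surprisal(p)

print(log_odds, diff)              # both 2.0 bits
assert math.isclose(log_odds, diff)
```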

Connections to Information Theory

Relation to Entropy

The Shannon entropy H(X) of a discrete random variable X emerges as the expected value of the information content I(X = x) over the probability distribution P(X = x). This relationship is expressed mathematically as

H(X) = \sum_x P(X = x) \, I(X = x) = -\sum_x P(X = x) \log P(X = x),

where the summation is taken over all possible outcomes x in the support of X. This formulation arises because entropy serves to quantify the average uncertainty inherent in the random variable X, while the information content provides the specific measure of uncertainty resolved upon observing a particular outcome x. By computing the expectation, H(X) averages the surprisal across outcomes, weighted by their probabilities, thereby capturing the overall informational unpredictability of the distribution.

Through this expectation, entropy inherits essential properties from the information content function. For instance, the non-negativity of I(X = x) ensures that H(X) ≥ 0, reflecting that average uncertainty cannot be negative. Similarly, the additivity of information content for independent events transfers to entropy, yielding H(X, Y) = H(X) + H(Y) when X and Y are independent random variables. Claude Shannon originally derived the formula in 1948 by positing axioms for an uncertainty measure, including continuity in probabilities, monotonicity under refinement of partitions, and additivity for independent ensembles, which uniquely determine the surprisal-based expression.
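A small numerical check of the inherited additivity property (an illustrative sketch; the fair-coin and fair-die distributions are chosen only as convenient examples):

```python
import math
from itertools import product

def entropy(pmf: list[float]) -> float:
    return -sum(p * math.log2(p) for p in pmf if p > 0)

p_x = [0.5, 0.5]       # fair coin
p_y = [1 / 6] * 6      # fair die

joint = [px * py for px, py in product(p_x, p_y)]   # independent joint pmf
print(entropy(joint))                               # ~3.585
print(entropy(p_x) + entropy(p_y))                  # ~3.585
```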

Units of Information

The unit of information content depends on the base b of the logarithm in its definition I(p) = -\log_b p. When b = 2, the resulting unit is the bit (also called the shannon). For b = e, the unit is the nat. When b = 10, the unit is the ban (also known as the decit or hartley, though rarely used in contemporary work). Conversions between these units follow from the change-of-base formula for logarithms. Specifically, 1 nat = \log_2 e ≈ 1.4427 bits, while 1 bit = \log_e 2 ≈ 0.693 nats. Likewise, 1 ban = \log_2 10 ≈ 3.3219 bits, while 1 bit ≈ 0.3010 bans. In practice, bits serve as the conventional unit for digital systems and communication applications, aligning with binary representations, whereas nats are favored in theoretical and continuous-domain analyses for their compatibility with natural logarithms. These units extend to the entropy H(X), the expected information content of a random variable X, where the numerical value scales with the base but preserves conceptual properties. The choice of base does not affect qualitative aspects of information content, such as its monotonicity in probability, since the measure differs only by a multiplicative constant.

Illustrative Examples

Fair Coin Toss and Die Roll

In the case of a fair coin toss, each outcome, heads or tails, occurs with probability p = 0.5. The information content for observing heads (or tails) is given by I = -\log_2(0.5) = 1 bit, quantifying the surprisal of an event that is equally likely. The average information content, or entropy, over both outcomes is thus 1 bit, reflecting the uncertainty resolved by the toss. For a fair six-sided die roll, each face has probability p = 1/6. The information content for any specific face is I = -\log_2(1/6) ≈ 2.585 bits, indicating greater surprisal due to the lower probability compared to the coin toss. This value is the same for all faces, and the average entropy is approximately 2.585 bits. Coin outcomes carry less information content than die outcomes because their higher probability makes them less surprising, consistent with the monotonic decrease of information content as probability increases. The following table compares the probabilities and information contents for these uniform cases:
Experiment            Number of Outcomes   Probability per Outcome   Information Content per Outcome (bits)
Fair Coin Toss        2                    0.5                       1
Fair Six-Sided Die    6                    1/6 (≈0.1667)             ≈2.585

Dice Rolls: Frequency and Sum

When two fair six-sided dice are rolled independently, each specific outcome pair, such as (1,1) or (3,4), has a probability of 1/36, yielding an information content of I = -\log_2(1/36) = \log_2(36) ≈ 5.17 bits. This value arises from the additivity property for independent events, where the self-information of the joint outcome equals the sum of the self-informations of the individual rolls: 2 × \log_2(6) ≈ 2 × 2.585 = 5.17 bits, i.e., about 2.585 bits per die. In contrast, consider the information content associated with the sum of the two dice faces, which ranges from 2 to 12 but with unequal probabilities due to the varying number of ways each sum can occur. For instance, the sum of 7 has the highest probability of 6/36 = 1/6, resulting in I = \log_2(6) ≈ 2.585 bits, reflecting its relatively unsurprising nature. Conversely, the sum of 2 (or 12) has the lowest probability of 1/36, giving I ≈ 5.17 bits, which conveys more surprise as an outcome. This demonstrates how the information content for a derived random variable like the sum depends solely on its marginal probability, independent of the underlying joint distribution details. The following table summarizes the probabilities and corresponding information contents for all possible sums of two fair dice:
Sum   Probability p   Information Content I = -log2(p) (bits)
2     1/36            ≈5.170
3     2/36            ≈4.170
4     3/36            ≈3.585
5     4/36            ≈3.170
6     5/36            ≈2.847
7     6/36            ≈2.585
8     5/36            ≈2.847
9     4/36            ≈3.170
10    3/36            ≈3.585
11    2/36            ≈4.170
12    1/36            ≈5.170
These values are computed using the standard formula for self-information, with probabilities derived from the 36 equally likely outcomes. For repeated independent dice rolls, the information content can also quantify the surprisal of observing a specific distribution of outcomes, such as the exact number of times a particular face (e.g., six) appears in n rolls. This follows a binomial distribution, where the probability of exactly k successes (sixes) is P(K = k) = \binom{n}{k} \left(\frac{1}{6}\right)^k \left(\frac{5}{6}\right)^{n-k}, and the self-information is I = -\log_2 P(K = k). For example, in 2 rolls, observing exactly 2 sixes has P = (1/6)^2 = 1/36, so I ≈ 5.17 bits, aligning with the joint outcome surprisal for a double six due to independence. Observing 0 sixes yields P = (5/6)^2 = 25/36, with I ≈ 0.526 bits, indicating low surprise for this more probable outcome. Such applications highlight how self-information extends to aggregated counts in sequential trials, emphasizing rarity in frequency observations.
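An illustrative sketch (not part of the text) computing the surprisal of these six-counts directly from the binomial probability; the helper function name is arbitrary:

```python
import math
from math import comb

def count_surprisal(n: int, k: int, p: float = 1 / 6) -> float:
    """Self-information, in bits, of seeing exactly k successes in n trials."""
    prob = comb(n, k) * p**k * (1 - p) ** (n - k)
    return -math.log2(prob)

print(count_surprisal(2, 2))   # ~5.170 bits: two sixes in two rolls
print(count_surprisal(2, 0))   # ~0.526 bits: no sixes in two rolls
print(count_surprisal(2, 1))   # ~1.848 bits: exactly one six
```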

Uniform and Categorical Distributions

In the context of discrete probability distributions, the information content associated with a specific outcome in a uniform distribution is constant across all possible outcomes. For a discrete uniform distribution over n equally likely outcomes, each with probability P(x_i) = 1/n, the information content of any outcome x_i is given by I(x_i) = -\log_b(1/n) = \log_b n, where b is the base of the logarithm (commonly 2 for bits). This constancy reflects the inherent symmetry of the distribution, where no outcome is more or less surprising than another. A special case arises when n = 1, corresponding to a constant random variable with probability 1 for the single outcome. Here, the information content is I(x) = -\log_b 1 = 0, indicating that observing the certain outcome conveys no new information, as it is fully anticipated.

The categorical distribution extends this concept to cases where outcomes have unequal probabilities p_1, p_2, \dots, p_k that sum to 1. The information content for the i-th outcome is I_i = -\log_b p_i, which varies depending on the specific probability p_i; rarer outcomes (smaller p_i) yield higher information content due to their greater surprise value. This variability aligns with the monotonicity property, where information content increases as the probability decreases. The uniform distribution represents a special instance of the categorical distribution in which all p_i = 1/k are equal, resulting in uniform information content \log_b k for every outcome and eliminating heterogeneity. In contrast, a general categorical distribution exhibits heterogeneous information content, with values differing across categories based on their probabilities, thereby capturing nuanced differences in surprise across outcomes.

Derivation and Justification

Axiomatic Approach

The axiomatic approach defines the information content, or surprisal, I(p) of an event with probability p through a set of properties analogous to those Shannon proposed for entropy. These include continuity of I(p) with respect to p \in (0, 1], monotonicity such that I(p) is nonincreasing in p (with I(1) = 0), and additivity for the probabilities of independent events: I(pq) = I(p) + I(q) for all 0 < p, q ≤ 1. These axioms capture the intuitive notion that information should accumulate additively under independence, vary smoothly, and be zero for certain events while increasing as events become less likely.

The additivity axiom I(pq) = I(p) + I(q) forms a multiplicative Cauchy functional equation. Assuming continuity (or the weaker monotonicity condition), the general solution is I(p) = -c \log p for some constant c > 0, where the logarithm can be in any base and the sign ensures nonnegativity. Monotonicity further guarantees c > 0, as I(p) must increase as p decreases. The constant c corresponds to a scaling factor that determines the unit of information, such as bits for the base-2 logarithm with c = 1. This form originates directly from Claude Shannon's 1948 foundational paper, where the surprisal of an individual outcome satisfies the corresponding axioms at the single-event level, leading to the logarithmic measure of surprise. Up to the choice of base (which scales c), this logarithmic function is the unique continuous solution satisfying the axioms.

Probabilistic Foundations

In a probabilistic setting, the information content I(x) of an outcome x is defined as the negative logarithm of its probability, I(x) = -\log p(x), where p(x) is the probability of x under the given distribution. This measures the surprise or amount of uncertainty resolved by observing x, with rarer outcomes (lower p(x)) carrying higher information content. This definition extends to model evaluation, such as in hypothesis testing, where a low probability p(x) under a hypothesized model implies high information content, signaling a poor fit and motivating model revision or rejection. In this sense, I(x) captures the degree to which the observed data deviates from expectations under the model.

A modern extension connects this probabilistic information content to algorithmic information theory, where I(x) = -\log p(x) under a probabilistic model approximates the description length of x, akin to Kolmogorov complexity K(x), which quantifies the length of the shortest program generating x; under a universal prior, the two measures converge for typical sequences. The definition ensures desirable properties aligned with the axioms above: non-negativity holds since 0 < p(x) ≤ 1 implies -\log p(x) ≥ 0 (for logarithm base greater than 1), reflecting that no event can convey negative information, while for impossible events with p(x) = 0, I(x) = \infty, indicating infinite surprise, as such outcomes defy the model. For continuous random variables, the analog defines I(x) = -\log f(x), where f(x) is the probability density function, which under averaging yields the differential entropy, though the primary focus remains on discrete cases. This probabilistic motivation aligns with the axiomatic form I(x) = -\log p(x).