Exponential family

from Wikipedia

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience: useful algebraic properties allow expectations and covariances to be calculated by differentiation. It is also chosen for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family",[1] as is the older term Koopman–Darmois family. Sometimes loosely referred to as "the" exponential family, this class of distributions is distinct because its members all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

The concept of exponential families is credited to[2] E. J. G. Pitman,[3] G. Darmois,[4] and B. O. Koopman[5] in 1935–1936. Exponential families of distributions provide a general framework for selecting a possible alternative parameterisation of a parametric family of distributions, in terms of natural parameters, and for defining useful sample statistics, called the natural sufficient statistics of the family.

Nomenclature difficulty

The terms "distribution" and "family" are often used loosely: Specifically, an exponential family is a set of distributions, where the specific distribution varies with the parameter;[a] however, a parametric family of distributions is often referred to as "a distribution" (like "the normal distribution", meaning "the family of normal distributions"), and the set of all exponential families is sometimes loosely referred to as "the" exponential family.

Definition

Most of the commonly used distributions form an exponential family or subset of an exponential family, listed in the subsection below. The subsections following it are a sequence of increasingly general mathematical definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.

Examples of exponential family distributions

Exponential families include many of the most common distributions. Among many others, exponential families include the following:[6]

  • normal
  • exponential
  • gamma
  • chi-squared
  • beta
  • Dirichlet
  • Bernoulli
  • categorical
  • Poisson
  • Wishart
  • inverse Wishart
  • geometric

A number of common distributions are exponential families, but only when certain parameters are fixed and known. For example:

  • binomial (with fixed number of trials)
  • multinomial (with fixed number of trials)
  • negative binomial (with fixed number of failures)

Note that in each case, the parameters which must be fixed are those that set a limit on the range of values that can possibly be observed.

Examples of common distributions that are not exponential families are Student's t, most mixture distributions, and even the family of uniform distributions when the bounds are not fixed. See the section below on examples for more discussion.

Scalar parameter

A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, for the case of a discrete distribution) can be expressed in the form

f_X(x | θ) = h(x) exp[η(θ) · T(x) − A(θ)]

where T(x), h(x), η(θ), and A(θ) are known functions. The function h(x) must be non-negative. The value of θ is called the parameter of the family.

An alternative, equivalent form often given is

f_X(x | θ) = h(x) g(θ) exp[η(θ) · T(x)]

or equivalently

f_X(x | θ) = exp[η(θ) · T(x) − A(θ) + B(x)]

In terms of log probability,

log f_X(x | θ) = η(θ) · T(x) − A(θ) + B(x).

Note that g(θ) = e^{−A(θ)} and h(x) = e^{B(x)}.
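As a concrete illustration of this form, the following minimal Python sketch (not part of the original text) writes the Poisson distribution as a single-parameter exponential family with h(x) = 1/x!, T(x) = x, η(λ) = log λ, and A(λ) = λ, and checks that it reproduces the usual pmf.

```python
# Illustrative sketch: the Poisson pmf in the form h(x) * exp[eta(theta) * T(x) - A(theta)].
import math

def poisson_expfam_pmf(x: int, lam: float) -> float:
    h = 1.0 / math.factorial(x)        # base measure h(x)
    eta = math.log(lam)                # natural parameter eta(theta)
    T = x                              # sufficient statistic T(x)
    A = lam                            # log-partition A(theta)
    return h * math.exp(eta * T - A)

def poisson_pmf(x: int, lam: float) -> float:
    return math.exp(-lam) * lam ** x / math.factorial(x)

for x in range(6):
    assert abs(poisson_expfam_pmf(x, 2.5) - poisson_pmf(x, 2.5)) < 1e-12
print("exponential-family form matches the usual Poisson pmf")
```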

Support must be independent of θ

Importantly, the support of f_X(x | θ) (the set of all x for which f_X(x | θ) is greater than 0) is required not to depend on θ.[7] This requirement can be used to exclude a parametric family distribution from being an exponential family.

For example: the Pareto distribution has a pdf which is defined for x ≥ x_m (the minimum value x_m being the scale parameter), and its support therefore has a lower limit of x_m. Since the support depends on the value of the parameter x_m, the family of Pareto distributions does not form an exponential family of distributions (at least when x_m is unknown).

Another example: Bernoulli-type distributions – binomial, negative binomial, geometric distribution, and similar – can only be included in the exponential class if the number of Bernoulli trials, n, is treated as a fixed constant – excluded from the free parameter(s) – since the allowed number of trials sets the limits for the number of "successes" or "failures" that can be observed in a set of trials.

Vector valued x and θ

Often x is a vector of measurements, in which case T(x) may be a function from the space of possible values of x to the real numbers.

More generally, η(θ) and T(x) can each be vector-valued such that η(θ) · T(x) is real-valued. However, see the discussion below on vector parameters, regarding the curved exponential family.

Canonical formulation

If η(θ) = θ, then the exponential family is said to be in canonical form. By defining a transformed parameter η = η(θ), it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since η(θ) can be multiplied by any nonzero constant, provided that T(x) is multiplied by that constant's reciprocal, or a constant c can be added to η(θ) and h(x) multiplied by exp[−c · T(x)] to offset it. In the special case that η(θ) = θ and T(x) = x, the family is called a natural exponential family.

Even when x is a scalar, and there is only a single parameter, the functions η(θ) and T(x) can still be vectors, as described below.

The function A(θ), or equivalently g(θ), is automatically determined once the other functions have been chosen, since it must assume a form that causes the distribution to be normalized (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of η, even when η(θ) is not a one-to-one function, i.e. two or more different values of θ map to the same value of η(θ), and hence η(θ) cannot be inverted. In such a case, all values of θ mapping to the same η(θ) will also have the same value for A(θ) and g(θ).

Factorization of the variables involved

What is important to note, and what characterizes all exponential family variants, is that the parameter(s) and the observation variable(s) must factorize (can be separated into products each of which involves only one type of variable), either directly or within either part (the base or exponent) of an exponentiation operation. Generally, this means that all of the factors constituting the density or mass function must be of one of the following forms:

f(x),  g(θ),  c^{f(x)},  c^{g(θ)},  [f(x)]^c,  [g(θ)]^c,  [f(x)]^{g(θ)},  [g(θ)]^{f(x)},  [f(x)]^{h(x) j(θ)},  or  [g(θ)]^{h(x) j(θ)},

where f and h are arbitrary functions of x, the observed statistical variable; g and j are arbitrary functions of the fixed parameters defining the shape of the distribution; and c is any arbitrary constant expression (i.e. a number or an expression that does not change with either x or θ).

There are further restrictions on how many such factors can occur. For example, the two expressions

[f(x) g(θ)]^{h(x) j(θ)}   and   [f(x)]^{h(x) j(θ)} [g(θ)]^{h(x) j(θ)}

are the same, i.e. a product of two "allowed" factors. However, when rewritten into the factorized form,

[f(x) g(θ)]^{h(x) j(θ)} = e^{h(x) j(θ) log f(x) + h(x) j(θ) log g(θ)},

it can be seen that it cannot be expressed in the required form. (However, a form of this sort is a member of a curved exponential family, which allows multiple factorized terms in the exponent.[citation needed])

To see why an expression of the form

[f(x)]^{g(θ)}

qualifies,

[f(x)]^{g(θ)} = e^{g(θ) log f(x)}

and hence factorizes inside of the exponent. Similarly,

[f(x)]^{h(x) j(θ)} = e^{h(x) j(θ) log f(x)}

and again factorizes inside of the exponent.

A factor consisting of a sum where both types of variables are involved (e.g. a factor of the form 1 + f(x) g(θ)) cannot be factorized in this fashion (except in some cases where it occurs directly in an exponent); this is why, for example, the Cauchy distribution and Student's t distribution are not exponential families.

Vector parameter

The definition in terms of one real-number parameter can be extended to one real-vector parameter θ = (θ_1, θ_2, ..., θ_s)ᵀ.

A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as

f_X(x | θ) = h(x) exp( Σ_{i=1}^{s} η_i(θ) T_i(x) − A(θ) )

or in a more compact form,

f_X(x | θ) = h(x) exp( η(θ) · T(x) − A(θ) )

This form writes the sum as a dot product of vector-valued functions η(θ) and T(x).

An alternative, equivalent form often seen is

f_X(x | θ) = h(x) g(θ) exp( η(θ) · T(x) )

As in the scalar valued case, the exponential family is said to be in canonical form if η_i(θ) = θ_i for all i.

A vector exponential family is said to be curved if the dimension of

θ = (θ_1, θ_2, ..., θ_d)ᵀ

is less than the dimension of the vector

η(θ) = (η_1(θ), η_2(θ), ..., η_s(θ))ᵀ.

That is, if the dimension, d, of the parameter vector is less than the number of functions, s, of the parameter vector in the above representation of the probability density function. Most common distributions in the exponential family are not curved, and many algorithms designed to work with any exponential family implicitly or explicitly assume that the distribution is not curved.

Just as in the case of a scalar-valued parameter, the function A(θ), or equivalently g(θ), is automatically determined by the normalization constraint, once the other functions have been chosen. Even if η(θ) is not one-to-one, the functions A(η) and g(η) can be defined by requiring that the distribution is normalized for each value of the natural parameter η. This yields the canonical form

f_X(x | η) = h(x) exp( η · T(x) − A(η) )

or equivalently

f_X(x | η) = h(x) g(η) exp( η · T(x) )

The above forms may sometimes be seen with ηᵀT(x) in place of η · T(x). These are exactly equivalent formulations, merely using different notation for the dot product.

Vector parameter, vector variable

The vector-parameter form over a single scalar-valued random variable can be trivially expanded to cover a joint distribution over a vector of random variables. The resulting distribution is simply the same as the above distribution for a scalar-valued random variable with each occurrence of the scalar x replaced by the vector x = (x_1, x_2, ..., x_k)ᵀ.

The dimension k of the random variable need not match the dimension d of the parameter vector, nor (in the case of a curved exponential family) the dimension s of the natural parameter η and sufficient statistic T(x).

The distribution in this case is written as

f_X(x | θ) = h(x) exp( Σ_{i=1}^{s} η_i(θ) T_i(x) − A(θ) )

or more compactly as

f_X(x | θ) = h(x) exp( η(θ) · T(x) − A(θ) )

or alternatively as

f_X(x | θ) = h(x) g(θ) exp( η(θ) · T(x) )

Measure-theoretic formulation

We use cumulative distribution functions (CDF) in order to encompass both discrete and continuous distributions.

Suppose H is a non-decreasing function of a real variable. Then Lebesgue–Stieltjes integrals with respect to dH(x) are integrals with respect to the reference measure of the exponential family generated by H.

Any member of that exponential family has cumulative distribution function

dF(x | η) = exp( η · T(x) − A(η) ) dH(x).

H(x) is a Lebesgue–Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is absolutely continuous with a density f(x) with respect to a reference measure dx (typically Lebesgue measure), one can write dF(x) = f(x) dx. In this case, H is also absolutely continuous and can be written dH(x) = h(x) dx, so the formulas reduce to those of the previous paragraphs. If F is discrete, then H is a step function (with steps on the support of F).

Alternatively, we can write the probability measure directly as

P(dx | η) = exp( η · T(x) − A(η) ) μ(dx)

for some reference measure μ.

Interpretation

In the definitions above, the functions T(x), η(θ), and A(η) were arbitrary. However, these functions have important interpretations in the resulting probability distribution.

  • T(x) is a sufficient statistic of the distribution. For exponential families, the sufficient statistic is a function of the data that holds all information the data x provides with regard to the unknown parameter values. This means that, for any data sets x and y, the likelihood ratio is the same, that is f(x; θ1) / f(x; θ2) = f(y; θ1) / f(y; θ2), if T(x) = T(y). This is true even if x and y are not equal to each other. The dimension of T(x) equals the number of parameters of θ and encompasses all of the information regarding the data related to the parameter θ. The sufficient statistic of a set of independent identically distributed data observations is simply the sum of individual sufficient statistics, and encapsulates all the information needed to describe the posterior distribution of the parameters, given the data (and hence to derive any desired estimate of the parameters). (This important property is discussed further below.)
  • η is called the natural parameter. The set of values of η for which the function f_X(x | η) is integrable is called the natural parameter space. It can be shown that the natural parameter space is always convex.
  • A(η) is called the log-partition function[b] because it is the logarithm of a normalization factor, without which f_X(x | θ) would not be a probability distribution:

A(η) = log ∫ h(x) exp( η(θ) · T(x) ) dx

The function A is important in its own right, because the mean, variance and other moments of the sufficient statistic T(x) can be derived simply by differentiating A(η). For example, because log x is one of the components of the sufficient statistic of the gamma distribution, E[log x] can be easily determined for this distribution using A(η). Technically, this is true because K(u | η) = A(η + u) − A(η) is the cumulant generating function of the sufficient statistic.
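To make the "moments by differentiation" point concrete, the following small numerical sketch (illustrative, not from the article) uses the Bernoulli family, for which A(η) = log(1 + e^η) and T(x) = x, so the first derivative of A should equal the mean p and the second derivative the variance p(1 − p).

```python
# Central finite differences of the Bernoulli log-partition function A(eta).
import math

def A(eta: float) -> float:
    return math.log(1.0 + math.exp(eta))

p = 0.3
eta = math.log(p / (1.0 - p))          # natural parameter (logit of p)
h = 1e-5                               # finite-difference step

dA = (A(eta + h) - A(eta - h)) / (2 * h)                  # ~ E[T(X)] = p
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h ** 2     # ~ Var[T(X)] = p(1-p)

print(f"mean:     {dA:.6f}  (exact {p})")
print(f"variance: {d2A:.6f}  (exact {p * (1 - p)})")
```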

Properties

Exponential families have a large number of properties that make them extremely useful for statistical analysis. In many cases, it can be shown that only exponential families have these properties. Examples:

Given an exponential family defined by f_X(x | θ) = h(x) exp[θ · T(x) − A(θ)], where Θ is the parameter space, such that Θ ⊆ ℝ^k. Then

  • If Θ has nonempty interior in ℝ^k, then given any IID samples X_1, ..., X_n ~ f_X(x | θ), the statistic T(X_1, ..., X_n) := Σ_{i=1}^n T(X_i) is a complete statistic for θ.[9][10]
  • T is a minimal statistic for θ if and only if for all θ_1, θ_2 ∈ Θ, and x, y in the support of X, if (θ_1 − θ_2) · [T(x) − T(y)] = 0, then θ_1 = θ_2 or T(x) = T(y).[11]

Examples

It is critical, when considering the examples in this section, to remember the discussion above about what it means to say that a "distribution" is an exponential family, and in particular to keep in mind that the set of parameters that are allowed to vary is critical in determining whether a "distribution" is or is not an exponential family.

The normal, exponential, log-normal, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, geometric, inverse Gaussian, ALAAM, von Mises, and von Mises-Fisher distributions are all exponential families.

Some distributions are exponential families only if some of their parameters are held fixed. The family of Pareto distributions with a fixed minimum bound xm form an exponential family. The families of binomial and multinomial distributions with fixed number of trials n but unknown probability parameter(s) are exponential families. The family of negative binomial distributions with fixed number of failures (a.k.a. stopping-time parameter) r is an exponential family. However, when any of the above-mentioned fixed parameters are allowed to vary, the resulting family is not an exponential family.

As mentioned above, as a general rule, the support of an exponential family must remain the same across all parameter settings in the family. This is why the above cases (e.g. binomial with varying number of trials, Pareto with varying minimum bound) are not exponential families — in all of the cases, the parameter in question affects the support (particularly, changing the minimum or maximum possible value). For similar reasons, neither the discrete uniform distribution nor continuous uniform distribution are exponential families as one or both bounds vary.

The Weibull distribution with fixed shape parameter k is an exponential family. Unlike in the previous examples, the shape parameter does not affect the support; the fact that allowing it to vary makes the Weibull non-exponential is due rather to the particular form of the Weibull's probability density function (k appears in the exponent of an exponent).

In general, distributions that result from a finite or infinite mixture of other distributions, e.g. mixture model densities and compound probability distributions, are not exponential families. Examples are typical Gaussian mixture models as well as many heavy-tailed distributions that result from compounding (i.e. infinitely mixing) a distribution with a prior distribution over one of its parameters, e.g. the Student's t-distribution (compounding a normal distribution over a gamma-distributed precision prior), and the beta-binomial and Dirichlet-multinomial distributions. Other examples of distributions that are not exponential families are the F-distribution, Cauchy distribution, hypergeometric distribution and logistic distribution.

Following are some detailed examples of the representation of some useful distributions as exponential families.

Normal distribution: unknown mean, known variance

As a first example, consider a random variable distributed normally with unknown mean μ and known variance σ². The probability density function is then

f_σ(x; μ) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2 / (2\sigma^2)}.

This is a single-parameter exponential family, as can be seen by setting

h_σ(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/(2\sigma^2)}, \quad T_σ(x) = \frac{x}{\sigma}, \quad A_σ(\mu) = \frac{\mu^2}{2\sigma^2}, \quad η_σ(\mu) = \frac{\mu}{\sigma}.

If σ = 1 this is in canonical form, as then η(μ) = μ.
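The following short Python sketch (illustrative only) checks numerically that the factorization h(x) exp(η T(x) − A), with the identification given above, reproduces the usual normal density.

```python
# Normal density with known variance, written in exponential-family form.
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def expfam_pdf(x, mu, sigma):
    h = math.exp(-x ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
    eta = mu / sigma            # natural parameter
    T = x / sigma               # sufficient statistic
    A = mu ** 2 / (2 * sigma ** 2)
    return h * math.exp(eta * T - A)

for x in (-1.0, 0.0, 2.3):
    assert abs(normal_pdf(x, 1.5, 2.0) - expfam_pdf(x, 1.5, 2.0)) < 1e-12
print("factorization reproduces the normal pdf")
```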

Normal distribution: unknown mean and unknown variance

Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then

f(y; μ, σ) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(y-\mu)^2/(2\sigma^2)}.

This is an exponential family which can be written in canonical form by defining

η = \left( \frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2} \right), \quad h(y) = \frac{1}{\sqrt{2\pi}}, \quad T(y) = (y, y^2)^T, \quad A(η) = \frac{\mu^2}{2\sigma^2} + \log|\sigma| = -\frac{η_1^2}{4η_2} + \frac{1}{2}\log\left(-\frac{1}{2η_2}\right).

Binomial distribution

As an example of a discrete exponential family, consider the binomial distribution with known number of trials n. The probability mass function for this distribution is

f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x \in \{0, 1, 2, \ldots, n\}.

This can equivalently be written as

f(x) = \binom{n}{x} \exp\left( x \log\frac{p}{1-p} + n \log(1-p) \right),

which shows that the binomial distribution is an exponential family whose natural parameter is

η = \log\frac{p}{1-p}.

This function of p is known as the logit.
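As a small illustrative check (not part of the original text), the Python sketch below confirms that the logit parameterization above reproduces the binomial pmf for fixed n.

```python
# Binomial pmf with known n, written with natural parameter eta = logit(p).
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_expfam_pmf(x, n, p):
    eta = math.log(p / (1 - p))          # natural parameter: logit(p)
    A = -n * math.log(1 - p)             # log-partition, equal to n*log(1 + e^eta)
    return math.comb(n, x) * math.exp(eta * x - A)

for x in range(11):
    assert abs(binom_pmf(x, 10, 0.35) - binom_expfam_pmf(x, 10, 0.35)) < 1e-12
print("logit parameterization matches the binomial pmf")
```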

Table of distributions

The following table shows how to rewrite a number of common distributions as exponential-family distributions with natural parameters. Refer to the flashcards[12] for main exponential families.

For a scalar variable and scalar parameter, the form is as follows:

f_X(x | θ) = h(x) exp( η(θ) T(x) − A(η) )

For a scalar variable and vector parameter:

f_X(x | θ) = h(x) exp( η(θ) · T(x) − A(η) )

For a vector variable and vector parameter:

f_X(x | θ) = h(x) exp( η(θ) · T(x) − A(η) )

The above formulas choose the functional form of the exponential family with a log-partition function A(η). The reason for this is so that the moments of the sufficient statistics can be calculated easily, simply by differentiating this function. Alternative forms involve either parameterizing this function in terms of the normal parameter θ instead of the natural parameter, and/or using a factor g(η) outside of the exponential. The relation between the latter and the former is g(η) = e^{−A(η)}, i.e. A(η) = −log g(η). To convert between the representations involving the two types of parameter, use the formulas below for writing one type of parameter in terms of the other.

Distribution Parameter(s) θ Natural parameter(s) η Inverse parameter mapping Base measure h(x) Sufficient statistic T(x) Log-partition A(η) Log-partition A(θ)
Bernoulli distribution
This is the logit function.

This is the logistic function.
binomial distribution
with known number of trials
Poisson distribution
negative binomial distribution
with known number of failures
exponential distribution
Pareto distribution
with known minimum value
Weibull distribution
with known shape k
Laplace distribution
with known mean
chi-squared distribution
normal distribution
known variance
continuous Bernoulli distribution


where log2 refers to the iterated logarithm

normal distribution
log-normal distribution
inverse Gaussian distribution
gamma distribution
inverse gamma distribution
generalized inverse Gaussian distribution
scaled inverse chi-squared distribution
beta distribution

(variant 1)
beta distribution

(variant 2)
multivariate normal distribution
categorical distribution

(variant 1)


where


where
is the Iverson bracket[i]
categorical distribution

(variant 2)


where
where is the Iverson bracket[i]
categorical distribution

(variant 3)


where

This is the inverse softmax function, a generalization of the logit function.



where and .

This is the softmax function, a generalization of the logistic function.

is the Iverson bracket[i]
multinomial distribution
(variant 1)
with known number of trials n


where


where
multinomial distribution
(variant 2)
with known number of trials


where

where

multinomial distribution
(variant 3)
with known number of trials


where

where and

Dirichlet distribution
(variant 1)
Dirichlet distribution
(variant 2)
Wishart distribution


Three variants with different parameterizations are given, to facilitate computing moments of the sufficient statistics.

Note: Uses the fact that tr(AᵀB) = vec(A) · vec(B), i.e. the trace of a matrix product is much like a dot product. The matrix parameters are assumed to be vectorized (laid out in a vector) when inserted into the exponential form. Also, V and X are symmetric, so e.g. Vᵀ = V.
inverse Wishart distribution
normal-gamma distribution
  1. ^ a b c The Iverson bracket is a generalization of the discrete delta-function: if the bracketed expression is true, the bracket has value 1; if the enclosed statement is false, the Iverson bracket is zero. There are many variant notations, e.g. wavy brackets: {a = b} is equivalent to the [a = b] notation used above.

The three variants of the categorical distribution and multinomial distribution are due to the fact that the parameters p_i are constrained, such that

Σ_{i=1}^{k} p_i = 1.

Thus, there are only k − 1 independent parameters.

  • Variant 1 uses k natural parameters with a simple relation between the standard and natural parameters; however, only k − 1 of the natural parameters are independent, and the set of k natural parameters is nonidentifiable. The constraint on the usual parameters translates to a similar constraint on the natural parameters.
  • Variant 2 demonstrates the fact that the entire set of natural parameters is nonidentifiable: Adding any constant value to the natural parameters has no effect on the resulting distribution. However, by using the constraint on the natural parameters, the formula for the normal parameters in terms of the natural parameters can be written in a way that is independent of the constant that is added.
  • Variant 3 shows how to make the parameters identifiable in a convenient way by setting η_k = 0 (see the sketch after this list). This effectively "pivots" around p_k and causes the last natural parameter to have the constant value of 0. All the remaining formulas are written in a way that does not access p_k, so that effectively the model has only k − 1 parameters, both of the usual and natural kind.

Variants 1 and 2 are not actually standard exponential families at all. Rather they are curved exponential families, i.e. there are k − 1 independent parameters embedded in a k-dimensional parameter space.[13] Many of the standard results for exponential families do not apply to curved exponential families. An example is the log-partition function A, which has the value of 0 in the curved cases. In standard exponential families, the derivatives of this function correspond to the moments (more technically, the cumulants) of the sufficient statistics, e.g. the mean and variance. However, a value of 0 suggests that the mean and variance of all the sufficient statistics are uniformly 0, whereas in fact the mean of the i-th sufficient statistic should be p_i. (This does emerge correctly when using the form of A shown in variant 3.)
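The following minimal Python sketch (illustrative, not from the article) shows the "variant 3" pivoting idea for the categorical distribution: the natural parameters are η_i = log(p_i / p_k) with η_k = 0, and the softmax map recovers the probabilities.

```python
# Variant-3 style natural parameters for a categorical distribution, pivoting on the last category.
import math

p = [0.2, 0.5, 0.3]                                   # usual parameters, summing to 1
eta = [math.log(pi / p[-1]) for pi in p]              # natural parameters; eta[-1] == 0

z = sum(math.exp(e) for e in eta)
p_recovered = [math.exp(e) / z for e in eta]          # softmax inverts the mapping

assert abs(eta[-1]) < 1e-12
assert all(abs(a - b) < 1e-12 for a, b in zip(p, p_recovered))
print("softmax recovers the original probabilities:", p_recovered)
```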

Moments and cumulants of the sufficient statistic

Normalization of the distribution

We start with the normalization of the probability distribution. In general, any non-negative function f(x) that serves as the kernel of a probability distribution (the part encoding all dependence on x) can be made into a proper distribution by normalizing: i.e.

p(x) = \frac{1}{Z} f(x)

where

Z = \int_x f(x) \, dx.

The factor Z is sometimes termed the normalizer or partition function, based on an analogy to statistical physics.

In the case of an exponential family where

p(x; η) = g(η) h(x) e^{η · T(x)},

the kernel is h(x) e^{η · T(x)} and the partition function is

Z = \int_x h(x) e^{η · T(x)} \, dx.

Since the distribution must be normalized, we have

1 = \int_x g(η) h(x) e^{η · T(x)} \, dx = g(η) Z.

In other words, g(η) = 1/Z, or equivalently

A(η) = \log Z = \log \int_x h(x) e^{η · T(x)} \, dx.

This justifies calling A the log-normalizer or log-partition function.
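As a numerical illustration (not from the article), the Poisson family in canonical form has kernel e^{ηx}/x!, so the partition sum is Z = Σ_x e^{ηx}/x! = exp(e^η) and A(η) = log Z = e^η; the truncated sum below checks this.

```python
# Truncated partition sum for the Poisson family, compared with the closed form A(eta) = exp(eta).
import math

eta = 0.7
Z = sum(math.exp(eta * x) / math.factorial(x) for x in range(200))  # truncated partition sum
A_from_Z = math.log(Z)
A_closed_form = math.exp(eta)

print(f"log Z = {A_from_Z:.10f}, exp(eta) = {A_closed_form:.10f}")
assert abs(A_from_Z - A_closed_form) < 1e-9
```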

Moment-generating function of the sufficient statistic

Now, the moment-generating function of T(x) is

M_T(u) ≡ E[ e^{u · T(x)} | η ] = \int_x h(x) e^{(η + u) · T(x) − A(η)} \, dx = e^{A(η + u) − A(η)},

proving the earlier statement that

K(u | η) = A(η + u) − A(η)

is the cumulant generating function for T.

An important subclass of exponential families is the natural exponential families, which have a similar form for the moment-generating function for the distribution of x.

Differential identities for cumulants

In particular, using the properties of the cumulant generating function,

E[T_j(X)] = \frac{\partial A(η)}{\partial η_j}

and

Cov[T_i(X), T_j(X)] = \frac{\partial^2 A(η)}{\partial η_i \, \partial η_j}.

The first two raw moments and all mixed second moments can be recovered from these two identities. Higher-order moments and cumulants are obtained by higher derivatives. This technique is often useful when T is a complicated function of the data, whose moments are difficult to calculate by integration.

Another way to see this that does not rely on the theory of cumulants is to begin from the fact that the distribution of an exponential family must be normalized, and differentiate. We illustrate using the simple case of a one-dimensional parameter, but an analogous derivation holds more generally.

In the one-dimensional case, we have

p(x) = g(η) h(x) e^{η T(x)}.

This must be normalized, so

1 = \int_x g(η) h(x) e^{η T(x)} \, dx = g(η) \int_x h(x) e^{η T(x)} \, dx.

Take the derivative of both sides with respect to η:

0 = g(η) \int_x h(x) T(x) e^{η T(x)} \, dx + g'(η) \int_x h(x) e^{η T(x)} \, dx.

Therefore, dividing through and using ∫ h(x) e^{η T(x)} dx = 1/g(η),

E[T(X)] = \int_x T(x) g(η) h(x) e^{η T(x)} \, dx = -\frac{g'(η)}{g(η)} = -\frac{d}{dη} \log g(η) = \frac{d A(η)}{dη}.
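As a quick illustration of this identity (not from the article), take the exponential distribution: η = −λ, T(x) = x, and A(η) = −log(−η), so dA/dη = −1/η = 1/λ, which is indeed E[X]. The sketch below compares a finite-difference derivative of A with a crude numerical integral of x f(x).

```python
# E[T(X)] = dA/d(eta) for the exponential distribution with rate lambda.
import math

lam = 2.0
eta = -lam

A = lambda e: -math.log(-e)
h = 1e-6
dA = (A(eta + h) - A(eta - h)) / (2 * h)        # finite-difference derivative of A

dx = 1e-4
mean = sum(x * lam * math.exp(-lam * x) * dx     # Riemann sum of x * f(x) on [0, 20]
           for x in (i * dx for i in range(int(20 / dx))))

print(f"dA/deta = {dA:.6f}, E[X] ~ {mean:.6f}, 1/lambda = {1 / lam}")
```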

Example 1

As an introductory example, consider the gamma distribution, whose distribution is defined by

p(x) = \frac{β^α}{Γ(α)} x^{α−1} e^{−βx}.

Referring to the above table, we can see that the natural parameter is given by

η_1 = α − 1,  η_2 = −β,

the reverse substitutions are

α = η_1 + 1,  β = −η_2,

the sufficient statistics are (log x, x), and the log-partition function is

A(η_1, η_2) = \log Γ(η_1 + 1) − (η_1 + 1) \log(−η_2).

We can find the mean of the sufficient statistics as follows. First, for η_1:

E[\log x] = \frac{\partial A}{\partial η_1} = ψ(η_1 + 1) − \log(−η_2) = ψ(α) − \log β,

where ψ(x) is the digamma function (derivative of log gamma), and we used the reverse substitutions in the last step.

Now, for η_2:

E[x] = \frac{\partial A}{\partial η_2} = −\frac{η_1 + 1}{η_2} = \frac{α}{β},

again making the reverse substitution in the last step.

To compute the variance of x, we just differentiate again:

Var(x) = \frac{\partial^2 A}{\partial η_2^2} = \frac{η_1 + 1}{η_2^2} = \frac{α}{β^2}.

All of these calculations can be done using integration, making use of various properties of the gamma function, but this requires significantly more work.
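The following Monte Carlo sketch (illustrative only, dependency-free) checks the two means derived above, E[X] = α/β and E[log X] = ψ(α) − log β, approximating the digamma function by differentiating math.lgamma. Note that Python's random.gammavariate takes a scale parameter, i.e. 1/rate.

```python
# Monte Carlo check of the gamma sufficient-statistic means (alpha = shape, beta = rate).
import math
import random

alpha, beta = 3.0, 2.0
random.seed(0)

samples = [random.gammavariate(alpha, 1.0 / beta) for _ in range(200_000)]  # scale = 1/rate

mean_x = sum(samples) / len(samples)
mean_logx = sum(math.log(s) for s in samples) / len(samples)

h = 1e-6
digamma = (math.lgamma(alpha + h) - math.lgamma(alpha - h)) / (2 * h)

print(f"E[X]     ~ {mean_x:.4f}    vs alpha/beta           = {alpha / beta:.4f}")
print(f"E[log X] ~ {mean_logx:.4f}    vs psi(alpha)-log(beta) = {digamma - math.log(beta):.4f}")
```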

Example 2

As another example consider a real valued random variable X with density

p_θ(x) = θ e^{−x} (1 + e^{−x})^{−(θ+1)}

indexed by shape parameter θ ∈ (0, ∞) (this is called the skew-logistic distribution). The density can be rewritten as

p_θ(x) = \frac{e^{−x}}{1 + e^{−x}} \exp( −θ \log(1 + e^{−x}) + \log θ ).

Notice this is an exponential family with natural parameter

η = −θ,

sufficient statistic

T = \log(1 + e^{−x}),

and log-partition function

A(η) = −\log θ = −\log(−η).

So using the first identity,

E[\log(1 + e^{−X})] = E[T] = \frac{\partial A(η)}{\partial η} = −\frac{1}{η} = \frac{1}{θ},

and using the second identity,

Var[\log(1 + e^{−X})] = \frac{\partial^2 A(η)}{\partial η^2} = \frac{1}{η^2} = \frac{1}{θ^2}.
This example illustrates a case where using this method is very simple, but the direct calculation would be nearly impossible.

Example 3

The final example is one where integration would be extremely difficult. This is the case of the Wishart distribution, which is defined over matrices. Even taking derivatives is a bit tricky, as it involves matrix calculus, but the respective identities are listed in that article.

From the above table, we can see that the natural parameter is given by

the reverse substitutions are

and the sufficient statistics are

The log-partition function is written in various forms in the table, to facilitate differentiation and back-substitution. We use the following forms:

Expectation of X (associated with η1)

To differentiate with respect to η1, we need the following matrix calculus identity:

Then:

The last line uses the fact that V is symmetric, and therefore it is the same when transposed.

Expectation of log |X| (associated with η2)

Now, for η2, we first need to expand the part of the log-partition function that involves the multivariate gamma function:

We also need the digamma function:

Then:

This latter formula is listed in the Wishart distribution article. Both of these expectations are needed when deriving the variational Bayes update equations in a Bayes network involving a Wishart distribution (which is the conjugate prior of the multivariate normal distribution).

Computing these formulas using integration would be much more difficult. The first one, for example, would require matrix integration.

Entropy

Relative entropy

The relative entropy (Kullback–Leibler divergence, KL divergence) of two distributions in an exponential family has a simple expression as the Bregman divergence between the natural parameters with respect to the log-normalizer.[14] The relative entropy is defined in terms of an integral, while the Bregman divergence is defined in terms of a derivative and inner product, and thus is easier to calculate and has a closed-form expression (assuming the derivative has a closed-form expression). Further, the Bregman divergence in terms of the natural parameters and the log-normalizer equals the Bregman divergence of the dual parameters (expectation parameters), in the opposite order, for the convex conjugate function.[15]

Fixing an exponential family with log-normalizer A (with convex conjugate A*), writing P_{A,θ} for the distribution in this family corresponding to a fixed value of the natural parameter θ (writing θ′ for another value, and with μ, μ′ for the corresponding dual expectation/moment parameters), writing KL for the KL divergence, and B_A for the Bregman divergence, the divergences are related as:

KL(P_{A,θ} ∥ P_{A,θ′}) = B_A(θ′ ∥ θ) = B_{A*}(μ ∥ μ′).

The KL divergence is conventionally written with respect to the first parameter, while the Bregman divergence is conventionally written with respect to the second parameter, and thus this can be read as "the relative entropy is equal to the Bregman divergence defined by the log-normalizer on the swapped natural parameters", or equivalently as "equal to the Bregman divergence defined by the dual to the log-normalizer on the expectation parameters".
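As a numerical illustration (not part of the original text), take two Poisson distributions: with A(η) = e^η, the Bregman divergence of A on the swapped natural parameters should equal the KL divergence computed directly from the pmfs.

```python
# KL divergence between two Poissons vs. the Bregman divergence of A(eta) = exp(eta).
import math

lam1, lam2 = 3.0, 5.0
eta1, eta2 = math.log(lam1), math.log(lam2)

def pois(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

# direct KL by (truncated) summation over the support
kl = sum(pois(x, lam1) * math.log(pois(x, lam1) / pois(x, lam2)) for x in range(200))

# Bregman divergence B_A(eta2, eta1) = A(eta2) - A(eta1) - A'(eta1) * (eta2 - eta1)
bregman = math.exp(eta2) - math.exp(eta1) - math.exp(eta1) * (eta2 - eta1)

print(f"KL      = {kl:.10f}")
print(f"Bregman = {bregman:.10f}")
```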

Maximum-entropy derivation

Exponential families arise naturally as the answer to the following question: what is the maximum-entropy distribution consistent with given constraints on expected values?

The information entropy of a probability distribution dF(x) can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both measures must be mutually absolutely continuous. Accordingly, we need to pick a reference measure dH(x) with the same support as dF(x).

The entropy of dF(x) relative to dH(x) is

−\int \frac{dF}{dH} \log \frac{dF}{dH} \, dH(x)

or

\int \log \frac{dH}{dF} \, dF(x),

where dF/dH and dH/dF are Radon–Nikodym derivatives. The ordinary definition of entropy for a discrete distribution supported on a set I, namely

−\sum_{i \in I} p_i \log p_i,

assumes, though this is seldom pointed out, that dH is chosen to be the counting measure on I.

Consider now a collection of observable quantities (random variables) Ti. The probability distribution dF whose entropy with respect to dH is greatest, subject to the conditions that the expected value of Ti be equal to ti, is an exponential family with dH as reference measure and (T1, ..., Tn) as sufficient statistic.

The derivation is a simple variational calculation using Lagrange multipliers. Normalization is imposed by letting T0 = 1 be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is the Lagrange multiplier associated to T0.

For examples of such derivations, see Maximum entropy probability distribution.

Role in statistics

Classical estimation: sufficiency

According to the Pitman–Koopman–Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases.

Less tersely, suppose X_k, k = 1, 2, 3, ..., n, are independent, identically distributed random variables. Only if their distribution is one of the exponential family of distributions is there a sufficient statistic T(X_1, ..., X_n) whose number of scalar components does not increase as the sample size n increases; the statistic T may be a vector or a single scalar number, but whatever it is, its size will neither grow nor shrink when more data are obtained.

As a counterexample if these conditions are relaxed, the family of uniform distributions (either discrete or continuous, with either or both bounds unknown) has a sufficient statistic, namely the sample maximum, sample minimum, and sample size, but does not form an exponential family, as the domain varies with the parameters.

Bayesian estimation: conjugate distributions

Exponential families are also important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. In the case of a likelihood which belongs to an exponential family there exists a conjugate prior, which is often also in an exponential family. A conjugate prior π for the parameter η of an exponential family

f(x | η) = h(x) exp( η · T(x) − A(η) )

is given by

p_π(η | χ, ν) = f(χ, ν) exp( η · χ − ν A(η) ),

or equivalently

p_π(η | χ, ν) = f(χ, ν) g(η)^ν exp( η · χ ),

where s is the dimension of η, and ν > 0 and χ are hyperparameters (parameters controlling parameters). ν corresponds to the effective number of observations that the prior distribution contributes, and χ corresponds to the total amount that these pseudo-observations contribute to the sufficient statistic over all observations and pseudo-observations. f(χ, ν) is a normalization constant that is automatically determined by the remaining functions and serves to ensure that the given function is a probability density function (i.e. it is normalized). A(η) and equivalently g(η) are the same functions as in the definition of the distribution over which π is the conjugate prior.

A conjugate prior is one which, when combined with the likelihood and normalised, produces a posterior distribution which is of the same type as the prior. For example, if one is estimating the success probability of a binomial distribution, then if one chooses to use a beta distribution as one's prior, the posterior is another beta distribution. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a Poisson distribution the use of a gamma prior will lead to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the theta parameter of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution. It can however be represented by using a mixture density as the prior, here a combination of two beta distributions; this is a form of hyperprior.

An arbitrary likelihood will not belong to an exponential family, and thus in general no conjugate prior exists. The posterior will then have to be computed by numerical methods.

To show that the above prior distribution is a conjugate prior, we can derive the posterior.

First, assume that the probability of a single observation follows an exponential family, parameterized using its natural parameter:

p_F(x | η) = h(x) exp( η · T(x) − A(η) ).

Then, for data X = (x_1, ..., x_n), the likelihood is computed as follows:

p(X | η) = \left( \prod_{i=1}^n h(x_i) \right) \exp\left( η · \sum_{i=1}^n T(x_i) − n A(η) \right).

Then, for the above conjugate prior:

p_π(η | χ, ν) = f(χ, ν) \exp( η · χ − ν A(η) ).

We can then compute the posterior as follows:

p(η | X, χ, ν) ∝ p(X | η) p_π(η | χ, ν) ∝ \exp\left( η · \left( χ + \sum_{i=1}^n T(x_i) \right) − (ν + n) A(η) \right).

The last line is the kernel of the posterior distribution, i.e.

p(η | X, χ, ν) = p_π\left( η \,\middle|\, χ + \sum_{i=1}^n T(x_i),\; ν + n \right).

This shows that the posterior has the same form as the prior.

The data X enters into this equation only in the expression

T(X) = \sum_{i=1}^n T(x_i),

which is termed the sufficient statistic of the data. That is, the value of the sufficient statistic is sufficient to completely determine the posterior distribution. The actual data points themselves are not needed, and all sets of data points with the same sufficient statistic will have the same distribution. This is important because the dimension of the sufficient statistic does not grow with the data size — it has only as many components as the components of T(x) (equivalently, the number of parameters of the distribution of a single data point).

The update equations are as follows:

χ′ = χ + T(X) = χ + \sum_{i=1}^n T(x_i),   ν′ = ν + n.

This shows that the update equations can be written simply in terms of the number of data points and the sufficient statistic of the data. This can be seen clearly in the various examples of update equations shown in the conjugate prior page. Because of the way that the sufficient statistic is computed, it necessarily involves sums of components of the data (in some cases disguised as products or other forms — a product can be written in terms of a sum of logarithms). The cases where the update equations for particular distributions don't exactly match the above forms are cases where the conjugate prior has been expressed using a different parameterization than the one that produces a conjugate prior of the above form — often specifically because the above form is defined over the natural parameter η while conjugate priors are usually defined over the actual parameter θ.
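The following minimal Python sketch (illustrative only) applies the update rule χ′ = χ + Σ T(x_i), ν′ = ν + n to Poisson counts (T(x) = x); the mapping of (χ, ν) onto the usual gamma shape/rate hyperparameters is assumed here purely for illustration.

```python
# Conjugate-prior update in natural-parameter form: chi' = chi + sum T(x_i), nu' = nu + n.
from dataclasses import dataclass

@dataclass
class ConjugatePrior:
    chi: float   # pseudo-total of the sufficient statistic
    nu: float    # effective number of pseudo-observations

    def update(self, data, T=lambda x: x):
        """Return the posterior hyperparameters after observing `data`."""
        return ConjugatePrior(self.chi + sum(T(x) for x in data),
                              self.nu + len(data))

prior = ConjugatePrior(chi=2.0, nu=1.0)        # roughly a gamma prior with shape ~ 2, rate ~ 1 (assumed mapping)
counts = [3, 1, 4, 1, 5]                       # observed Poisson counts
posterior = prior.update(counts)
print(posterior)                               # ConjugatePrior(chi=16.0, nu=6.0)
```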

Unbiased estimation

If the likelihood is an exponential family, then the unbiased estimator of is .[16]

Hypothesis testing: uniformly most powerful tests

A one-parameter exponential family has a monotone non-decreasing likelihood ratio in the sufficient statistic T(x), provided that η(θ) is non-decreasing. As a consequence, there exists a uniformly most powerful test for testing the hypothesis H₀: θ ≥ θ₀ vs. H₁: θ < θ₀.

Generalized linear models

Exponential families form the basis for the distribution functions used in generalized linear models (GLM), a class of model that encompasses many of the commonly used regression models in statistics. Examples include logistic regression using the binomial family and Poisson regression.
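As a compact sketch of this connection (not a production implementation, and using numpy as an assumed dependency), the code below fits a Poisson regression, the GLM built on the Poisson exponential family with its canonical log link, by Newton/Fisher-scoring iterations on the log-likelihood Σ_i [y_i (x_i·β) − exp(x_i·β)].

```python
# Poisson regression with the canonical log link, fitted by Newton's method.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])   # intercept + one covariate
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(2)
for _ in range(25):                       # Newton-Raphson / Fisher scoring iterations
    mu = np.exp(X @ beta)                 # canonical link: eta = log(mu)
    grad = X.T @ (y - mu)                 # score
    hess = X.T @ (X * mu[:, None])        # Fisher information
    beta = beta + np.linalg.solve(hess, grad)

print("estimated coefficients:", beta)    # should be close to beta_true
```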

from Grokipedia
In probability and statistics, the exponential family is a broad class of parametric probability distributions whose densities or mass functions take the canonical form $p(x \mid \eta) = h(x) \exp\left\{ \eta^\top T(x) - A(\eta) \right\}$, where $\eta$ is the natural or canonical parameter, $T(x)$ is the sufficient statistic, $A(\eta)$ is the cumulant or log-normalizer function ensuring integrability, and $h(x)$ is a base measure with respect to which the density is defined. This parameterization unifies many common distributions, such as the normal, binomial, Poisson, gamma, and multinomial, facilitating shared theoretical properties and computational tractability in statistical inference. The natural parameter space must be convex and open for the family to be regular, ensuring the cumulant function $A(\eta)$ is strictly convex and differentiable, which allows moments like the mean $\mathbb{E}[T(X)] = \nabla A(\eta)$ and variance $\mathrm{Var}(T(X)) = \nabla^2 A(\eta)$ to be derived directly from it.

The concept of exponential families emerged in the 1930s through the independent work of Darmois, Pitman, and Koopman, formalized in the Pitman–Koopman–Darmois theorem, which characterizes them as the only distributions permitting a fixed-dimensional sufficient statistic for independent and identically distributed samples under mild regularity conditions. This theorem highlighted their role in dimension reduction and sufficiency, key pillars of modern statistical theory. Over the decades, exponential families gained prominence in generalized linear models (GLMs), where they serve as the basis for response distributions linked to linear predictors via a canonical link function, enabling applications in regression for diverse data types like counts, proportions, and continuous outcomes.

Key properties of exponential families include the existence of minimal representations, where the parameter dimension matches the sufficient statistic's dimension, avoiding redundancy. They exhibit conjugacy in Bayesian inference, where conjugate priors often take an exponential family form, simplifying posterior computations. In machine learning and optimization, their convexity properties underpin algorithms like variational inference and expectation-maximization for probabilistic modeling. For instance, the Bernoulli distribution (for binary data) has natural parameter $\eta = \log(p/(1-p))$ and sufficient statistic $T(x) = x$, while the Poisson has $\eta = \log\lambda$ and $T(x) = x$. These features make exponential families foundational for parametric statistics, with ongoing extensions to non-parametric and high-dimensional settings.

Definition and Formulation

General Definition

The exponential family of distributions was introduced independently in the mid-1930s by Georges Darmois, Edwin J. G. Pitman, and Bernard O. Koopman as a class of parametric models admitting sufficient statistics of fixed dimension, thereby generalizing various probability distributions to simplify statistical inference. This unification arose from efforts to characterize distributions where the dimension of the sufficient statistic does not grow with sample size, a property central to efficient estimation and testing.

A distribution belongs to the exponential family if its probability density function (for continuous random variables) or probability mass function (for discrete random variables) with respect to a suitable measure can be expressed in the form

$$f(x \mid \theta) = h(x) \exp\left\{ \eta(\theta) \cdot T(x) - A(\theta) \right\},$$

where $\theta$ is the parameter, $h(x)$ is the base measure (a non-negative function independent of $\theta$), $\eta(\theta)$ is the natural parameter (a function of $\theta$), $T(x)$ is the sufficient statistic (a function of the observation $x$), and $A(\theta)$ is the log-partition function. The base measure $h(x)$ ensures that the expression is integrable over the support of $x$ and incorporates any parts of the density not depending on $\theta$. The natural parameter $\eta(\theta)$ reparameterizes $\theta$ to directly associate it with the sufficient statistic, while $T(x)$ summarizes the information in $x$ relevant to $\theta$. The log-partition function $A(\theta)$ is defined as

$$A(\theta) = \log \int h(x) \exp\left\{ \eta(\theta) \cdot T(x) \right\} \, dx,$$

where the integral (or sum, for discrete cases) is taken over the support of $x$. The normalization condition follows directly from the definition of $A(\theta)$: integrating the density yields

$$\int f(x \mid \theta) \, dx = \int h(x) \exp\left\{ \eta(\theta) \cdot T(x) - A(\theta) \right\} \, dx = e^{-A(\theta)} \int h(x) \exp\left\{ \eta(\theta) \cdot T(x) \right\} \, dx = e^{-A(\theta)} \cdot e^{A(\theta)} = 1.$$

For the family to retain its exponential structure as $\theta$ varies, the support of $x$ must be independent of $\theta$; otherwise, the form would not hold uniformly across the parameter space.
This parameterization is advantageous because it reveals common structural properties among diverse distributions, such as the form of the maximum likelihood estimator and the variance of sufficient statistics, enabling unified theoretical developments in statistics.[](https://www.stat.umn.edu/geyer/5421/notes/expfam.html) ### Canonical Parameterization The canonical parameterization of an exponential family reparameterizes the general form by expressing the parameter directly in terms of the natural parameter $\eta$, yielding the density $f(x \mid \eta) = h(x) \exp\{ \eta \cdot T(x) - A(\eta) \}$, where $\theta$ from the original parameterization is mapped via $\eta = \Phi(\theta)$, and $A(\eta)$ is the log-partition function ensuring normalization.[](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf) This form simplifies computations in statistical inference by aligning the parameter with the sufficient statistic $T(x)$, with the support of the distribution independent of $\eta$ to maintain regularity.[](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf) In the scalar parameter case, the density takes the explicit form $f(x \mid \eta) = h(x) \exp\{ \eta T(x) - A(\eta) \}$, where $\eta \in \mathcal{N}$ is the natural parameter in an open interval (the natural parameter space), $T(x)$ is the scalar sufficient statistic, and $A(\eta) = \log \int h(x) \exp\{ \eta T(x) \} \nu(dx)$ for the underlying measure $\nu$; this requires the integral to be finite for all $\eta \in \mathcal{N}$ to ensure the family is regular.[](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf)[](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf) For a vector-valued sufficient statistic $T(x) = (T_1(x), \dots, T_k(x))$ of dimension $k$, the canonical parameter $\eta = (\eta_1, \dots, \eta_k)$ is also $k$-dimensional, and the density becomes $f(x \mid \eta) = h(x) \exp\{ \eta \cdot T(x) - A(\eta) \}$, with the inner product $\eta \cdot T(x) = \sum_{i=1}^k \eta_i T_i(x)$; here, $\mathcal{N} \subseteq \mathbb{R}^k$ is open and convex to guarantee differentiability of $A(\eta)$.[](https://www.cs.cmu.edu/~epxing/Class/10708-09/lecture/lecture7.pdf)[](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf) When the original parameter $\theta$ is $m$-dimensional, the natural parameter is $\eta(\theta) = (\eta_1(\theta_1, \dots, \theta_m), \dots, \eta_k(\theta_1, \dots, \theta_m))$, and the exponent expands to $\sum_{i=1}^k \eta_i(\theta) T_i(x) - A(\eta(\theta))$, allowing the family to capture multiparameter dependencies through this mapping.[](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf) This canonical form arises directly from the Neyman-Fisher factorization theorem, which states that a statistic $T(x)$ is sufficient for $\theta$ if the density factors as $f(x \mid \theta) = g(T(x), \theta) h(x)$; distributions admitting such a factorization for some $T(x)$ can always be reparameterized into the exponential form with $\eta$ tied to the components of $g$.[](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf)[](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf) The representation in canonical form is unique only up to affine transformations of $T(x)$ and $\eta$, such as rescaling $\eta' = c \eta$ and $T'(x) = T(x)/c$ for $c \neq 0$, which leaves the inner product $\eta \cdot T(x)$ invariant and thus the family 
unchanged.[](https://www.cs.cmu.edu/~epxing/Class/10708-09/lecture/lecture7.pdf)[](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf) ### Measure-Theoretic Version The measure-theoretic formulation of the exponential family provides a general framework for parametric families of probability distributions on a measurable space $(X, \mathcal{X})$, where $\mathcal{X}$ is a $\sigma$-algebra. Consider a $\sigma$-finite measure $\mu$ on $(X, \mathcal{X})$, and let $\{P_\theta : \theta \in \Theta\}$ be a family of probability measures absolutely continuous with respect to $\mu$. The family is exponential if there exist a positive $\mu$-integrable function $h: X \to (0, \infty)$, a statistic $T: X \to \mathbb{R}^k$, a parameter function $\eta: \Theta \to \mathbb{R}^k$, and a normalizing function $A: \Theta \to \mathbb{R}$ such that the density of $P_\theta$ with respect to $\mu$ is given by f(x \mid \theta) = h(x) \exp\left{ \eta(\theta) \cdot T(x) - A(\theta) \right} for $\mu$-almost all $x \in X$.[](https://sites.stat.washington.edu/jaw/COURSES/580s/582/HO/Lehmann_and_Romano-TestingStatisticalHypotheses.pdf)[](https://pure.au.dk/ws/files/51499534/Mon_52.pdf) The base measure $\mu$ plays a central role in this setup, serving as a dominating $\sigma$-finite measure that ensures the densities exist via the Radon-Nikodym theorem; $\sigma$-finiteness means $\mu$ can be expressed as a countable union of sets of finite measure, which guarantees integrability of $h$ and allows the formulation to encompass both continuous cases (e.g., Lebesgue measure) and discrete cases (e.g., counting measure). The function $h(x)$ is typically positive and $\mu$-integrable, absorbing any carrier measure effects, while the exponential form ensures the family is closed under certain transformations.[](https://sites.stat.washington.edu/jaw/COURSES/580s/582/HO/Lehmann_and_Romano-TestingStatisticalHypotheses.pdf)[](https://pure.au.dk/ws/files/51499534/Mon_52.pdf) A minimal exponential family representation is achieved when the dimension $k$ of the natural parameter $\eta$ and statistic $T$ matches the dimension of the family, requiring linear independence of the components $\{1, T_1, \dots, T_k\}$ with respect to $\mu$ and of $\{1, \eta_1, \dots, \eta_k\}$ over $\Theta$; this avoids redundant parameters and ensures the parameterization is full rank. Curved exponential families arise when the parameter space $\Theta$ is a curved subset of the full natural parameter space $\mathbb{R}^k$, such as $\eta(\beta)$ for $\beta$ in a lower-dimensional manifold, preserving the exponential structure but restricting the support.[](https://pure.au.dk/ws/files/51499534/Mon_52.pdf)[](https://sites.stat.washington.edu/jaw/COURSES/580s/582/HO/Lehmann_and_Romano-TestingStatisticalHypotheses.pdf) Regularity conditions are essential for analytic properties, including the openness of the natural parameter space and differentiability of the cumulant function $A(\eta)$, which ensures the existence of moments via $\nabla A(\eta) = \mathbb{E}_\eta[T(X)]$; steeper conditions, such as strict convexity of $A$, further guarantee unique parameterization. 
Exponential families are inherently dominated families, as all $P_\theta$ are absolutely continuous with respect to the base measure $\mu$, sharing the same null sets and enabling unified treatment across discrete and continuous settings.[](https://pure.au.dk/ws/files/51499534/Mon_52.pdf)[](https://sites.stat.washington.edu/jaw/COURSES/580s/582/HO/Lehmann_and_Romano-TestingStatisticalHypotheses.pdf) ## Properties ### Sufficiency and Factorization A fundamental result in statistical inference is the Neyman-Fisher factorization theorem, which provides a criterion for identifying sufficient statistics. The theorem states that a statistic $ T(\mathbf{x}) $ is sufficient for the parameter $ \theta $ if the joint probability density or mass function $ f(\mathbf{x} \mid \theta) $ can be factored as $ f(\mathbf{x} \mid \theta) = g(T(\mathbf{x}), \theta) \cdot h(\mathbf{x}) $, where $ g $ depends on $ \mathbf{x} $ only through $ T(\mathbf{x}) $ and on $ \theta $, while $ h $ depends only on $ \mathbf{x} $.[](http://homepages.math.uic.edu/~jyang06/stat411/handouts/Neyman_Fisher_Theorem.pdf)[](https://math.arizona.edu/~jwatkins/sufficiency.pdf) In the context of the exponential family, the density takes the form $ f(\mathbf{x} \mid \theta) = h(\mathbf{x}) \exp\left( \eta(\theta) \cdot T(\mathbf{x}) - A(\theta) \right) $, which directly satisfies the factorization criterion with $ g(T(\mathbf{x}), \theta) = \exp\left( \eta(\theta) \cdot T(\mathbf{x}) - A(\theta) \right) $ and $ h(\mathbf{x}) $ as the base measure. Thus, the statistic $ T(\mathbf{x}) $ is sufficient for $ \theta $.[](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf)[](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf) To see why this implies sufficiency, consider the conditional distribution of $ \mathbf{x} $ given $ T(\mathbf{x}) = t $. This conditional density is $ f(\mathbf{x} \mid T(\mathbf{x}) = t, \theta) = \frac{f(\mathbf{x} \mid \theta)}{f_T(t \mid \theta)} $, where $ f_T $ is the marginal density of $ T $. Substituting the exponential form yields $ f(\mathbf{x} \mid T(\mathbf{x}) = t, \theta) = \frac{h(\mathbf{x}) \exp\left( \eta(\theta) \cdot t - A(\theta) \right)}{f_T(t \mid \theta)} $, which simplifies to a form independent of $ \theta $ because the $ \theta $-dependent terms cancel out. Therefore, the conditional distribution does not depend on $ \theta $, confirming that $ T(\mathbf{x}) $ captures all information about $ \theta $ from the data.[](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf)[](https://www.stat.umn.edu/geyer/5421/notes/expfam.html) Furthermore, in a full-rank exponential family, the canonical statistic $ T(\mathbf{x}) $ is minimal sufficient, meaning it is a function of every other sufficient statistic and has the smallest dimension among sufficient statistics of that dimension. This minimality follows from the completeness of $ T(\mathbf{x}) $, as a complete sufficient statistic is necessarily minimal sufficient.[](https://stat210a.berkeley.edu/fall-2024/reader/completeness.html)[](https://math.arizona.edu/~jwatkins/F4_completeness.pdf) Under standard regularity conditions, such as the support of the density being independent of $ \theta $ and the parameter space having positive Lebesgue measure, the exponential family is complete: if $ \mathbb{E}_\theta[\phi(T(\mathbf{x}))] = 0 $ for all $ \theta $, then $ \phi(t) = 0 $ almost surely with respect to the distribution of $ T $. 
This completeness property ensures that unbiased estimators based on $ T $ are unique, up to measure zero sets.[](https://stat210a.berkeley.edu/fall-2024/reader/completeness.html)[](https://www2.stat.duke.edu/courses/Spring22/sta732.01/lecture04.pdf) The completeness of the sufficient statistic in exponential families has important implications via Basu's theorem, which states that if $ T(\mathbf{x}) $ is a complete sufficient statistic and $ S(\mathbf{x}) $ is an ancillary statistic (whose distribution does not depend on $ \theta $), then $ T(\mathbf{x}) $ and $ S(\mathbf{x}) $ are independent for every $ \theta $. This independence facilitates the construction of unbiased estimators and tests by separating the information-carrying and non-informative components of the data.[](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf)[](https://eml.berkeley.edu/~mcfadden/e240a_sp01/sufficiency.pdf) ### Moments and Cumulants In the canonical parameterization of an exponential family, the probability density function takes the form $ p(x \mid \eta) = h(x) \exp\left( \eta^\top T(x) - A(\eta) \right) $, where $ A(\eta) = \log \int h(x) \exp\left( \eta^\top T(x) \right) dx $ is the log-partition function and $ T(x) $ is the sufficient statistic. The derivatives of $ A(\eta) $ provide direct access to the moments of $ T(x) $. Specifically, the expected value of each component of the sufficient statistic is given by the first partial derivative: $ \mathbb{E}[T_i(x)] = \frac{\partial A(\eta)}{\partial \eta_i} $.[](https://pages.cs.wisc.edu/~andrzeje/lmml/exp-family-glms.pdf) In vector notation, this yields $ \mathbb{E}[T(x)] = \nabla_\eta A(\eta) $. The second partial derivatives of $ A(\eta) $ correspond to the covariances of the components of $ T(x) $: $ \text{Cov}(T_i(x), T_j(x)) = \frac{\partial^2 A(\eta)}{\partial \eta_i \partial \eta_j} $.[](https://pages.cs.wisc.edu/~andrzeje/lmml/exp-family-glms.pdf) Thus, the covariance matrix of $ T(x) $ is the Hessian matrix $ \nabla^2_\eta A(\eta) $, which is positive semi-definite due to the convexity of $ A(\eta) $. Higher-order moments can be obtained from higher derivatives, but the structure simplifies when considering cumulants, as $ A(\eta) $ itself serves as the cumulant-generating function for $ T(x) $ up to a shift.[](https://pages.cs.wisc.edu/~andrzeje/lmml/exp-family-glms.pdf) The moment-generating function of $ T(x) $ is $ M_T(t) = \exp\left( A(\eta + t) - A(\eta) \right) $, where $ t $ is a vector parameter. Consequently, the cumulant-generating function is $ K_T(t) = A(\eta + t) - A(\eta) $, and the cumulants of $ T(x) $ are the coefficients in its Taylor expansion around $ t = 0 $. In the scalar case (where $ \eta $ and $ T(x) $ are one-dimensional), the $ r $-th cumulant is $ \kappa_r = \frac{\partial^r A(\eta)}{\partial \eta^r} $.[](https://pages.cs.wisc.edu/~andrzeje/lmml/exp-family-glms.pdf) For the multivariate case, the joint cumulants of the components of $ T(x) $ are given by the mixed partial derivatives of $ A(\eta) $; for instance, the third-order joint cumulant involving $ T_i, T_j, T_k $ is $ \kappa_{ijk} = \frac{\partial^3 A(\eta)}{\partial \eta_i \partial \eta_j \partial \eta_k} $. These relations yield useful differential identities. 
In the scalar case, let $ \mu = \mathbb{E}[T(x)] = \frac{d A(\eta)}{d \eta} $; then the variance satisfies $ \text{Var}(T(x)) = \frac{d^2 A(\eta)}{d \eta^2} = \frac{d \mu}{d \eta} $.[](https://pages.cs.wisc.edu/~andrzeje/lmml/exp-family-glms.pdf) Differentiating further, the third cumulant (related to skewness) is $ \kappa_3 = \frac{d^3 A(\eta)}{d \eta^3} = \frac{d \text{Var}(T(x))}{d \eta} $, and the fourth cumulant (related to kurtosis) follows analogously as $ \kappa_4 = \frac{d^4 A(\eta)}{d \eta^4} = \frac{d \kappa_3}{d \eta} $. These identities highlight how changes in the natural parameter $ \eta $ propagate through the moments and cumulants of the sufficient statistic.[](https://pages.cs.wisc.edu/~andrzeje/lmml/exp-family-glms.pdf) ### Entropy Measures The differential entropy of a distribution $f_\theta$ in the exponential family, defined as $H(f_\theta) = -\mathbb{E}_{f_\theta}[\log f(x \mid \theta)]$, quantifies the average uncertainty or information content in a continuous random variable $X \sim f_\theta$. For the general form $f(x \mid \theta) = h(x) \exp[\eta(\theta) \cdot T(x) - A(\theta)]$, where $h(x)$ is the base measure, $\eta(\theta)$ is the natural parameter, $T(x)$ is the sufficient statistic vector, and $A(\theta)$ is the log-partition function, the entropy expands to $H(f_\theta) = A(\theta) - \eta(\theta) \cdot \mathbb{E}_{f_\theta}[T(X)] - \mathbb{E}_{f_\theta}[\log h(X)]$.[](https://statistics.berkeley.edu/sites/default/files/tech-reports/649.pdf) This expression highlights how the entropy depends on the normalizing constant $A(\theta)$, the moment constraints via $\mathbb{E}[T(X)] = \nabla A(\theta)$, and the contribution from the base measure $h(x)$.[](https://arxiv.org/pdf/1112.4221) In the canonical parameterization, where $\theta = \eta$ and the density simplifies to $f(x \mid \theta) = h(x) \exp[\theta \cdot T(x) - A(\theta)]$, the entropy further simplifies when the base measure term is constant or absorbed, yielding $H(f_\theta) = A(\theta) - \theta \cdot \nabla A(\theta)$.[](https://arxiv.org/pdf/1112.4221) Substituting $\nabla A(\theta) = \mathbb{E}[T(X)]$, this becomes $H(f_\theta) = A(\theta) - \theta \cdot \mathbb{E}[T(X)]$, emphasizing the trade-off between the log-partition function and the expected sufficient statistics. This form is particularly useful for analyzing uncertainty in parametric models, as the convexity of $A(\theta)$ ensures the entropy is a concave function of the parameters.[](https://statistics.berkeley.edu/sites/default/files/tech-reports/649.pdf) The relative entropy, or Kullback-Leibler (KL) divergence, between two members $f_\theta$ and $f_{\theta'}$ of the exponential family measures their informational asymmetry and takes a closed-form expression $D(f_\theta \parallel f_{\theta'}) = A(\theta') - A(\theta) - (\theta' - \theta) \cdot \mathbb{E}_\theta[T(X)]$.[](https://statistics.berkeley.edu/sites/default/files/tech-reports/649.pdf) This Bregman divergence, arising from the convexity of the log-partition function, vanishes if and only if $\theta = \theta'$ and provides a natural metric for comparing distributions within the family, with applications in variational inference and model selection. The expression underscores how deviations in natural parameters translate to differences in expected statistics under one distribution. 
Exponential families possess a maximum entropy characterization: among all distributions with fixed moments $\mathbb{E}[T_i(X)] = \mu_i$ for $i = 1, \dots, d$, the member $f_\theta$ that maximizes the differential entropy $H(f)$ belongs to the exponential family, with natural parameter $\theta$ chosen such that $\nabla A(\theta) = \mu$.[](https://cs229.stanford.edu/summer2019/MaxEnt.pdf) This property positions exponential families as the least informative priors or models consistent with observed moment constraints, aligning with Jaynes' principle of maximum entropy for inductive inference.[](https://www.cs.cmu.edu/~aarti/Class/10704/lec4-maxent2.pdf)

To derive this, consider maximizing $H(f) = -\int f(x) \log f(x) \, d\nu(x)$ subject to $\int f(x) \, d\nu(x) = 1$, $f(x) \geq 0$, and $\int f(x) T_i(x) \, d\nu(x) = \mu_i$. Using Lagrange multipliers $\lambda_0$ for normalization and $\lambda_i$ for each moment constraint, the Lagrangian is $L(f, \lambda) = -\int f \log f \, d\nu + \lambda_0 \left(1 - \int f \, d\nu\right) + \sum_i \lambda_i \left(\mu_i - \int f T_i \, d\nu\right)$. Setting the functional derivative to zero yields $-\log f - 1 - \lambda_0 - \sum_i \lambda_i T_i = 0$, so $f(x) = \exp\left(-1 - \lambda_0 - \sum_i \lambda_i T_i(x)\right)$, which is the exponential family form with $\theta = -\lambda$ and $A(\theta) = 1 + \lambda_0$. The multipliers are then solved to satisfy the constraints, confirming the maximizer.[](https://cs229.stanford.edu/summer2019/MaxEnt.pdf)

## Examples

### Univariate Cases

The univariate case of the exponential family arises when both the random variable $X$ and the parameter are scalars, providing foundational examples that illustrate the general form $f(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - A(\theta)\}$. These distributions are parameterized such that the sufficient statistic $T(x)$ is low-dimensional, often one-dimensional, facilitating inference. Key examples include the normal, binomial, Poisson, exponential, and gamma distributions (with fixed shape), each of which can be expressed in this form after appropriate reparameterization.

For the normal distribution with known variance $\sigma^2 > 0$ and unknown mean $\mu \in \mathbb{R}$, the probability density function is
$$ f(x \mid \mu) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \quad x \in \mathbb{R}. $$
Expanding the exponent yields
$$ -\frac{(x - \mu)^2}{2\sigma^2} = -\frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2}, $$
so
$$ f(x \mid \mu) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right) \exp\left\{ \frac{\mu}{\sigma^2} x - \frac{\mu^2}{2\sigma^2} \right\}. $$
Here, $h(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right)$, $\eta(\theta) = \frac{\mu}{\sigma^2}$ with $\theta = \mu$, $T(x) = x$, and $A(\theta) = \frac{\mu^2}{2\sigma^2}$. In canonical form, $\eta = \frac{\mu}{\sigma^2}$ and $A(\eta) = \frac{\sigma^2 \eta^2}{2}$. Normalization holds as the integral of $f(x \mid \mu)$ over $\mathbb{R}$ equals 1 by the Gaussian integral property.[](https://www.stat.cmu.edu/~larry/=stat705/Lecture12a.pdf)
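The decomposition above can be sanity-checked numerically. The sketch below (an added illustration, not part of the original derivation) evaluates the density through $h(x)\exp\{\eta x - A(\eta)\}$ with $\eta = \mu/\sigma^2$ and $A(\eta) = \sigma^2\eta^2/2$, and compares it with `scipy.stats.norm`; the particular values of $\mu$, $\sigma^2$, and $x$ are arbitrary.

```python
# Sketch: the normal density with known variance evaluated via its
# exponential-family decomposition and compared against scipy.stats.norm.
import numpy as np
from scipy.stats import norm

mu, sigma2, x = 1.5, 2.0, 0.7
eta = mu / sigma2                                      # natural parameter
A = sigma2 * eta**2 / 2.0                              # log-partition function
h = np.exp(-x**2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)  # base measure

pdf_expfam = h * np.exp(eta * x - A)
print(pdf_expfam, norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))   # equal
```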
The binomial distribution with fixed number of trials $n \in \mathbb{N}$ and success probability $p \in (0,1)$ has probability mass function
$$ f(k \mid n, p) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \dots, n. $$
Taking the logarithm gives
$$ \log f(k \mid n, p) = \log \binom{n}{k} + k \log \frac{p}{1-p} + n \log (1-p), $$
so
$$ f(k \mid n, p) = \binom{n}{k} \exp\left\{ k \log \frac{p}{1-p} + n \log (1-p) \right\}. $$
Thus, $h(k) = \binom{n}{k}$, $\eta(\theta) = \log \frac{p}{1-p}$ with $\theta = p$, $T(k) = k$, and $A(\theta) = -n \log (1-p)$. The natural parameter space is $\eta \in \mathbb{R}$, and $A(\eta) = n \log (1 + e^\eta)$ since $p = \frac{e^\eta}{1 + e^\eta}$ and $1-p = \frac{1}{1 + e^\eta}$. Normalization is verified by the sum over $k = 0$ to $n$ equaling 1, as it recovers the binomial theorem $(p + (1-p))^n = 1$.[](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf)

For the Poisson distribution with rate parameter $\lambda > 0$,
$$ f(k \mid \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \dots. $$
This rewrites as
$$ f(k \mid \lambda) = \frac{1}{k!} \exp\left( k \log \lambda - \lambda \right), $$
with $h(k) = \frac{1}{k!}$, $\eta(\theta) = \log \lambda$ with $\theta = \lambda$, $T(k) = k$, and $A(\theta) = \lambda$. In canonical form, $\eta = \log \lambda$ so $\lambda = e^\eta$ and $A(\eta) = e^\eta$. The natural parameter space is $\eta \in \mathbb{R}$. Normalization follows from the sum $\sum_{k=0}^\infty \frac{\lambda^k}{k!} = e^\lambda$, ensuring the total probability is $e^{-\lambda} e^\lambda = 1$.[](https://www.stat.cmu.edu/~larry/=stat705/Lecture12a.pdf)

The exponential distribution with rate parameter $\lambda > 0$ has density
$$ f(x \mid \lambda) = \lambda e^{-\lambda x}, \quad x > 0. $$
This is
$$ f(x \mid \lambda) = \exp\left( \log \lambda - \lambda x \right), $$
so $h(x) = 1$ (for $x > 0$), $\eta(\theta) = -\lambda$ with $\theta = \lambda$, $T(x) = x$, and $A(\theta) = -\log \lambda$. The natural parameter space is $\eta < 0$, and $A(\eta) = -\log (-\eta)$. Normalization is confirmed by the integral $\int_0^\infty \lambda e^{-\lambda x} \, dx = 1$.[](https://www.stat.umn.edu/geyer/s16/5421/notes/expfam.pdf)

When the shape parameter $\alpha > 0$ is fixed in the gamma distribution and the rate $\beta > 0$ varies, the density is
$$ f(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0. $$
This factors as
$$ f(x \mid \alpha, \beta) = \frac{x^{\alpha-1}}{\Gamma(\alpha)} \exp\left( -\beta x + \alpha \log \beta \right), $$
yielding $h(x) = \frac{x^{\alpha-1}}{\Gamma(\alpha)}$ (for $x > 0$), $\eta(\theta) = -\beta$ with $\theta = \beta$, $T(x) = x$, and $A(\theta) = -\alpha \log \beta$. In canonical form, $A(\eta) = -\alpha \log (-\eta)$. The natural parameter space is $\eta < 0$. Normalization holds via the gamma integral $\int_0^\infty \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} \, dx = 1$.[](http://parker.ad.siu.edu/Olive/ich3.pdf)
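The factorizations above can all be checked against standard density routines. The following sketch (added for illustration only; the chosen parameter values are arbitrary) rebuilds the Poisson, exponential, and fixed-shape gamma densities from their $(h, T, \eta, A)$ components and compares them with `scipy.stats`.

```python
# Sketch: evaluating univariate exponential-family members through their
# canonical decomposition and comparing with scipy.stats densities.
import numpy as np
from scipy import stats
from scipy.special import gammaln, factorial

# Poisson: h(k) = 1/k!, T(k) = k, eta = log(lambda), A(eta) = exp(eta)
lam, k = 4.0, 7
eta = np.log(lam)
pmf_expfam = (1.0 / factorial(k)) * np.exp(eta * k - np.exp(eta))
print(pmf_expfam, stats.poisson.pmf(k, lam))

# Exponential: h(x) = 1, T(x) = x, eta = -lambda, A(eta) = -log(-eta)
rate, x = 1.5, 0.8
eta = -rate
pdf_expfam = np.exp(eta * x + np.log(-eta))
print(pdf_expfam, stats.expon.pdf(x, scale=1.0 / rate))

# Gamma, fixed shape alpha: h(x) = x^(alpha-1)/Gamma(alpha), T(x) = x,
# eta = -beta, A(eta) = -alpha * log(-eta)
alpha, beta, x = 3.0, 2.0, 1.3
eta = -beta
pdf_expfam = np.exp((alpha - 1) * np.log(x) - gammaln(alpha)
                    + eta * x + alpha * np.log(-eta))
print(pdf_expfam, stats.gamma.pdf(x, a=alpha, scale=1.0 / beta))
```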
### Multivariate Cases

The exponential family framework naturally accommodates multivariate distributions through vector-valued sufficient statistics $ \mathbf{T}(\mathbf{x}) $ and natural parameters $ \boldsymbol{\eta} $, where the density takes the form $ f(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \exp\left( \boldsymbol{\eta}^\top \mathbf{T}(\mathbf{x}) - A(\boldsymbol{\eta}) \right) $. This vector parameterization allows the family to capture dependencies among multiple random variables, as the covariance structure emerges from the second derivatives of the log-partition function $ A(\boldsymbol{\eta}) $.[](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf)[](https://cs.brown.edu/courses/cs242/lectures/lec03a_expFamily.pdf)

The multinomial distribution provides a key discrete example, describing the joint distribution of counts $ \mathbf{n} = (n_1, \dots, n_k) $ across $ k $ categories with fixed total $ n = \sum_i n_i $. Its probability mass function is $ f(\mathbf{n} \mid \mathbf{p}) = \frac{n!}{\prod_i n_i!} \prod_{i=1}^k p_i^{n_i} $, where $ \mathbf{p} = (p_1, \dots, p_k) $ lies on the simplex $ \sum_i p_i = 1 $ and $ p_i > 0 $. In canonical exponential form, the sufficient statistic is $ \mathbf{T}(\mathbf{n}) = (n_1, \dots, n_{k-1})^\top $, with the $ k $-th count determined by the total $ n $, and the natural parameters are $ \eta_i = \log(p_i / p_k) $ for $ i = 1, \dots, k-1 $. This yields probabilities $ p_i = \frac{\exp(\eta_i)}{1 + \sum_{j=1}^{k-1} \exp(\eta_j)} $ for $ i = 1, \dots, k-1 $ and $ p_k = \frac{1}{1 + \sum_{j=1}^{k-1} \exp(\eta_j)} $, with log-partition function $ A(\boldsymbol{\eta}) = n \log\left(1 + \sum_{j=1}^{k-1} \exp(\eta_j)\right) $. The base measure is $ h(\mathbf{n}) = \frac{n!}{\prod_i n_i!} \mathbf{1}_{\{\sum n_i = n, n_i \geq 0\}} $.[](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf)[](https://ocw.mit.edu/courses/18-655-mathematical-statistics-spring-2016/669dd844d5208e4a01999a71bb1155ab_MIT18_655S16_LecNote8.pdf)

For continuous multivariate cases, the $ d $-dimensional normal distribution with known covariance matrix $ \boldsymbol{\Sigma} $ and unknown mean vector $ \boldsymbol{\mu} $ illustrates the role of linear sufficient statistics. The density is $ f(\mathbf{x} \mid \boldsymbol{\mu}) = (2\pi)^{-d/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) $, which rewrites in exponential form as $ f(\mathbf{x} \mid \boldsymbol{\eta}) \propto \exp\left( \boldsymbol{\eta}^\top \mathbf{x} - A(\boldsymbol{\eta}) \right) $, where $ \boldsymbol{\eta} = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} $ is the natural parameter, $ \mathbf{T}(\mathbf{x}) = \mathbf{x} $ is the sufficient statistic, and $ A(\boldsymbol{\eta}) = \frac{1}{2} \boldsymbol{\eta}^\top \boldsymbol{\Sigma} \boldsymbol{\eta} $. The base measure incorporates the constant $ (2\pi)^{-d/2} |\boldsymbol{\Sigma}|^{-1/2} $ together with the quadratic term $ \exp\left( -\frac{1}{2} \mathbf{x}^\top \boldsymbol{\Sigma}^{-1} \mathbf{x} \right) $, which does not involve $ \boldsymbol{\mu} $. This form highlights how the vector $ \boldsymbol{\eta} $ encodes the precision-weighted mean, facilitating inference on location parameters.[](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf)[](http://www.stat.umn.edu/geyer/5421/notes/expfam.pdf)
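A short sketch (illustrative only; the mean, covariance, and evaluation point are arbitrary) makes the known-covariance case concrete by assembling the density from $\boldsymbol{\eta} = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$, $A(\boldsymbol{\eta}) = \tfrac12\boldsymbol{\eta}^\top\boldsymbol{\Sigma}\boldsymbol{\eta}$, and the base measure, and comparing with `scipy.stats.multivariate_normal`.

```python
# Sketch: d-dimensional normal with known covariance in natural form,
# eta = Sigma^{-1} mu, T(x) = x, A(eta) = 0.5 * eta' Sigma eta.
import numpy as np
from scipy.stats import multivariate_normal

d = 2
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

eta = Sigma_inv @ mu                       # natural parameter
A = 0.5 * eta @ Sigma @ eta                # log-partition function

x = np.array([0.7, 0.2])
log_h = (-0.5 * d * np.log(2 * np.pi)      # base measure, includes the
         - 0.5 * np.linalg.slogdet(Sigma)[1]   # mu-free quadratic term in x
         - 0.5 * x @ Sigma_inv @ x)

log_pdf_expfam = log_h + eta @ x - A
print(np.exp(log_pdf_expfam),
      multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # equal
```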
When both the mean $ \boldsymbol{\mu} $ and covariance $ \boldsymbol{\Sigma} $ are unknown in the multivariate normal, the distribution belongs to a higher-dimensional exponential family parameterized via vectorized forms to handle the matrix structure. For a single $ d $-dimensional observation $ \mathbf{x} $, the sufficient statistics are the vector $ \mathbf{T}_1(\mathbf{x}) = \mathbf{x} $ and the vectorized outer product $ \mathbf{T}_2(\mathbf{x}) = \mathrm{vec}(\mathbf{x} \mathbf{x}^\top) $, where $ \mathrm{vec} $ stacks the columns of the symmetric matrix $ \mathbf{x} \mathbf{x}^\top $. The natural parameters combine $ \boldsymbol{\eta}_1 = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} $ and $ \boldsymbol{\eta}_2 = -\frac{1}{2} \mathrm{vec}(\boldsymbol{\Sigma}^{-1}) $, yielding the exponent $ \boldsymbol{\eta}_1^\top \mathbf{x} + \boldsymbol{\eta}_2^\top \mathrm{vec}(\mathbf{x} \mathbf{x}^\top) - A(\boldsymbol{\eta}) $, with the log-partition function $ A(\boldsymbol{\eta}) $ involving traces and quadratic forms in $ \boldsymbol{\Sigma} $ derived from completing the square in the quadratic exponent. This Wishart-like parameterization (evident in the conjugate prior context) accounts for scale and shape via the matrix-variate parameters, though the effective dimension is reduced due to symmetry constraints on $ \boldsymbol{\Sigma} $.[](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf)[](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf)

The Dirichlet distribution, which serves as the conjugate prior for the multinomial parameters $ \mathbf{p} $, exemplifies a multivariate distribution over the simplex. For a $ k $-dimensional vector $ \mathbf{x} = (x_1, \dots, x_k) $ with $ x_i > 0 $ and $ \sum_i x_i = 1 $, the density is $ f(\mathbf{x} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \prod_{i=1}^k x_i^{\alpha_i - 1} $, where $ \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_k) $ with $ \alpha_i > 0 $. In exponential form, the sufficient statistic is $ \mathbf{T}(\mathbf{x}) = (\log x_1, \dots, \log x_k)^\top $, the natural parameters are $ \boldsymbol{\eta} = (\alpha_1 - 1, \dots, \alpha_k - 1)^\top $, and the log-partition function is $ A(\boldsymbol{\eta}) = \log \Gamma\left( \sum_i (\eta_i + 1) \right) - \sum_i \log \Gamma(\eta_i + 1) $, with base measure $ h(\mathbf{x}) = \mathbf{1}_{\{\mathbf{x} \in \mathrm{simplex}\}} $. Wait-free of numerical integration, this setup underscores the Dirichlet's role in Bayesian updating for multinomial models, where posterior parameters additively combine prior and data contributions.[](http://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/dirichlet.pdf)[](https://www.cs.columbia.edu/~jebara/4771/tutorials/lecture12.pdf)

In all these multivariate examples, the vector form of the exponential family elegantly captures inter-variable correlations through the covariance matrix of the sufficient statistic, given by the Hessian $ \nabla^2 A(\boldsymbol{\eta}) = \mathrm{Cov}(\mathbf{T}(\mathbf{x})) $. For instance, in the multinomial, this Hessian yields the covariance $ n p_i ( \delta_{ij} - p_j ) $ among counts, reflecting negative dependencies due to the fixed total. Similarly, for the multivariate normal with known $ \boldsymbol{\Sigma} $, $ \nabla^2 A(\boldsymbol{\eta}) = \boldsymbol{\Sigma} $, directly providing the dispersion structure. This property ensures that the parameterization inherently encodes joint variability, making exponential families suitable for modeling correlated data in higher dimensions.[](https://cs.brown.edu/courses/cs242/lectures/lec03a_expFamily.pdf)[](https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec19.pdf)
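The multinomial covariance quoted above can be recovered directly from the log-partition function. The sketch below (an added illustration with arbitrary $n$ and $\mathbf{p}$) computes a finite-difference Hessian of $A(\boldsymbol{\eta}) = n\log\bigl(1+\sum_j e^{\eta_j}\bigr)$ and compares it with the closed form $n\,p_i(\delta_{ij} - p_j)$ for the first $k-1$ categories.

```python
# Sketch: covariance of multinomial counts as the Hessian of the
# log-partition function, checked against the closed form.
import numpy as np

n = 10
p = np.array([0.2, 0.3, 0.5])                # k = 3 categories
eta = np.log(p[:-1] / p[-1])                 # natural parameters, length k - 1

def A(e):
    """Multinomial log-partition function for fixed total n."""
    return n * np.log(1.0 + np.sum(np.exp(e)))

h = 1e-4                                     # finite-difference step
k1 = eta.size
hess = np.zeros((k1, k1))
for i in range(k1):
    for j in range(k1):
        def A_shift(di, dj):
            e = eta.copy()
            e[i] += di * h
            e[j] += dj * h
            return A(e)
        hess[i, j] = (A_shift(1, 1) - A_shift(1, -1)
                      - A_shift(-1, 1) + A_shift(-1, -1)) / (4.0 * h * h)

closed_form = n * (np.diag(p[:-1]) - np.outer(p[:-1], p[:-1]))
print(np.round(hess, 4))
print(np.round(closed_form, 4))              # matches the Hessian
```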
### Table of Common Distributions

The exponential family includes a wide array of common probability distributions, demonstrating their shared structure and facilitating unified statistical analysis. The table below summarizes the exponential family parameterization for several prominent examples, specifying the support, base measure $h(x)$, sufficient statistic $T(x)$, natural parameter $\eta(\theta)$, log-partition function $A(\eta)$ (written in terms of the natural parameter), and relevant parameter constraints where applicable. Note that the table presents one-parameter versions of the gamma and beta distributions by fixing one shape parameter, though they are full two-parameter exponential families; the chi-squared distribution is a special case of the gamma with fixed shape $k/2$ and scale 2, included here as a degenerate (fixed-parameter) case. This selection highlights the versatility of the family, with most standard distributions fitting the form.[](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf)[](https://www.casact.org/sites/default/files/database/dpp_dpp04_04dpp117.pdf)[](https://www.statlect.com/fundamentals-of-statistics/exponential-family-of-distributions)[](https://www.stat.purdue.edu/~dasgupta/expfamily.pdf)

| Distribution | Support | $ h(x) $ | $ T(x) $ | $ \eta(\theta) $ | $ A(\eta) $ | Parameter Constraints |
|---|---|---|---|---|---|---|
| Bernoulli | $ x \in \{0, 1\} $ | 1 | $ x $ | $ \log \frac{p}{1-p} $ | $ \log(1 + e^{\eta}) $ | $ 0 < p < 1 $, $ \eta \in \mathbb{R} $ |
| Binomial (fixed $ n $) | $ x = 0, 1, \dots, n $ | $ \binom{n}{x} $ | $ x $ | $ \log \frac{p}{1-p} $ | $ n \log(1 + e^{\eta}) $ | $ 0 < p < 1 $, $ n > 0 $ integer |
| Poisson | $ x = 0, 1, 2, \dots $ | $ \frac{1}{x!} $ | $ x $ | $ \log \lambda $ | $ e^{\eta} $ | $ \lambda > 0 $, $ \eta \in \mathbb{R} $ |
| Geometric (trials until first success minus 1) | $ x = 0, 1, 2, \dots $ | 1 | $ x $ | $ \log(1 - p) $ | $ -\log(1 - e^{\eta}) $ | $ 0 < p < 1 $, $ \eta < 0 $ |
| Negative Binomial (fixed $ r $, failures before $ r $ successes) | $ x = 0, 1, 2, \dots $ | $ \binom{x + r - 1}{x} $ | $ x $ | $ \log(1 - p) $ | $ -r \log(1 - e^{\eta}) $ | $ 0 < p < 1 $, $ r > 0 $, $ \eta < 0 $ |
| Exponential (gamma with shape 1) | $ x > 0 $ | 1 | $ -x $ | $ \beta $ (rate) | $ -\log \eta $ | $ \beta > 0 $, $ \eta > 0 $ |
| Gamma (fixed shape $ \alpha $) | $ x > 0 $ | $ \frac{x^{\alpha - 1}}{\Gamma(\alpha)} $ | $ -x $ | $ \beta $ (rate) | $ -\alpha \log \eta $ | $ \alpha > 0 $ fixed, $ \beta > 0 $, $ \eta > 0 $ |
| Chi-squared (fixed df $ k $) | $ x > 0 $ | $ \frac{x^{k/2 - 1}}{2^{k/2} \Gamma(k/2)} $ | $ -x/2 $ | 1 (fixed scale 2) | constant (degenerate case) | $ k $ positive integer, fixed |
| Normal (mean $ \mu $, variance $ \sigma^2 $) | $ x \in \mathbb{R} $ | $ \frac{1}{\sqrt{2\pi}} $ | $ [x, x^2] $ | $ [\mu / \sigma^2, -1/(2\sigma^2)] $ | $ -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2} \log(-2\eta_2) $ | $ \sigma^2 > 0 $, $ \eta_2 < 0 $ |
| Beta (fixed shape $ \alpha $, varying $ \beta $) | $ x \in (0,1) $ | $ x^{\alpha - 1} $ | $ \log(1 - x) $ | $ \beta - 1 $ | $ \log B(\alpha, \eta + 1) $ | $ \alpha > 0 $ fixed, $ \beta > 0 $, $ \eta > -1 $ |
| Inverse Gaussian ($ \mu $, $ \lambda $) | $ x > 0 $ | $ (2\pi x^3)^{-1/2} $ | $ [x, 1/x] $ | $ [-\frac{\lambda}{2\mu^2}, -\frac{\lambda}{2}] $ | $ -2 \sqrt{\eta_1 \eta_2} - \frac{1}{2} \log(-2 \eta_2) $ | $ \mu > 0 $, $ \lambda > 0 $, $ \eta_1 < 0 $, $ \eta_2 < 0 $ |
| Von Mises (fixed location $ \mu = 0 $) | $ x \in [0, 2\pi) $ | $ \frac{1}{2\pi} $ | $ \cos x $ | $ \kappa $ (concentration) | $ \log I_0(\eta) $ | $ \kappa \geq 0 $, $ \eta \geq 0 $; $ I_0 $ modified Bessel function |
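Rows of the table can be verified in the same way as the earlier univariate examples. As one illustration (added here; values arbitrary), the two-parameter normal row with $T(x) = [x, x^2]$ reproduces the usual density when its natural parameters and log-partition function are plugged into $h(x)\exp\{\eta_1 x + \eta_2 x^2 - A(\boldsymbol{\eta})\}$.

```python
# Sketch: checking the two-parameter normal row of the table against scipy.
import numpy as np
from scipy.stats import norm

mu, sigma2, x = 1.2, 0.8, 0.5
eta1, eta2 = mu / sigma2, -1.0 / (2.0 * sigma2)        # natural parameters
A = -eta1**2 / (4.0 * eta2) - 0.5 * np.log(-2.0 * eta2)  # log-partition function
h = 1.0 / np.sqrt(2.0 * np.pi)                          # base measure

pdf_expfam = h * np.exp(eta1 * x + eta2 * x**2 - A)
print(pdf_expfam, norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))   # equal
```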
## Applications in Statistics

### Classical Estimation

In the context of exponential families, classical estimation refers to frequentist methods for parameter inference, which exploit the structure of the family to derive optimal estimators and tests. The exponential family form facilitates explicit expressions for likelihood-based procedures, leveraging the sufficient statistic $T(\mathbf{x})$ to simplify computations. These methods emphasize properties like consistency, efficiency, and power, grounded in asymptotic theory and completeness.

Maximum likelihood estimation (MLE) in an exponential family is particularly tractable due to the separability of the log-likelihood. For $n$ independent observations from a density $f(x_i \mid \theta) = h(x_i) \exp[\eta(\theta) \cdot T(x_i) - A(\theta)]$, the log-likelihood is $\ell(\theta) = \sum_{i=1}^n [\eta(\theta) \cdot T(x_i) - A(\theta)] + \sum_{i=1}^n \log h(x_i)$. For the natural parameter $\eta$, the score function simplifies to $\nabla \ell(\eta) = n [\bar{T} - \mathbb{E}_\eta[T]]$, where $\bar{T} = n^{-1} \sum T(x_i)$ is the observed average sufficient statistic. Setting the score to zero yields the estimating equation $\nabla A(\eta) = \bar{T}$, which is solved for the natural parameter $\eta$ and back-transformed to $\theta$. This procedure ensures the MLE $\hat{\theta}$ is a function of the complete sufficient statistic $T$, inheriting desirable properties from the family structure.[](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf)

For unbiased estimation, the completeness of the sufficient statistic in full-rank exponential families enables the construction of uniformly minimum variance unbiased estimators (UMVUEs) via the Lehmann–Scheffé theorem. Specifically, if $U$ is an unbiased estimator of a parameter function $g(\theta)$, then the conditional expectation $\mathbb{E}[U \mid T]$ is the unique UMVUE, as completeness ensures no other unbiased estimator based on $T$ has lower variance. This theorem underscores the efficiency of conditioning on the sufficient statistic, reducing the problem to estimating within the minimal sufficient subspace spanned by $T$.

In hypothesis testing, one-parameter exponential families exhibit a monotone likelihood ratio (MLR) in the sufficient statistic $T$. For testing a simple null $H_0: \theta = \theta_0$ against a simple alternative $H_1: \theta = \theta_1 > \theta_0$, the likelihood ratio $\Lambda = \frac{f(\mathbf{x} \mid \theta_0)}{f(\mathbf{x} \mid \theta_1)}$ is a decreasing function of $T$, so by the Neyman–Pearson lemma the most powerful test rejects for large values of $T$. The MLR property extends this to one-sided composite hypotheses, yielding uniformly most powerful (UMP) tests that reject $H_0: \theta \leq \theta_0$ versus $H_1: \theta > \theta_0$ for $T > c$, where $c$ is chosen to control the size.

Asymptotically, the MLE in exponential families is consistent and asymptotically efficient, attaining the Cramér–Rao lower bound. In the natural parameterization, the per-observation Fisher information is $I_1(\eta) = -\mathbb{E}[\partial^2 \log f(X \mid \eta) / \partial \eta^2] = \nabla^2 A(\eta) = \mathrm{Var}_\eta(T(X))$, and the asymptotic variance of $\sqrt{n}(\hat{\eta} - \eta)$ is $I_1(\eta)^{-1}$, reflecting the variability of the sufficient statistic. This efficiency holds under regularity conditions, such as differentiability of $A(\theta)$, ensuring the MLE's normal approximation for large samples.
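The estimating equation $\nabla A(\eta) = \bar{T}$ can be solved numerically even when a closed form exists, which makes it a convenient template. The sketch below (added for illustration; synthetic data, arbitrary parameters) does this for the gamma family with fixed shape $\alpha$, where $A(\eta) = -\alpha\log(-\eta)$ and the closed-form MLE of the rate is $\hat{\beta} = \alpha / \bar{x}$.

```python
# Sketch: maximum likelihood in a one-parameter exponential family by
# solving the score equation A'(eta) = mean(T(x)), gamma with fixed shape.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
alpha, beta_true = 3.0, 2.0
x = rng.gamma(shape=alpha, scale=1.0 / beta_true, size=5000)

T_bar = x.mean()                       # observed average sufficient statistic

def score(eta):
    # A'(eta) - T_bar, with A(eta) = -alpha * log(-eta), eta < 0
    return (-alpha / eta) - T_bar

eta_hat = brentq(score, -1e6, -1e-9)   # root of the score equation
beta_hat = -eta_hat
print(beta_hat, alpha / T_bar)         # numerical and closed-form MLE agree
```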
A concrete illustration is hypothesis testing for the Poisson distribution, which belongs to the one-parameter exponential family with $T = \sum x_i$ and natural parameter $\eta = \log \lambda$. For $H_0: \lambda = \lambda_0$ versus $H_1: \lambda > \lambda_0$, the UMP level-$\alpha$ test rejects if $T > c$, where $c$ is chosen so that $\mathbb{P}(T > c \mid \lambda_0) \leq \alpha$ with $T \sim \mathrm{Poisson}(n \lambda_0)$ (exact size $\alpha$ requires randomization because $T$ is discrete). The test's power function is strictly increasing in $\lambda$, confirming its uniformity.
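The critical value and power function of this test are straightforward to compute. The sketch below (added illustration; $n$, $\lambda_0$, and $\alpha$ are arbitrary) takes $c$ as the smallest integer with $\mathbb{P}(T > c \mid \lambda_0) \leq \alpha$ and tabulates the power at a few alternatives.

```python
# Sketch: UMP one-sided Poisson test, critical value and power function.
import numpy as np
from scipy.stats import poisson

n, lam0, alpha = 20, 1.5, 0.05
mean0 = n * lam0

# smallest c with P(T > c | lambda_0) <= alpha, i.e. the (1 - alpha) quantile
c = int(poisson.ppf(1 - alpha, mean0))
print("critical value:", c, " actual size:", poisson.sf(c, mean0))

# power P(T > c | lambda) is increasing in lambda
for lam in (1.5, 1.8, 2.1, 2.4):
    print(lam, poisson.sf(c, n * lam))
```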
### Bayesian Inference

In Bayesian inference for distributions belonging to the exponential family, conjugate priors play a central role due to their computational advantages. For a likelihood of the form $ p(\mathbf{x} \mid \eta) = \exp\left( \sum_i T(x_i) \cdot \eta - n A(\eta) + \sum_i B(x_i) \right) $, where $\eta$ is the natural parameter, $T(\cdot)$ is the sufficient statistic, $A(\cdot)$ is the log-partition function, $B(x_i) = \log h(x_i)$ is the log base measure, and $n$ is the sample size, a conjugate prior on $\eta$ takes the form $\pi(\eta) \propto \exp\left( \nu \cdot \eta - b A(\eta) \right)$, with hyperparameters $\nu$ and $b > 0$.[](https://projecteuclid.org/journals/annals-of-statistics/volume-7/issue-2/Conjugate-Priors-for-Exponential-Families/10.1214/aos/1176344611.full) This prior ensures that the posterior distribution remains within the same family after observing data, specifically updating to $\pi(\eta \mid \mathbf{x}) \propto \exp\left( (\nu + \sum_i T(x_i)) \cdot \eta - (b + n) A(\eta) \right)$, yielding updated hyperparameters $\nu' = \nu + \sum_i T(x_i)$ and $b' = b + n$.[](https://projecteuclid.org/journals/annals-of-statistics/volume-7/issue-2/Conjugate-Priors-for-Exponential-Families/10.1214/aos/1176344611.full)

The conjugacy property arises because the exponential family structure allows the prior to be expressed in a form that mirrors the likelihood, facilitating closed-form posterior updates without numerical integration.[](https://projecteuclid.org/journals/annals-of-statistics/volume-7/issue-2/Conjugate-Priors-for-Exponential-Families/10.1214/aos/1176344611.full) In the canonical parameterization, the prior itself belongs to an exponential family on $\eta$, which simplifies moment calculations and predictive distributions.[](https://projecteuclid.org/journals/annals-of-statistics/volume-7/issue-2/Conjugate-Priors-for-Exponential-Families/10.1214/aos/1176344611.full) This tractability is particularly valuable in scenarios requiring repeated inference, such as sequential updating or simulation-based methods.

Common examples illustrate this framework. For the normal distribution with known variance (an exponential family in the mean parameter), the conjugate prior is normal, updating the prior mean and precision with sample statistics.[](https://www.jstor.org/stable/2958808) For the binomial distribution, the beta prior on the success probability is conjugate, with posterior parameters $\alpha' = \alpha + \text{successes}$ and $\beta' = \beta + \text{failures}$.[](https://www.jstor.org/stable/2958808) Similarly, for the Poisson distribution, a gamma prior on the rate parameter yields a gamma posterior, updating shape and rate hyperparameters based on observed counts and sample size.[](https://www.jstor.org/stable/2958808)

Posterior inference for moments leverages the exponential family cumulant function. The posterior mean of the sufficient statistic expectation is $\mathbb{E}_{\pi'}[ \nabla A(\eta) ] = \frac{\nu'}{b'}$, derived from differentiating the log-posterior, while the posterior variance satisfies, to first order, $\text{Var}_{\pi'}( \nabla A(\eta) ) \approx b'^{-1} \nabla^2 A(\mathbb{E}_{\pi'}[\eta])$.[](https://projecteuclid.org/journals/annals-of-statistics/volume-7/issue-2/Conjugate-Priors-for-Exponential-Families/10.1214/aos/1176344611.full) These expressions provide direct summaries of parameter uncertainty without sampling.

In hierarchical models, the exponential family structure with conjugate priors enables efficient computation of marginal likelihoods, facilitating Bayes factors for model comparison and selection.[](https://www.jstor.org/stable/4616270) For instance, integrating out parameters layer by layer preserves conjugacy, yielding tractable expressions for evidence in multi-level setups.[](https://www.jstor.org/stable/4616270)
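The gamma–Poisson pair makes the generic update $\nu' = \nu + \sum_i T(x_i)$, $b' = b + n$ concrete. The sketch below (added illustration; synthetic data and prior hyperparameters are arbitrary) performs the update in the usual (shape, rate) form and in the generic hyperparameter form, and checks that $\nu'/b'$ equals the posterior mean of $\lambda = \nabla A(\eta)$.

```python
# Sketch: conjugate updating for Poisson counts with a gamma prior on the rate.
import numpy as np

rng = np.random.default_rng(1)
lam_true = 4.0
x = rng.poisson(lam_true, size=50)
n, sum_x = x.size, x.sum()

# Usual conjugate update: Gamma(a0, b0) prior on lambda -> Gamma(a0 + sum x, b0 + n)
a0, b0 = 2.0, 1.0
a_post, b_post = a0 + sum_x, b0 + n
print("posterior mean of lambda:", a_post / b_post)

# Generic exponential-family form on eta = log(lambda):
# prior ~ exp(nu * eta - b * A(eta)) with A(eta) = exp(eta); this corresponds
# to a Gamma(nu, b) prior on lambda, so here nu = a0 and b = b0.
nu, b = a0, b0
nu_post, b_post2 = nu + sum_x, b + n
print("nu'/b' =", nu_post / b_post2)     # same value: E[grad A(eta)] = nu'/b'
```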
### Generalized Linear Models

Generalized linear models (GLMs) provide a unified framework for regression analysis where the response variable follows a distribution from the exponential family, extending classical linear models to accommodate non-normal responses such as binary, count, or positive continuous data. In this setup, each observation $ y_i $ is assumed to be independently distributed according to an exponential family density with mean $ \mu_i = g^{-1}(X_i \beta) $, where $ g $ is the link function relating the mean to the linear predictor $ \eta_i = X_i \beta $, $ X_i $ is the covariate vector, and $ \beta $ is the parameter vector. This structure allows the mean of the response to depend on predictors through a flexible transformation, enabling applications like logistic regression for binary outcomes or Poisson regression for counts.[](https://www.routledge.com/Generalized-Linear-Models/McCullagh-Nelder/p/book/9780412317606)[](https://www.jstor.org/stable/2344614)

The canonical link function, $ g(\mu) = \theta $, where $ \theta $ is the natural parameter of the exponential family, is particularly advantageous as it aligns the linear predictor directly with the natural parameter, $ \eta = \theta = x^\top \beta $, simplifying maximum likelihood estimation (MLE). Under the canonical link, the score equations for MLE reduce to a weighted least squares problem in which the weights are inversely proportional to the variance of the working response. This property facilitates efficient computation and interpretation, as the sufficient statistics from the exponential family directly inform the regression coefficients.[](https://www.routledge.com/Generalized-Linear-Models/McCullagh-Nelder/p/book/9780412317606)[](https://www.jstor.org/stable/2344614)

Model fit in GLMs is often assessed using the deviance, a likelihood-based criterion analogous to the residual sum of squares in linear models. The deviance is defined as $ D = 2 \sum_{i=1}^n \left[ l_i(\hat{\mu}_i^{\text{sat}}; y_i) - l_i(\hat{\mu}_i; y_i) \right] $, where $ l_i $ is the log-likelihood contribution, $ \hat{\mu}_i^{\text{sat}} = y_i $ for the saturated model, and $ \hat{\mu}_i $ is the fitted mean. For exponential family distributions, this is $ D = 2 \sum_{i=1}^n d(y_i; \hat{\mu}_i) $, where the unit deviance is $ d(y; \mu) = 2 \left[ y (\theta(y) - \theta(\mu)) - A(\theta(y)) + A(\theta(\mu)) \right] $, with $\theta$ the natural parameter such that $\mu = \nabla A(\theta)$ (up to adjustments for the dispersion parameter), providing a measure of discrepancy between observed and predicted values that follows an approximate chi-squared distribution under the null for nested models.[](https://www.routledge.com/Generalized-Linear-Models/McCullagh-Nelder/p/book/9780412317606)[](https://www.cs.columbia.edu/~blei/fogm/2022F/readings/Efron2018.pdf)

Parameter estimation in GLMs typically employs the iteratively reweighted least squares (IRLS) algorithm, which approximates the MLE through successive Newton–Raphson iterations. At each step, a working response $ z_i = \eta_i + (y_i - \mu_i) g'(\mu_i) $ is constructed, and weights $ w_i = 1 / [V(\mu_i) (g'(\mu_i))^2] $ are applied, where $ V(\mu) $ is the variance function derived from the exponential family as $ V(\mu) = \frac{d^2 A(\theta)}{d \theta^2} $ (with $\theta$ the natural parameter such that $\mu = \nabla A(\theta)$, scaled by the dispersion $\phi$ if present). IRLS converges quickly for canonical links and handles the nonlinearity inherent in non-Gaussian responses.[](https://www.routledge.com/Generalized-Linear-Models/McCullagh-Nelder/p/book/9780412317606)[](https://www.jstor.org/stable/2344614)[](https://rls.sites.oasis.unc.edu/s556-2019/GLM-Handout.pdf)

Extensions of the basic GLM framework address common violations of exponential family assumptions, such as overdispersion, where the variance exceeds the mean-based prediction (e.g., in count data with clustering). Overdispersion is accommodated by scaling the variance with a dispersion parameter $ \phi > 1 $ in quasi-likelihood estimation, preserving the mean structure while adjusting standard errors. Zero-inflated models further extend GLMs for responses with excess zeros, combining a point mass at zero with an exponential family component (e.g., zero-inflated Poisson), modeled via a mixture where the zero probability is regressed on covariates separately from the positive part. These extensions maintain the GLM's iterative estimation while enhancing flexibility for real-world data.[](https://www.routledge.com/Generalized-Linear-Models/McCullagh-Nelder/p/book/9780412317606)[](https://www.stata-press.com/books/generalized-linear-models-and-extensions/)

The GLM framework was developed by Nelder and Wedderburn in 1972, providing a cohesive approach that unifies diverse models including linear regression, logistic regression for binary data, and probit models, all under the exponential family umbrella.[](https://www.jstor.org/stable/2344614)
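The working-response and weight formulas above translate directly into a few lines of code. The sketch below (an added illustration on synthetic data, not a production implementation) fits a Poisson GLM with the canonical log link by IRLS, where $g'(\mu) = 1/\mu$ and $V(\mu) = \mu$, so $w_i = \mu_i$.

```python
# Sketch: IRLS for a Poisson GLM with the canonical log link.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, 0.8, -0.4])
y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(p)
for _ in range(25):
    eta = X @ beta                    # linear predictor
    mu = np.exp(eta)                  # inverse of the canonical log link
    z = eta + (y - mu) / mu           # working response, g'(mu) = 1/mu
    w = mu                            # weights 1 / [V(mu) g'(mu)^2] = mu
    WX = X * w[:, None]
    beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)   # weighted least squares
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

print(beta)                           # close to beta_true for moderate n
```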