Graphical Models
from Wikipedia

Graphical Models is an academic journal in computer graphics and geometry processing published by Elsevier. As of 2021, its editor-in-chief is Bedrich Benes of Purdue University.[1]

History


This journal has gone through multiple names.[2] Founded in 1972 as Computer Graphics and Image Processing[3] by Azriel Rosenfeld, it was the first journal to focus on computer image analysis.[4][5] Its first change of name came in 1983, when it became Computer Vision, Graphics, and Image Processing.[6] In 1991, it split into two journals, CVGIP: Graphical Models and Image Processing[7] and CVGIP: Image Understanding, which later became Computer Vision and Image Understanding.[8] In 1995, CVGIP: Graphical Models and Image Processing dropped the "CVGIP" prefix,[9] and in 2002 it took its current title, Graphical Models.[10]

Ranking


Although initially ranked by SCImago Journal Rank as a top-quartile journal in 1999 in its main topic areas, computer graphics and computer-aided design, and then for many years ranked as second-quartile, by 2020 it had fallen to the third quartile.[11]

from Grokipedia
Graphical models, also known as probabilistic graphical models, are a mathematical framework that combines probability theory and graph theory to compactly represent joint probability distributions over a set of random variables by exploiting conditional independencies among them. The concept was pioneered in the 1980s by Judea Pearl with Bayesian networks. These models provide a principled approach to handling uncertainty in complex systems through probabilistic reasoning and manage structural complexity via graphical encodings of dependencies, enabling efficient computation where full joint distributions would be exponentially large: for n binary variables, a complete joint table requires 2^n − 1 parameters, but graphical models factorize this into far fewer local components. They unify diverse statistical and machine learning techniques, including hidden Markov models and Kalman filters, and have gained prominence since the 1990s for applications in artificial intelligence, computer vision, natural language processing, and bioinformatics.

The core components of graphical models include a graph whose nodes represent random variables (discrete or continuous) and whose edges encode probabilistic relationships, along with parameters specifying local probability distributions. There are two primary types: directed graphical models, or Bayesian networks, which use directed acyclic graphs (DAGs) to model causal or hierarchical dependencies, with the joint distribution factorizing as P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i)), where \mathrm{Pa}(X_i) denotes the parents of X_i; and undirected graphical models, or Markov random fields, which employ undirected graphs to capture symmetric associations, factorizing as P(X) = \frac{1}{Z} \prod_{C} \psi_C(X_C) over cliques C, with Z the normalizing partition function and \psi_C the potential functions.

Independencies are queried using criteria like d-separation in Bayesian networks (which blocks paths through conditioned nodes or colliders) or separation in Markov fields (blocking paths via observed nodes), allowing modular inference algorithms such as belief propagation for exact results on tree-structured graphs, or approximations like loopy belief propagation and Markov chain Monte Carlo for loopy ones. Graphical models support key tasks including inference (computing marginals or most likely explanations), parameter learning from data via maximum likelihood or Bayesian methods, and structure learning to discover graph topologies, often exploiting low treewidth for tractability despite NP-hard complexity in general cases. Their representational power stems from the Hammersley-Clifford theorem, which for positive distributions equates the graph's Markov properties with factorization over cliques (and hence graph separation with conditional independence), and from local Markov properties that ensure each variable is independent of non-adjacent ones given its neighbors. This framework has revolutionized probabilistic modeling by enabling scalable solutions to real-world problems involving uncertainty, such as medical diagnosis, genetic analysis, and autonomous systems, with ongoing advances in scalable inference for massive datasets.
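The parameter-count argument above can be checked directly. The sketch below (a hypothetical 10-variable chain, chosen for illustration) compares the 2^n − 1 free parameters of a full joint table over binary variables with the sum of local CPT sizes in a factorized model:

```python
# Parameter counting for binary variables: a full joint table needs 2**n - 1
# free parameters, while a factorized model needs one free parameter per
# parent configuration of each node. The chain structure here is illustrative.

def full_joint_params(n):
    """Free parameters in a complete joint table over n binary variables."""
    return 2 ** n - 1

def factorized_params(parents):
    """Free parameters for binary nodes, given a dict node -> list of parents."""
    return sum(2 ** len(ps) for ps in parents.values())

# A 10-variable chain X1 -> X2 -> ... -> X10: each node has at most one parent.
chain = {f"X{i}": ([f"X{i-1}"] if i > 1 else []) for i in range(1, 11)}

print(full_joint_params(10))      # 1023
print(factorized_params(chain))   # 1 + 9*2 = 19
```

The gap widens exponentially with n, which is the whole point of the factorization.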

Overview

Definition and Motivation

Graphical models are probabilistic models that represent multivariate probability distributions through a graph structure, where nodes correspond to random variables and edges encode conditional dependencies between them. This graphical representation allows the joint distribution over the variables to be factored into a product of simpler terms, each defined over smaller subsets of variables connected in the graph. The framework unifies concepts from probability theory and graph theory, enabling the compact encoding of complex relationships in high-dimensional data.

The motivation for graphical models arose in the late 1980s from the need to handle uncertainty and incomplete information in artificial intelligence and statistics, where traditional tabular representations of joint distributions become infeasible for large numbers of variables. Pioneering work by Judea Pearl introduced directed graphical models, known as Bayesian networks, to model causal and probabilistic reasoning in intelligent systems. Concurrently, statisticians like Steffen L. Lauritzen advanced undirected models to capture symmetric dependencies, addressing challenges in data assimilation and inference under uncertainty. These developments provided a principled way to reason about multivariate systems, such as in medical diagnosis or pattern recognition, by exploiting conditional independences to simplify computations that would otherwise be intractable.

A basic example illustrates this simplification: consider two random variables X and Y with a dependency, represented by a graph with nodes for X and Y connected by an edge. The joint probability distribution factors as p(X, Y) = p(X) \cdot p(Y \mid X), avoiding the need to specify the full p(X, Y) table directly and highlighting how the graph guides the decomposition.
This factorization leverages the conditional dependency encoded by the edge, making the model more intuitive and computationally efficient than enumerating all possible outcomes. Key advantages of graphical models include their visual interpretability, which aids in understanding and communicating complex probabilistic structures; modularity, allowing models to be built and extended by combining subgraphs; and scalability to large datasets by focusing computations on local dependencies rather than global joints. Directed and undirected graphs offer complementary perspectives, with the former emphasizing causal flows and the latter symmetric associations.

Key Components

Graphical models represent probabilistic relationships among random variables through a graph structure, where the nodes and edges encode dependencies. The fundamental elements include nodes, which correspond to random variables that can be either discrete or continuous. For instance, a discrete node might represent a binary outcome like "rain" or "no rain," while a continuous node could model temperature values. Directed edges in the graph signify causal or conditional dependencies, pointing from parent variables to child variables, whereas undirected edges indicate mutual associations without a specified direction. These graph elements form the backbone of the model, allowing for compact representation of complex joint distributions.

Associated with these graph structures are probability elements that quantify the dependencies. In undirected graphical models, clique potentials—also known as factors—assign non-negative values to configurations of variables within fully connected subgraphs (cliques), which are then normalized to yield the joint probability distribution. For directed models, conditional probability tables (CPTs) or functions specify the probability of a child variable given its parents, enabling factorization of the joint distribution along the graph's topology. These potentials and factors ensure that the model captures local interactions efficiently, avoiding explicit enumeration of all variable combinations.

The graph structure also embodies separation criteria that link topology to probabilistic properties, such as conditional independence. Markov properties define how the graph implies independencies: in directed graphs, d-separation determines whether a conditioning set blocks all paths between two sets of variables, implying conditional independence; in undirected graphs, a node's Markov blanket is simply its set of neighbors, which shields it from the rest of the network (in directed graphs, the blanket comprises the node's parents, children, and co-parents).
These criteria provide a visual and algorithmic means to assess independencies without full probability computations. Standard notation in the field includes G = (V, E) for the graph with vertices V (nodes) and edges E, X = \{X_1, \dots, X_n\} for the set of random variables, and P(X) for their joint distribution. This notation facilitates precise description and analysis of models. These components tie into broader probabilistic foundations, where the graph structure encodes conditional independences that underpin efficient representation and inference.

Types

Directed Graphical Models

Directed graphical models, also known as Bayesian networks, represent probabilistic relationships among a set of random variables using directed acyclic graphs (DAGs). In these models, nodes correspond to variables, and directed edges indicate conditional dependencies, where an edge from node A to node B signifies that B depends directly on A. This structure allows for compact encoding of joint probability distributions, particularly useful in modeling causal or influence relationships in complex systems.

The joint probability distribution over all variables X_1, \dots, X_n in a Bayesian network factorizes according to the graph structure as the product of each variable's conditional probability given its parents:

P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Parents}(X_i))

Here, \mathrm{Parents}(X_i) denotes the set of nodes with direct edges pointing to X_i, and each P(X_i \mid \mathrm{Parents}(X_i)) is typically represented by a conditional probability table (CPT) or function. This factorization exploits conditional independencies encoded by the graph, reducing the number of parameters needed compared to a full joint distribution table.

A defining property of Bayesian networks is their acyclicity, which ensures no directed cycles exist, preventing feedback loops and allowing variables to be ordered topologically. This topological order facilitates efficient computation by enabling sequential evaluation of probabilities from root nodes (those without parents) to leaves.

A classic example is the student grade diagnostic model, featuring variables for course difficulty (D), student intelligence (I), grade (G), and SAT score (S). Edges point from D and I to G (indicating G depends on both), and from I to S (S depends on I), capturing how these factors influence outcomes without implying symmetry, unlike in undirected models.
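The student-network factorization can be written out concretely. In the sketch below the CPT numbers are invented for illustration (they are not from any published version of the example); the point is that multiplying the local conditionals along the graph yields a valid joint distribution:

```python
# Student network D -> G <- I, I -> S with made-up CPTs; the joint is the
# product P(D) P(I) P(G | D, I) P(S | I) and must sum to 1 over all states.
import itertools

p_D = {0: 0.6, 1: 0.4}                      # course difficulty
p_I = {0: 0.7, 1: 0.3}                      # student intelligence
p_S_given_I = {0: {0: 0.95, 1: 0.05},       # SAT score depends on I
               1: {0: 0.2,  1: 0.8}}
p_G_given_DI = {(0, 0): {0: 0.3,  1: 0.7},  # grade depends on D and I
                (0, 1): {0: 0.05, 1: 0.95},
                (1, 0): {0: 0.7,  1: 0.3},
                (1, 1): {0: 0.5,  1: 0.5}}

def joint(d, i, g, s):
    # P(D, I, G, S) = P(D) P(I) P(G | D, I) P(S | I)
    return p_D[d] * p_I[i] * p_G_given_DI[(d, i)][g] * p_S_given_I[i][s]

total = sum(joint(*x) for x in itertools.product([0, 1], repeat=4))
print(round(total, 10))  # 1.0: the factorization defines a valid distribution
```

Note that only 1 + 1 + 4 + 2 = 8 free parameters are specified here, versus 15 for a full joint table over four binary variables.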

Undirected Graphical Models

Undirected graphical models, also known as Markov random fields (MRFs), represent joint probability distributions over random variables using undirected graphs, where nodes correspond to variables and edges encode symmetric mutual influences between them, without implying directionality or causality. In contrast to directed models, which are limited to acyclic structures, MRFs naturally accommodate cycles, allowing for flexible modeling of complex, bidirectional dependencies.

The joint probability distribution in an MRF factorizes over the maximal cliques of the graph, expressed as

P(X) = \frac{1}{Z} \prod_{c \in C} \psi_c(X_c),

where X denotes the set of all random variables, C is the set of cliques, \psi_c(X_c) are non-negative potential functions defined on the variables X_c in clique c, and Z is the normalization constant (partition function) ensuring the distribution sums to one. This factorization captures local interactions, with potentials often modeling compatibility or energy terms for subsets of variables.

MRFs exhibit key probabilistic properties, including the local Markov property, which states that a variable is conditionally independent of all non-neighbors given its neighbors in the graph. The Hammersley-Clifford theorem establishes that any positive MRF (with strictly positive joint distribution) is equivalent to a Gibbs distribution, linking the factorization to energy-based formulations where potentials derive from an underlying energy function.

A classic example is image denoising, where pixel intensities form the random variables on a grid graph, and pairwise potentials between neighboring pixels encourage smoothness by penalizing large intensity differences, facilitating restoration of noisy images through probabilistic inference.
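A toy version of the denoising setup makes the clique factorization tangible. The sketch below uses a hypothetical 2x2 grid of binary "pixels" with pairwise potentials that reward equal neighbors, computes the partition function Z by brute-force enumeration, and checks that uniform configurations are the most probable:

```python
# 2x2-pixel MRF with binary intensities; pairwise potentials favor equal
# neighbors, a toy analogue of the smoothness prior used in image denoising.
import itertools, math

edges = [(0, 1), (2, 3), (0, 2), (1, 3)]   # 2x2 grid, nodes 0..3

def potential(a, b, beta=1.0):
    # psi(a, b) = exp(beta) if a == b else exp(-beta)
    return math.exp(beta if a == b else -beta)

def unnormalized(x):
    p = 1.0
    for (u, v) in edges:
        p *= potential(x[u], x[v])
    return p

states = list(itertools.product([0, 1], repeat=4))
Z = sum(unnormalized(x) for x in states)            # partition function
probs = {x: unnormalized(x) / Z for x in states}

# The two uniform configurations are the most probable under this prior.
best = max(probs, key=probs.get)
print(best in [(0, 0, 0, 0), (1, 1, 1, 1)])  # True
```

With observation terms added per pixel, the same construction becomes a minimal denoising model; here only the prior is shown.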

Hybrid and Other Models

Hybrid graphical models, such as chain graphs, integrate directed and undirected edges to represent mixed dependencies in probabilistic systems, allowing for both acyclic directed components (like Bayesian networks) and cyclic undirected components (like Markov random fields) within the same structure. Introduced by Cox and Wermuth in 1993, chain graphs model linear dependencies through chain components that separate sets of variables, enabling the representation of conditional independencies that neither pure directed nor undirected models can fully capture. A later formulation by Andersson, Madigan, and Perlman in 2001 provided alternative Markov properties for chain graphs, emphasizing recursive factorizations and global/local Markov compatibilities to handle more complex independence structures.

Factor graphs offer a bipartite graphical representation that generalizes both directed and undirected models by distinguishing between variable nodes and factor nodes, where factors encapsulate local functions over subsets of variables. This separation facilitates efficient computation of marginals via message-passing algorithms, such as the sum-product algorithm introduced by Kschischang, Frey, and Loeliger in 2001, which unifies inference across diverse model types. Factor graphs are particularly advantageous for expressing the joint probability distribution as a product of factors, providing a flexible framework for non-standard dependencies without the need for orienting edges.

Other specialized variants extend graphical models to specific domains. Dynamic Bayesian networks (DBNs), as detailed by Murphy in 2002, model time-series data by unfolding static Bayesian networks across time slices, capturing temporal dependencies through intra-slice directed edges and inter-slice links, which is essential for applications like speech recognition and bioinformatics.
Gaussian graphical models (GGMs), rooted in the work of Lauritzen (1996), represent multivariate normal distributions using undirected graphs where edges encode partial correlations; the sparsity of the precision matrix (inverse covariance) directly corresponds to conditional independences, making GGMs ideal for high-dimensional data analysis in genomics and finance.

A prominent example of a hybrid model is the conditional random field (CRF), proposed by Lafferty, McCallum, and Pereira in 2001, which combines undirected graphical structures for output dependencies with directed conditioning on input observations, particularly suited for sequence labeling tasks like part-of-speech tagging. CRFs avoid the label bias problem of maximum entropy Markov models by globally normalizing the distribution over label sequences, leveraging clique potentials in the undirected graph to model smooth transitions and contextual features.
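The GGM correspondence between precision-matrix zeros and conditional independence can be verified on a tiny example. The sketch below assumes a hypothetical chain X1 - X2 - X3 with a hand-picked precision matrix K (the 3x3 inverse is written out by hand to keep the example dependency-free); K[0][2] = 0 encodes X1 ⊥ X3 | X2, even though the covariance is dense:

```python
# Gaussian graphical model on a chain X1 - X2 - X3: the precision matrix
# K = Sigma^{-1} has K[0][2] = 0 (no edge), yet the covariance Sigma is dense,
# so X1 and X3 are marginally dependent but conditionally independent given X2.

K = [[ 2.0, -1.0,  0.0],     # precision (inverse covariance) of the chain
     [-1.0,  2.0, -1.0],
     [ 0.0, -1.0,  2.0]]

def inv3(m):
    """Explicit 3x3 matrix inverse via the adjugate formula."""
    a, b, c = m[0]; d, e, f = m[1]; g, h, i = m[2]
    det = a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g)
    return [[(e*i - f*h)/det, (c*h - b*i)/det, (b*f - c*e)/det],
            [(f*g - d*i)/det, (a*i - c*g)/det, (c*d - a*f)/det],
            [(d*h - e*g)/det, (b*g - a*h)/det, (a*e - b*d)/det]]

Sigma = inv3(K)
# Partial correlation of X1 and X3 given X2 comes straight from K:
rho_13_given_2 = -K[0][2] / (K[0][0] * K[2][2]) ** 0.5

print(round(Sigma[0][2], 3))   # 0.25, nonzero: marginally dependent
print(rho_13_given_2)          # 0.0: conditionally independent given X2
```

This is exactly why sparse precision estimation (e.g. the graphical lasso) recovers graph structure in GGMs.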

Probabilistic Foundations

Conditional Independence

Conditional independence is a fundamental concept in probability theory that plays a central role in graphical models. Two sets of random variables X and Y are conditionally independent given a third set Z, denoted X \perp Y \mid Z, if and only if their joint conditional distribution factorizes as P(X, Y \mid Z) = P(X \mid Z) P(Y \mid Z) for all values of Z. This property simplifies probabilistic computations by allowing the decomposition of complex joint distributions into smaller, more manageable factors, reducing the dimensionality of inference tasks in high-dimensional spaces. In graphical models, conditional independence is encoded directly by the graph structure, enabling efficient representation and manipulation of dependencies without explicit enumeration of all probabilistic relations.

In directed graphical models, such as Bayesian networks, conditional independence is determined by the d-separation criterion, which assesses whether paths between nodes in sets X and Y are blocked by evidence in Z. A path is blocked by Z if it contains either a chain or fork (arrows meeting head-to-tail or tail-to-tail) at a node in Z, or a collider (arrows meeting head-to-head) at a node that is neither in Z nor has a descendant in Z. If every path between X and Y is blocked, the sets are d-separated and X \perp Y \mid Z holds for any distribution faithful to the graph; this criterion was formalized by Pearl to capture all independencies implied by the directed acyclic graph structure. For undirected graphical models, or Markov random fields, the simpler separation criterion applies: sets X and Y are conditionally independent given Z if Z separates X from Y in the graph, meaning no path connects X and Y after removing the nodes in Z and their incident edges. These graphical criteria have key implications for model representation and conversion.
In directed models, d-separation enables the joint probability distribution to factorize according to the graph, as P(\mathbf{x}) = \prod_i P(x_i \mid \mathbf{pa}_i), where \mathbf{pa}_i are the parents of x_i, directly tying independencies to computational efficiency. Moralization, a process to relate directed and undirected representations, involves adding undirected edges between all pairs of parents of common children and then dropping arrow directions, preserving the separation properties while allowing inference via undirected methods. This equivalence highlights how conditional independence structures support factorization across model types, though additional independencies may exist beyond those strictly encoded.

An illustrative example occurs in a Bayesian network modeling vehicle failure, with nodes for battery status B, fuel level F, and gauge reading G (where B \to G \leftarrow F). Marginally, B \perp F, because the collider at G blocks the path; however, observing G = 0 (low gauge) induces dependence, as B \not\perp F \mid G = 0, exemplifying "explaining away," where evidence on the common child correlates the independent causes. Such dynamics underscore how graph structure reveals nuanced independence relations, informing the decomposition of joint distributions into local conditionals.
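The explaining-away effect can be computed numerically. In the sketch below the CPT values for the B -> G <- F collider are invented for the demo (a noisy-AND gauge); conditioning on a low gauge raises the probability of a bad battery, and then learning the tank is empty lowers it again:

```python
# Numeric illustration of "explaining away" in the collider B -> G <- F;
# all CPT values are made up for this demo.
import itertools

p_B = {1: 0.9, 0: 0.1}   # B = 1: battery ok
p_F = {1: 0.9, 0: 0.1}   # F = 1: fuel present
# Gauge shows fuel (G = 1) mostly when battery ok AND fuel present (noisy AND).
p_G = {(1, 1): {1: 0.8, 0: 0.2},
       (1, 0): {1: 0.2, 0: 0.8},
       (0, 1): {1: 0.2, 0: 0.8},
       (0, 0): {1: 0.1, 0: 0.9}}

def joint(b, f, g):
    return p_B[b] * p_F[f] * p_G[(b, f)][g]

def cond_B0(evidence):
    """P(B = 0 | evidence), evidence a dict with keys in {'f', 'g'}."""
    num = den = 0.0
    for b, f, g in itertools.product([0, 1], repeat=3):
        if any((k == 'f' and f != v) or (k == 'g' and g != v)
               for k, v in evidence.items()):
            continue
        p = joint(b, f, g)
        den += p
        if b == 0:
            num += p
    return num / den

print(round(cond_B0({}), 3))                 # 0.1, the prior P(B = 0)
print(round(cond_B0({'g': 0}), 3))           # 0.257, raised by the low gauge
print(round(cond_B0({'g': 0, 'f': 0}), 3))   # 0.111, empty tank explains it away
```

Marginally, conditioning on F alone leaves P(B = 0) at its prior, exactly as d-separation predicts for the unobserved collider.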

Joint Probability Distributions

Graphical models provide a framework for representing multivariate joint probability distributions P(\mathbf{X}), where \mathbf{X} = (X_1, \dots, X_n) denotes a set of random variables, by decomposing the distribution into local factors that correspond to the graph's structure. This decomposition exploits conditional independencies encoded in the graph to reduce the computational complexity of specifying and manipulating the full joint distribution, which would otherwise require storing a number of probabilities exponential in the number of variables. For graphs with low treewidth—a measure of the graph's "width" when decomposed into a tree—the joint can be represented with storage and computation polynomial in the treewidth rather than exponential in n.

In directed graphical models, also known as Bayesian networks, the joint distribution factorizes according to the factorization theorem: P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i)), where \mathrm{Pa}(X_i) are the parents of X_i in the directed acyclic graph (DAG). This product of conditional probabilities reflects the Markov assumption that each variable is independent of its non-descendants given its parents. Conversely, in undirected graphical models, or Markov random fields, the joint distribution factorizes as P(\mathbf{X}) = \frac{1}{Z} \prod_{c \in C} \psi_c(\mathbf{X}_c), where C is the set of cliques in the undirected graph, \psi_c are non-negative potential functions over the variables in clique c, and Z = \sum_{\mathbf{X}} \prod_{c \in C} \psi_c(\mathbf{X}_c) is the normalization constant, or partition function, ensuring the distribution sums to 1.

The role of normalization differs markedly between model types. In directed models, the factorization directly yields a normalized joint distribution, as each conditional P(X_i \mid \mathrm{Pa}(X_i)) is inherently normalized over X_i.
In undirected models, however, the partition function Z must be computed to normalize the unnormalized measure \tilde{P}(\mathbf{X}) = \prod_{c \in C} \psi_c(\mathbf{X}_c); this is tractable exactly when the graph is a tree, where Z can be calculated via belief propagation in time linear in the number of edges. For general graphs, computing Z is #P-hard, motivating approximations in complex structures.

An illustrative example of how graph structure influences the joint distribution is the V-structure, or common-effect configuration, consisting of variables A, B, and C with directed edges A \to C \leftarrow B. Here, A and B are independent in the joint distribution—i.e., P(A, B) = P(A)P(B)—despite lacking a direct edge, but become dependent when conditioning on C, as captured by P(A, B \mid C) \neq P(A \mid C) P(B \mid C). This non-intuitive dependence arises solely from the shared child C, highlighting how the graph encodes subtle probabilistic relationships in the factorization without explicit edges for all marginal dependencies.
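The claim that Z is linear-time on trees can be demonstrated on a chain, the simplest tree. The sketch below (with a hypothetical shared pairwise potential) computes Z both by exponential brute force and by passing a running message along the chain, eliminating one variable at a time:

```python
# Partition function of a binary chain MRF: brute force (2^n terms) versus
# variable elimination along the chain (linear in n). Potential is illustrative.
import itertools, math

n = 12
def psi(a, b):                       # shared pairwise potential on each edge
    return math.exp(0.5 if a == b else -0.5)

# Brute force: sum the unnormalized product over all 2^n configurations.
Z_brute = sum(math.prod(psi(x[i], x[i + 1]) for i in range(n - 1))
              for x in itertools.product([0, 1], repeat=n))

# Elimination: fold in one variable at a time, carrying a 2-entry message.
msg = {s: 1.0 for s in (0, 1)}       # message from the eliminated prefix
for _ in range(n - 1):
    msg = {t: sum(msg[s] * psi(s, t) for s in (0, 1)) for t in (0, 1)}
Z_elim = sum(msg.values())

print(abs(Z_brute - Z_elim) < 1e-6 * Z_brute)  # True: identical up to rounding
```

The elimination pass is the one-directional special case of belief propagation mentioned above.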

Inference Methods

Exact Inference Algorithms

Exact inference algorithms in graphical models compute precise probabilistic quantities, such as marginal distributions or most probable explanations, by leveraging the conditional independence structure encoded in the graph. These methods avoid the exponential cost of enumerating all possible configurations by systematically reducing the problem through graph manipulations. A core objective is to calculate the marginal probability of a variable X_i, given by

P(X_i) = \sum_{X \setminus X_i} P(X),

where the summation is over all variables except X_i, and P(X) is the joint distribution factorized according to the model. Such computations are exact but often incur high complexity, scaling exponentially with the treewidth of the graph.

Variable elimination is a foundational algorithm for performing exact inference by sequentially eliminating variables from the joint distribution. The process begins by representing the joint probability as a product of factors; then, for each variable to eliminate, it collects all factors involving that variable, multiplies them into a new factor, and sums over the variable's states to produce an intermediate factor over the remaining variables. The order of elimination affects efficiency; an optimal ordering minimizes the size of intermediate factors, with time and space complexity bounded exponentially by the induced width—the size of the largest clique formed during elimination. This approach, applicable to both directed and undirected models, was formalized as a simple, general method for Bayesian network computations by Zhang and Poole (1994).

Belief propagation, also known as the sum-product algorithm, enables exact inference through message passing on acyclic graphs. In tree-structured models, messages representing partial marginals are propagated from leaf nodes toward a root (inward pass) and then from root to leaves (outward pass), yielding exact marginals for all variables in time linear in the number of edges.
Each message from node u to v is computed from the incoming messages and local potentials; in loopy variants, consistency is ensured only approximately, via the Bethe approximation, and the algorithm is exact only on trees. The algorithm originated with Pearl's development of propagation rules for belief networks (1986).

The junction tree algorithm extends message passing to general graphs by first converting the model into an equivalent tree structure. This involves moralizing the directed graph—undirecting edges and adding links between co-parents—followed by triangulation to eliminate cycles, forming maximal cliques. These cliques become nodes in a junction tree, connected by separators that satisfy the running intersection property (every variable shared by two cliques appears in every clique on the path between them) and the Markov property for conditional independence. Inference proceeds via belief propagation on this tree, with messages as clique marginals calibrated through absorption and propagation phases. Introduced by Lauritzen and Spiegelhalter (1988), this method unifies exact inference across model types.

Despite their precision, exact inference algorithms face severe limitations in dense or high-dimensional graphs, where the treewidth can lead to exponential growth in computational resources, often rendering them intractable beyond modest problem sizes.
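Variable elimination is easy to see on a three-node chain. The sketch below (a hypothetical A -> B -> C network with invented CPTs) computes P(C) by eliminating A and then B, and checks the result against full enumeration:

```python
# Variable elimination on the chain A -> B -> C: marginal P(C) via
# intermediate factors tau1 and tau2, checked by brute force. CPTs are invented.
import itertools

p_A = {0: 0.3, 1: 0.7}
p_B_given_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
p_C_given_B = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}

# Eliminate A: tau1(b) = sum_a P(a) P(b | a)
tau1 = {b: sum(p_A[a] * p_B_given_A[a][b] for a in (0, 1)) for b in (0, 1)}
# Eliminate B: tau2(c) = sum_b tau1(b) P(c | b)
tau2 = {c: sum(tau1[b] * p_C_given_B[b][c] for b in (0, 1)) for c in (0, 1)}

# Brute-force check against the full joint.
brute = {c: sum(p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]
                for a, b in itertools.product((0, 1), repeat=2))
         for c in (0, 1)}

print(all(abs(tau2[c] - brute[c]) < 1e-12 for c in (0, 1)))  # True
```

Each intermediate factor here has only two entries; on denser graphs the intermediate factors grow with the induced width, which is exactly where the exponential cost comes from.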

Approximate Inference Techniques

Approximate inference techniques are essential for graphical models where exact methods, such as variable elimination or junction tree algorithms, become computationally infeasible due to high dimensionality or graph cycles. These methods yield heuristic approximations to posterior distributions or marginal probabilities, enabling scalable inference at the cost of potential bias or variance. They are widely used in applications like computer vision and natural language processing, where models involve thousands of variables.

Sampling-based methods, including Markov chain Monte Carlo (MCMC) and importance sampling, estimate expectations by generating samples from distributions proportional to the target posterior. MCMC techniques, such as Gibbs sampling, construct a Markov chain that converges in distribution to the target by iteratively sampling each variable conditioned on the others; this approach is particularly suited to undirected graphical models like Markov random fields. Gibbs sampling was originally proposed for Bayesian image restoration, leveraging Gibbs distributions to handle spatial dependencies in pixel labels. Importance sampling, in contrast, draws from an easily sampleable proposal distribution and reweights samples using the ratio of target to proposal densities, providing unbiased estimates but suffering from high variance in complex models.

Variational inference approximates the intractable posterior p(\mathbf{z} \mid \mathbf{x}) with a tractable distribution q(\mathbf{z}), typically factorized for computational efficiency, by minimizing the Kullback-Leibler (KL) divergence \mathrm{KL}(q(\mathbf{z}) \parallel p(\mathbf{z} \mid \mathbf{x})).
This optimization is equivalent to maximizing the evidence lower bound (ELBO):

\mathcal{L}(q) = \mathbb{E}_{q(\mathbf{z})} \left[ \log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right] = \mathbb{E}_{q(\mathbf{z})} [\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q(\mathbf{z})} [\log q(\mathbf{z})],

which lower-bounds the marginal likelihood \log p(\mathbf{x}). In mean-field variational inference, q(\mathbf{z}) = \prod_i q(z_i) assumes variable independence, leading to coordinate ascent updates that iteratively refine each q(z_i).

Loopy belief propagation (LBP) adapts the exact belief propagation algorithm to loopy graphs by iteratively exchanging messages between nodes without resolving cycles, often converging to approximate marginals after fixed-point iteration. Unlike exact methods on tree-structured graphs, LBP provides no convergence guarantees but empirically yields accurate approximations in domains like stereo vision and coding theory.

A representative application is mean-field variational inference in Markov random fields for image segmentation, where the posterior over pixel labels is approximated by a product distribution, balancing observed image data with pairwise smoothness potentials to infer object boundaries efficiently.
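Gibbs sampling can be demonstrated end to end on a tiny model. The sketch below (a hypothetical A -> B -> C chain with invented CPTs) resamples each variable from its full conditional, which is proportional to the joint with that variable varied, and compares the Monte Carlo estimate of P(C = 1) with the exact value:

```python
# Gibbs sampling for the chain A -> B -> C: each variable is resampled from
# its conditional given the others; the sample average approximates P(C = 1).
import random

random.seed(0)
p_A = {0: 0.3, 1: 0.7}
p_B_given_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
p_C_given_B = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}

def joint(a, b, c):
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

def resample(idx, state):
    """Sample state[idx] from its full conditional (proportional to the joint)."""
    weights = []
    for v in (0, 1):
        s = list(state); s[idx] = v
        weights.append(joint(*s))
    state[idx] = random.choices((0, 1), weights=weights)[0]

state = [0, 0, 0]
count_c1, n_samples, burn_in = 0, 20000, 1000
for t in range(n_samples + burn_in):
    for idx in range(3):
        resample(idx, state)
    if t >= burn_in and state[2] == 1:
        count_c1 += 1

exact = sum(joint(a, b, 1) for a in (0, 1) for b in (0, 1))
print(round(count_c1 / n_samples, 3))  # compare with the exact value 0.4475
```

On this small chain exact inference is trivial, which is what makes the accuracy of the sampler checkable; in real use Gibbs sampling is reserved for models where exact marginals are out of reach.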

Learning Graphical Models

Parameter Estimation

Parameter estimation in graphical models involves learning the numerical parameters of a model with a fixed structure from observed data, such as conditional probability tables (CPTs) in directed models or potential functions in undirected models. This process is crucial for tailoring the model to empirical evidence while preserving the graphical representation of dependencies. Assuming the graph structure is known, estimation methods focus on optimizing parameters to best explain the data, often balancing fit and generalization.

For complete data, maximum likelihood estimation (MLE) provides a straightforward approach, yielding closed-form solutions based on empirical frequencies. In Bayesian networks, for a variable X_i with parents \mathbf{Pa}(X_i), the MLE for each CPT entry P(X_i = x_i \mid \mathbf{Pa}(X_i) = \mathbf{pa}_i) is the relative frequency \frac{N_{x_i, \mathbf{pa}_i}}{N_{\mathbf{pa}_i}}, where N_{x_i, \mathbf{pa}_i} is the count of instances with that combination and N_{\mathbf{pa}_i} is the count of the parent configuration. This method maximizes the likelihood P(D \mid \theta) = \prod_{n=1}^{N} P(\mathbf{x}^{(n)} \mid \theta), where D is the dataset and \theta the parameters. In undirected graphical models, MLE for clique potentials similarly uses normalized counts, though it may require iterative optimization for complex interactions.

When data is incomplete—common in real-world scenarios with missing observations—the expectation-maximization (EM) algorithm extends MLE by iteratively imputing expectations over latent variables. The EM process alternates between an E-step, computing the expected log-likelihood Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z \mid D, \theta^{(t)}} [\log P(D, Z \mid \theta)], where Z are the hidden variables, and an M-step, maximizing Q to update \theta^{(t+1)}.
This converges to a local maximum of the observed-data likelihood, as proven in the seminal EM framework. For graphical models, inference algorithms (e.g., belief propagation) are embedded in the E-step to compute these expectations efficiently.

Bayesian estimation offers a principled alternative by incorporating prior knowledge, treating parameters as random variables and computing posteriors. For CPTs in Bayesian networks, Dirichlet priors are conjugate and commonly used: for each P(X_i \mid \mathbf{Pa}(X_i)), a Dirichlet prior \mathrm{Dir}(\alpha_{x_i, \mathbf{pa}_i}) yields a posterior that is also Dirichlet, with MAP estimates \hat{P}(X_i = x_i \mid \mathbf{pa}_i) = \frac{N_{x_i, \mathbf{pa}_i} + \alpha_{x_i, \mathbf{pa}_i} - 1}{N_{\mathbf{pa}_i} + \sum_{x_i'} \alpha_{x_i', \mathbf{pa}_i} - |X_i|}. The posterior is P(\theta \mid D) \propto P(D \mid \theta) P(\theta), enabling full Bayesian inference via sampling or variational methods. This approach mitigates overfitting, especially in sparse data regimes.

Challenges in parameter estimation include overfitting due to high-dimensional parameter spaces and sparse observations, particularly when some parent configurations are rare. Regularization techniques, such as pseudocounts in Dirichlet priors (e.g., uniform \alpha = 1 for Laplace smoothing), add small imaginary samples to the counts, improving robustness: \hat{P} = \frac{N + \lambda}{N + k\lambda}, where \lambda > 0 and k is the number of outcomes. These methods draw from early work on smoothing in probabilistic models and remain standard for graphical models with limited data.
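The counting-based estimators above fit in a few lines. The sketch below uses a toy made-up dataset of (parent, child) observations and contrasts raw relative frequencies with Dirichlet pseudocounts (Laplace smoothing when alpha = 1):

```python
# CPT estimation by counting, with optional pseudocounts; the dataset is a
# toy made-up sample of binary (parent, child) pairs.
from collections import Counter

data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (1, 0)]

def estimate_cpt(data, alpha=0.0, k=2):
    """P(child | parent) from counts; alpha > 0 adds Dirichlet pseudocounts."""
    pair = Counter(data)
    parent = Counter(p for p, _ in data)
    return {p: {c: (pair[(p, c)] + alpha) / (parent[p] + k * alpha)
                for c in range(k)}
            for p in range(k)}

mle = estimate_cpt(data)                # raw relative frequencies
smooth = estimate_cpt(data, alpha=1.0)  # Laplace smoothing
print(round(mle[0][1], 4))     # 1/3: P(child=1 | parent=0)
print(round(smooth[0][1], 4))  # (1+1)/(3+2) = 0.4
```

With smoothing, no CPT entry can be exactly zero, which protects downstream inference from rare parent configurations that happen to be absent from the sample.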

Structure Discovery

Structure discovery in graphical models involves inferring the underlying graph topology from observational data, a process essential for capturing conditional dependencies without prior knowledge of the structure. This task is computationally challenging due to the combinatorial explosion of possible graphs, particularly for high-dimensional data, and is typically approached through two main paradigms: score-based and constraint-based methods. Score-based methods evaluate candidate structures using a scoring function that balances data fit and model complexity, while constraint-based methods rely on statistical tests for conditional independence to prune edges and orient them. These approaches assume the data are generated from a faithful graphical model, where the graph encodes all independencies in the joint distribution. Score-based structure learning treats the problem as an optimization task, searching over possible directed acyclic graphs (DAGs) to maximize a decomposable score that can be computed locally for substructures. Decomposable scores, such as the Bayesian Information Criterion (BIC) and the Bayesian Dirichlet equivalent uniform (BDeu) score, are widely used because they allow efficient hill-climbing or greedy searches. The BIC score, derived from asymptotic approximations to the marginal likelihood, is defined as $\mathrm{BIC} = \log L - \frac{k}{2} \log n$, where $L$ is the likelihood of the data given the model, $k$ is the number of parameters, and $n$ is the sample size; this penalizes overly complex models to prevent overfitting. The BDeu score, a Bayesian alternative, incorporates prior knowledge via an equivalent sample size and Dirichlet priors, making it suitable for small datasets by smoothing parameter estimates.
Search strategies range from exhaustive enumeration for small variable sets to heuristic methods like the K2 algorithm, which performs a greedy local search starting from an empty graph and ordering variables to add parents incrementally based on score improvements. The K2 algorithm, while efficient, can get trapped in local optima but has been foundational for scalable learning in Bayesian networks. Constraint-based methods, in contrast, build the graph skeleton by iteratively testing for conditional independencies between variables, eliminating edges where independence holds given a conditioning set. The PC algorithm exemplifies this approach, using statistical tests such as the χ² test for discrete data or partial correlation for continuous data to identify direct dependencies. It begins with a complete undirected graph, removes edges based on unconditional and conditional independence tests (increasing the conditioning set size gradually), and then orients edges using v-structures and collider rules to form a DAG, often augmented with additional propagation rules for completeness. This method is particularly effective in high dimensions but requires careful selection of significance levels to avoid errors from multiple testing. Despite their strengths, both paradigms face significant challenges. Computational cost escalates rapidly with the number of variables, as exhaustive searches are NP-hard and even greedy or constraint-based methods scale poorly beyond moderate dimensions without approximations. Moreover, these algorithms rely on the faithfulness assumption, which posits that every conditional independence in the distribution is reflected by a d-separation in the graph (and vice versa), an idealization that may not hold for real-world data exhibiting near-independencies or latent confounders. 
Violations of faithfulness can lead to incomplete or incorrect structures, underscoring the need for robustness checks and hybrid methods combining scores and constraints.
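The decomposability of the BIC score is what makes local search practical: each family (a variable plus its parents) contributes independently. The sketch below scores a discrete Bayesian network under the assumption of complete data and MLE plug-in parameters; the structure encoding and names are illustrative:

```python
import math
from collections import Counter

def bic_score(data, structure, arities):
    """BIC = log L - (k/2) log n for a discrete Bayesian network.

    structure: dict mapping each variable to a tuple of its parents.
    arities:   dict mapping each variable to its number of states.
    """
    n = len(data)
    score = 0.0
    for var, parents in structure.items():
        joint, marg = Counter(), Counter()
        for row in data:
            pa = tuple(row[p] for p in parents)
            joint[(row[var], pa)] += 1
            marg[pa] += 1
        # Log-likelihood contribution of this family (MLE plug-in)
        for (x, pa), c in joint.items():
            score += c * math.log(c / marg[pa])
        # Penalty: number of free parameters in this CPT
        q = 1
        for p in parents:
            q *= arities[p]
        score -= 0.5 * (arities[var] - 1) * q * math.log(n)
    return score

data = [{"a": 0, "b": 0}, {"a": 0, "b": 0}, {"a": 1, "b": 1}, {"a": 1, "b": 0}]
arities = {"a": 2, "b": 2}
# Compare the empty graph against the single-edge graph a -> b:
s_empty = bic_score(data, {"a": (), "b": ()}, arities)
s_edge = bic_score(data, {"a": (), "b": ("a",)}, arities)
```

A greedy search would keep the edge here only if `s_edge` exceeds `s_empty`, which is exactly the fit-versus-complexity trade-off the penalty term controls.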

Applications

In Machine Learning and AI

Graphical models play a pivotal role in machine learning and artificial intelligence, particularly for tasks involving probabilistic reasoning, uncertainty quantification, and prediction under partial observability. In causal inference, directed acyclic graphs (DAGs) enable the identification of intervention effects through do-calculus, which formalizes rules for adjusting probabilities under hypothetical interventions without requiring experimental data. This framework, introduced by Judea Pearl, allows for answering "what-if" queries by manipulating graph structures to compute interventional distributions, such as P(Y|do(X)), distinguishing causal from associational effects. For instance, in AI systems for decision-making, DAGs facilitate counterfactual reasoning, estimating outcomes in alternate scenarios to support robust policy evaluation in domains like healthcare and economics. Hidden Markov models (HMMs), a type of dynamic Bayesian network, are widely used for sequence prediction in machine learning, modeling temporal dependencies through hidden states and observable emissions. Seminal work by Baum and colleagues established the expectation-maximization (Baum-Welch) algorithm for parameter learning in HMMs, enabling unsupervised training from sequential data. In natural language processing (NLP), HMMs underpin part-of-speech tagging, where hidden states represent syntactic categories and observations are words, achieving high accuracy on benchmarks such as the Penn Treebank corpus by leveraging Viterbi decoding for inference. In robotics, HMMs support activity recognition and motion planning; for example, they model robot trajectories as state sequences to predict and adapt to environmental changes, as demonstrated in learning from demonstrations for manipulation tasks.
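Viterbi decoding, as used in tagging, can be sketched as a log-space dynamic program over hidden states. The toy tags, words, and probabilities below are invented for illustration:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an HMM (log-space Viterbi)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor for state s at time t
            prev = max(states, key=lambda r: V[t - 1][r] + math.log(trans_p[r][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back the highest-scoring path
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy POS-style tagging: hidden tags, observed words
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.6, "bark": 0.4}, "VERB": {"dogs": 0.1, "bark": 0.9}}
print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))
# → ['NOUN', 'VERB']
```

Working in log space avoids the numerical underflow that products of many small probabilities would otherwise cause on long sequences.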
Gaussian processes (GPs) extend graphical models to non-parametric regression, interpretable as infinite factor graphs where the covariance function defines dependencies across an infinite set of latent variables. This perspective, rooted in Bayesian nonparametrics, views GPs as limits of finite-dimensional Gaussian Markov random fields, allowing scalable approximations via inducing points for large datasets. In machine learning, GPs excel in regression tasks requiring uncertainty estimates, such as hyperparameter optimization in AI pipelines (e.g., Bayesian optimization for neural network tuning), outperforming parametric models in small-data regimes by providing principled confidence intervals. The seminal formulation in Rasmussen and Williams highlights how GPs unify kernel methods with probabilistic inference, facilitating applications like spatial prediction in reinforcement learning environments. A prominent example of graphical models in machine learning is the Naive Bayes classifier, a simple Bayesian network assuming conditional independence among features given the class label, forming a star-shaped graph with the class node as the root. This model is particularly effective for text classification, where features are word counts or TF-IDF vectors, enabling efficient spam detection and sentiment analysis via maximum a posteriori estimation. Empirical studies show Naive Bayes rivals more complex models on high-dimensional text data, such as the 20 Newsgroups dataset, due to its robustness to irrelevant features and linear inference time. Its graphical structure allows easy extension to incorporate dependencies, bridging to more general Bayesian networks in AI classification pipelines.
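The Naive Bayes factorization described above can be sketched as a multinomial text classifier with Laplace smoothing; the toy documents, labels, and helper names are illustrative:

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Multinomial Naive Bayes with Laplace smoothing (alpha pseudocounts)."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    vocab = set()
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
        vocab.update(doc)
    totals = {c: sum(word_counts[c].values()) for c in classes}

    def predict(doc):
        def score(c):  # log P(c) + sum of log P(word | c)
            s = math.log(prior[c])
            for w in doc:
                s += math.log((word_counts[c][w] + alpha)
                              / (totals[c] + alpha * len(vocab)))
            return s
        return max(classes, key=score)
    return predict

docs = [["win", "cash", "now"], ["meeting", "at", "noon"],
        ["cash", "prize", "now"], ["lunch", "at", "noon"]]
labels = ["spam", "ham", "spam", "ham"]
predict = train_nb(docs, labels)
print(predict(["cash", "now"]))  # → spam
```

Because the class node is the sole parent of every feature, both training (counting) and prediction (a sum of log-probabilities) run in time linear in the document length, which is the source of the model's efficiency on high-dimensional text.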

In Other Domains

Graphical models have found extensive applications in bioinformatics, where they model complex biological systems involving uncertainty and dependencies. In gene regulatory networks, Bayesian networks are widely used to infer regulatory relationships among genes from expression data, capturing probabilistic dependencies that represent how transcription factors influence gene activity. For instance, a seminal application involved reconstructing Escherichia coli's regulatory network using Bayesian network structure learning from microarray data, achieving high accuracy in predicting known interactions. Similarly, Markov random fields (MRFs) are employed in protein structure prediction to model spatial dependencies between amino acid residues, facilitating the inference of three-dimensional conformations from sequence information. This approach has been pivotal in methods like those integrating MRFs with energy minimization for folding simulations, improving predictions over independent residue models. In epidemiology, dynamic Bayesian networks extend static models to capture temporal evolution, particularly for simulating disease spread in populations. These networks represent variables such as infection status, contact rates, and intervention effects over time, enabling probabilistic forecasting of outbreaks. A notable example is their use in modeling the spread of infectious diseases like influenza, where hidden Markov models—a special case of dynamic Bayesian networks—are adapted to estimate transmission parameters from surveillance data, aiding in policy decisions. Such models have demonstrated effectiveness in retrospective analyses of epidemics, quantifying uncertainty in parameters like basic reproduction numbers. Computer vision leverages conditional random fields (CRFs) for scene understanding, particularly in object segmentation tasks where pixel-level labels depend on contextual cues. 
CRFs model the joint distribution of image labels and observations, incorporating unary potentials for individual pixels and pairwise potentials for neighboring interactions, which helps resolve ambiguities in cluttered scenes. This has been applied in semantic segmentation pipelines, such as those combining CRFs with deep convolutional networks, achieving state-of-the-art performance on benchmarks like the PASCAL VOC dataset by refining coarse predictions through structured inference. In evolutionary biology, phylogenetic trees are represented as directed graphical models to infer evolutionary relationships among species from genetic sequences. These models, often Bayesian networks or hidden Markov models, account for substitution probabilities along branches, enabling the reconstruction of ancestral states and divergence times. For example, Bayesian phylogenetic inference using Markov chain Monte Carlo sampling has revolutionized the field by providing posterior distributions over tree topologies, as seen in analyses of primate evolution that incorporate rate variation across sites.
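The unary-plus-pairwise decomposition used in segmentation CRFs can be illustrated by scoring candidate labelings of a tiny grid with a Potts pairwise penalty; all costs and weights here are invented, and lower energy corresponds to higher probability under the model:

```python
def crf_energy(labels, unary, beta=1.0):
    """Energy of a labeling on a 2D grid: sum of unary costs plus a
    Potts penalty beta for each pair of disagreeing neighbors."""
    h, w = len(labels), len(labels[0])
    e = 0.0
    for i in range(h):
        for j in range(w):
            e += unary[i][j][labels[i][j]]      # data term for this pixel
            if i + 1 < h and labels[i][j] != labels[i + 1][j]:
                e += beta                       # vertical neighbors disagree
            if j + 1 < w and labels[i][j] != labels[i][j + 1]:
                e += beta                       # horizontal neighbors disagree
    return e

# 2x2 image, two labels; unary[i][j][k] is the cost of label k at pixel (i, j)
unary = [[{0: 0.1, 1: 2.0}, {0: 0.2, 1: 1.5}],
         [{0: 1.8, 1: 0.3}, {0: 0.1, 1: 1.9}]]
smooth = [[0, 0], [0, 0]]
noisy = [[0, 1], [1, 0]]
print(crf_energy(smooth, unary), crf_energy(noisy, unary))
```

Structured inference in a segmentation pipeline amounts to searching for the labeling that minimizes this energy, so the pairwise term actively discourages the checkerboard-style label noise that per-pixel classifiers produce.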

History and Developments

The origins of probabilistic graphical models trace back to early 20th-century work in probability theory and statistical physics. Undirected graphical models, known as Markov random fields (MRFs), have roots in the Ising model proposed by Ernst Ising in 1925 for modeling ferromagnetism, which described spatial dependencies among particles. Building on Andrey Markov's work on chains in the 1910s, Julian Besag formalized multidimensional Markov random fields in statistics in 1974, enabling applications in spatial statistics and image analysis. Directed graphical models, or Bayesian networks, emerged in the 1980s amid efforts to handle uncertainty in artificial intelligence and expert systems. Judea Pearl introduced the foundational framework in 1988, building on earlier ideas from Sewall Wright's path analysis in genetics (1921) and I. J. Good's work on probability propagation in the 1960s. Pearl's work provided graphical criteria for conditional independence (d-separation) and efficient inference algorithms, revolutionizing probabilistic reasoning. The 1990s saw the unification of these approaches under the umbrella of probabilistic graphical models, with key works by Pearl (1988) and Lauritzen and Spiegelhalter (1988), and later the textbook by Koller and Friedman (2009), consolidating theory, learning, and inference methods. This period also addressed computational challenges through approximations such as loopy belief propagation. Developments in the 2000s and 2010s expanded applications to machine learning, with advances in scalable inference for big data, structure learning, and integration with deep learning. As of 2023, ongoing research focuses on causal inference extensions and hybrid models for domains such as computer vision and bioinformatics.
