Viterbi algorithm
from Wikipedia

The Viterbi algorithm is a dynamic programming algorithm that finds the most likely sequence of hidden events that would explain a sequence of observed events. The result of the algorithm is often called the Viterbi path. It is most commonly used with hidden Markov models (HMMs). For example, if a doctor observes a patient's symptoms over several days (the observed events), the Viterbi algorithm could determine the most probable sequence of underlying health conditions (the hidden events) that caused those symptoms.

The algorithm has found universal application in decoding the convolutional codes used in both CDMA and GSM digital cellular, dial-up modems, satellite, deep-space communications, and 802.11 wireless LANs. It is also commonly used in speech recognition, speech synthesis, diarization,[1] keyword spotting, computational linguistics, and bioinformatics. For instance, in speech-to-text (speech recognition), the acoustic signal is the observed sequence, and a string of text is the "hidden cause" of that signal. The Viterbi algorithm finds the most likely string of text given the acoustic signal.

History


The Viterbi algorithm is named after Andrew Viterbi, who proposed it in 1967 as a decoding algorithm for convolutional codes over noisy digital communication links.[2] It has, however, a history of multiple invention, with at least seven independent discoveries, including those by Viterbi, Needleman and Wunsch, and Wagner and Fischer.[3] It was introduced to natural language processing as a method of part-of-speech tagging as early as 1987.

Viterbi path and Viterbi algorithm have become standard terms for the application of dynamic programming algorithms to maximization problems involving probabilities.[3] For example, in statistical parsing a dynamic programming algorithm can be used to discover the single most likely context-free derivation (parse) of a string, which is commonly called the "Viterbi parse".[4][5][6] Another application is in target tracking, where the track is computed that assigns a maximum likelihood to a sequence of observations.[7]

Algorithm


Given a hidden Markov model with a set of hidden states $S$ and a sequence of $T$ observations $o_0, o_1, \dots, o_{T-1}$, the Viterbi algorithm finds the most likely sequence of states that could have produced those observations. At each time step $t$, the algorithm solves the subproblem where only the observations up to $o_t$ are considered.

Two matrices of size $T \times |S|$ are constructed:

  • $P_{t,s}$ contains the maximum probability of ending up at state $s$ at observation $t$, out of all possible sequences of states leading up to it.
  • $Q_{t,s}$ tracks the previous state that was used before $s$ in this maximum probability state sequence.

Let $\pi_s$ and $a_{r,s}$ be the initial and transition probabilities respectively, and let $b_{s,o}$ be the probability of observing $o$ at state $s$. Then the values of $P$ are given by the recurrence relation[8]

$$P_{t,s} = \begin{cases} \pi_s \cdot b_{s,o_t} & \text{if } t = 0, \\ \max_{r \in S} \left( P_{t-1,r} \cdot a_{r,s} \cdot b_{s,o_t} \right) & \text{if } t > 0. \end{cases}$$

The formula for $Q_{t,s}$ is identical for $t > 0$, except that $\max$ is replaced with $\arg\max$, and $Q_{0,s} = 0$. The Viterbi path can be found by selecting the maximum of $P$ at the final timestep, and following $Q$ in reverse.

Pseudocode

function Viterbi(states, init, trans, emit, obs) is
    input states: S hidden states
    input init: initial probabilities of each state
    input trans: S × S transition matrix
    input emit: S × O emission matrix
    input obs: sequence of T observations

    prob ← T × S matrix of zeroes
    prev ← empty T × S matrix
    for each state s in states do
        prob[0][s] ← init[s] * emit[s][obs[0]]

    for t = 1 to T - 1 inclusive do // t = 0 has been dealt with already
        for each state s in states do
            for each state r in states do
                new_prob ← prob[t - 1][r] * trans[r][s] * emit[s][obs[t]]
                if new_prob > prob[t][s] then
                    prob[t][s] ← new_prob
                    prev[t][s] ← r

    path ← empty array of length T
    path[T - 1] ← the state s with maximum prob[T - 1][s]
    for t = T - 2 to 0 inclusive do
        path[t] ← prev[t + 1][path[t + 1]]

    return path
end

The time complexity of the algorithm is $O(T \times |S|^2)$. If it is known which state transitions have non-zero probability, an improved bound can be found by iterating over only those $r$ which link to $s$ in the inner loop. Then using amortized analysis one can show that the complexity is $O(T \times (|S| + |E|))$, where $E$ is the number of edges in the graph, i.e. the number of non-zero entries in the transition matrix.
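The pseudocode translates directly into Python. The following is a minimal sketch (not part of the original article): it keeps the same variable names and expects the parameters as dictionaries keyed by state, matching the layout used in the example below.

def viterbi(states, init, trans, emit, obs):
    """Return the most likely state sequence for the given observations.

    states: list of hidden states
    init[s]: initial probability of state s
    trans[r][s]: probability of transitioning from state r to state s
    emit[s][o]: probability of observing o while in state s
    obs: list of T observations
    """
    T = len(obs)
    prob = [{s: 0.0 for s in states} for _ in range(T)]
    prev = [{s: None for s in states} for _ in range(T)]

    for s in states:
        prob[0][s] = init[s] * emit[s][obs[0]]

    for t in range(1, T):
        for s in states:
            for r in states:
                new_prob = prob[t - 1][r] * trans[r][s] * emit[s][obs[t]]
                if new_prob > prob[t][s]:
                    prob[t][s] = new_prob
                    prev[t][s] = r

    # Backtrack from the most probable final state.
    path = [None] * T
    path[T - 1] = max(states, key=lambda s: prob[T - 1][s])
    for t in range(T - 2, -1, -1):
        path[t] = prev[t + 1][path[t + 1]]
    return path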

Example


A doctor wishes to determine whether patients are healthy or have a fever. The only information the doctor can obtain is by asking patients how they feel. The patients may report that they either feel normal, dizzy, or cold.

It is believed that the health condition of the patients operates as a discrete Markov chain. There are two states, "healthy" and "fever", but the doctor cannot observe them directly; they are hidden from the doctor. On each day, the chance that a patient tells the doctor "I feel normal", "I feel cold", or "I feel dizzy", depends only on the patient's health condition on that day.

The observations (normal, cold, dizzy) along with the hidden states (healthy, fever) form a hidden Markov model (HMM). From past experience, the probabilities of this model have been estimated as:

init = {"Healthy": 0.6, "Fever": 0.4}
trans = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

In this code, init represents the doctor's belief about how likely the patient is to be healthy initially. Note that the particular probability distribution used here is not the equilibrium one, which would be {'Healthy': 0.57, 'Fever': 0.43} according to the transition probabilities. The transition probabilities trans represent the change of health condition in the underlying Markov chain. In this example, a patient who is healthy today has only a 30% chance of having a fever tomorrow. The emission probabilities emit represent how likely each possible observation (normal, cold, or dizzy) is, given the underlying condition (healthy or fever). A patient who is healthy has a 50% chance of feeling normal; one who has a fever has a 60% chance of feeling dizzy.

Graphical representation of the given HMM

A particular patient visits three days in a row, and reports feeling normal on the first day, cold on the second day, and dizzy on the third day.

Firstly, the probabilities of being healthy or having a fever on the first day are calculated. The probability that a patient will be healthy on the first day and report feeling normal is $0.6 \times 0.5 = 0.3$. Similarly, the probability that a patient will have a fever on the first day and report feeling normal is $0.4 \times 0.1 = 0.04$.

The probabilities for each of the following days can be calculated from the previous day directly. For example, the highest chance of being healthy on the second day and reporting to be cold, following reporting being normal on the first day, is the maximum of $0.3 \times 0.7 \times 0.4 = 0.084$ and $0.04 \times 0.4 \times 0.4 = 0.0064$. This suggests it is more likely that the patient was healthy for both of those days, rather than having a fever and recovering.

The rest of the probabilities are summarised in the following table:

Day 1 2 3
Observation Normal Cold Dizzy
Healthy 0.3 0.084 0.00588
Fever 0.04 0.027 0.01512

From the table, it can be seen that the patient most likely had a fever on the third day. Furthermore, there exists a sequence of states ending on "fever", of which the probability of producing the given observations is 0.01512. This sequence is precisely (healthy, healthy, fever), which can be found by tracing back which states were used when calculating the maxima (which happens to be the most likely state on each day, though this will not always be the case). In other words, given the observed activities, the patient was most likely to have been healthy on the first day and also on the second day (despite feeling cold that day), and only to have contracted a fever on the third day.
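The table above can be reproduced with a few lines of Python. This sketch tracks only the day-by-day maximum probabilities; recovering the state path additionally requires backpointers, as in the pseudocode.

obs = ["normal", "cold", "dizzy"]
states = ["Healthy", "Fever"]
init = {"Healthy": 0.6, "Fever": 0.4}
trans = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

prob = {s: init[s] * emit[s][obs[0]] for s in states}
print(prob)  # day 1: Healthy 0.3, Fever 0.04
for o in obs[1:]:
    prob = {s: max(prob[r] * trans[r][s] for r in states) * emit[s][o]
            for s in states}
    print(prob)  # day 2: 0.084 / 0.027; day 3: 0.00588 / 0.01512 (up to rounding)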

The operation of Viterbi's algorithm can be visualized by means of a trellis diagram. The Viterbi path is essentially the shortest path through this trellis.

Extensions


A generalization of the Viterbi algorithm, termed the max-sum algorithm (or max-product algorithm) can be used to find the most likely assignment of all or some subset of latent variables in a large number of graphical models, e.g. Bayesian networks, Markov random fields and conditional random fields. The latent variables need, in general, to be connected in a way somewhat similar to a hidden Markov model (HMM), with a limited number of connections between variables and some type of linear structure among the variables. The general algorithm involves message passing and is substantially similar to the belief propagation algorithm (which is the generalization of the forward-backward algorithm).

With an algorithm called iterative Viterbi decoding, one can find the subsequence of an observation that matches best (on average) to a given hidden Markov model. This algorithm is proposed by Qi Wang et al. to deal with turbo code.[9] Iterative Viterbi decoding works by iteratively invoking a modified Viterbi algorithm, reestimating the score for a filler until convergence.

An alternative algorithm, the Lazy Viterbi algorithm, has been proposed.[10] For many applications of practical interest, under reasonable noise conditions, the lazy decoder (using the Lazy Viterbi algorithm) is much faster than the original Viterbi decoder. While the original Viterbi algorithm calculates every node in the trellis of possible outcomes, the Lazy Viterbi algorithm maintains a prioritized list of nodes to evaluate in order, and the number of calculations required is typically fewer (and never more) than for the ordinary Viterbi algorithm for the same result. However, it is harder to parallelize in hardware.

Soft output Viterbi algorithm


The soft output Viterbi algorithm (SOVA) is a variant of the classical Viterbi algorithm.

SOVA differs from the classical Viterbi algorithm in that it uses a modified path metric which takes into account the a priori probabilities of the input symbols, and produces a soft output indicating the reliability of the decision.

The first step in the SOVA is the selection of the survivor path, passing through one unique node at each time instant, t. Since each node has 2 branches converging at it (with one branch being chosen to form the survivor path, and the other being discarded), the difference in the branch metrics (or cost) between the chosen and discarded branches indicates the amount of error in the choice.

This cost is accumulated over the entire sliding window (usually at least five constraint lengths long), to indicate the soft-output measure of reliability of the hard bit decision of the Viterbi algorithm.

from Grokipedia
The Viterbi algorithm is a dynamic programming algorithm that computes the most likely sequence of hidden states, known as the Viterbi path, given a sequence of observed events in a probabilistic model, such as a hidden Markov model (HMM), by maximizing the joint probability of the observations and the state path. It efficiently solves this decoding problem in $O(TN^2)$ time, where $T$ is the length of the observation sequence and $N$ is the number of states, avoiding the exponential cost of enumerating all possible paths. Originally proposed by Andrew J. Viterbi in 1967, the algorithm was developed as an asymptotically optimal method for decoding in communication systems, providing tight error bounds for maximum-likelihood sequence estimation on a trellis structure representing code states over time. In his seminal paper, Viterbi demonstrated that the algorithm achieves the minimum possible decoding error probability for rates above the computational cutoff rate, making it essential for error-correcting codes in noisy channels. A 1973 tutorial by G. David Forney Jr. further formalized and analyzed the algorithm's implementation, emphasizing its trellis-based survivor path selection and applications beyond coding, which popularized its use in diverse fields.

In the context of HMMs, the Viterbi algorithm was adapted in the late 1960s and 1970s as part of broader work on probabilistic sequence modeling, enabling the inference of hidden state sequences from partial observations; for instance, it can initialize the states in the Baum-Welch algorithm for parameter estimation. This adaptation proved pivotal in fields like speech recognition, where it decodes phonetic sequences from acoustic signals. In bioinformatics, it aligns gene sequences or predicts protein structures by finding the most probable hidden state paths in DNA or amino acid models. Other notable applications include digital communications, where it is used for channel equalization, and satellite broadcasting, where it provides error correction in data transmission. The algorithm's efficiency and optimality have made it a foundational tool in signal processing and machine learning, with ongoing optimizations for large-scale state spaces using techniques such as beam search or distance transforms.

Introduction and Background

Overview

The Viterbi algorithm is a dynamic programming algorithm designed to determine the most likely sequence of hidden states, referred to as the Viterbi path, in a hidden Markov model (HMM) given an observed sequence. This approach addresses the challenge of decoding by identifying the path that maximizes the joint probability of the hidden states and the corresponding observations. At its core, the algorithm employs a trellis structure to systematically explore possible state transitions, pruning unlikely paths to avoid the computational cost of exhaustive enumeration. This method ensures an optimal solution without evaluating every conceivable sequence, providing a balance between accuracy and feasibility in probabilistic modeling tasks.

A primary advantage of the Viterbi algorithm is its computational efficiency, with a time complexity of $O(TN^2)$, where $T$ denotes the length of the observation sequence and $N$ the number of possible states, significantly outperforming brute-force alternatives that scale exponentially. Originally developed for error-correcting codes in communication systems, it was introduced by Andrew Viterbi in 1967.

Hidden Markov Models

A hidden Markov model (HMM) is a statistical model that represents a system as a Markov process whose states are hidden from observation, and only emissions dependent on those states are directly observable. The model consists of a set of hidden states $S = \{s_1, s_2, \dots, s_N\}$, where $N$ is the number of states, and a sequence of $T$ observations $O = \{o_1, o_2, \dots, o_T\}$ drawn from an observation alphabet $V = \{v_1, v_2, \dots, v_M\}$, with $M$ possible symbols. The HMM is fully specified by three sets of parameters: the state transition probability matrix $A = [a_{ij}]$, where $a_{ij} = P(q_t = s_j \mid q_{t-1} = s_i)$ for $1 \leq i, j \leq N$ and $q_t$ denoting the state at time $t$; the emission (or observation) probability matrix $B = [b_j(k)]$, where $b_j(k) = P(o_t = v_k \mid q_t = s_j)$ for $1 \leq j \leq N$ and $1 \leq k \leq M$; and the initial state distribution $\pi = [\pi_i]$, where $\pi_i = P(q_1 = s_i)$ for $1 \leq i \leq N$. Collectively, these parameters are denoted as $\lambda = (A, B, \pi)$.

The HMM relies on two key assumptions. First, the first-order Markov property for the hidden states, which states that the probability of transitioning to the next state depends only on the current state: $P(q_t \mid q_{t-1}, q_{t-2}, \dots, q_1) = P(q_t \mid q_{t-1})$. Second, the observations are conditionally independent given the state sequence, meaning that each observation depends solely on the current state and not on previous or future observations or states: $P(o_t \mid o_1, \dots, o_{t-1}, q_1, \dots, q_T, o_{t+1}, \dots, o_T) = P(o_t \mid q_t)$. These assumptions simplify the modeling of sequential data where direct state information is unavailable.

Given a state sequence $Q = \{q_1, q_2, \dots, q_T\}$ and observation sequence $O$, the joint probability under the model is

$$P(Q, O \mid \lambda) = \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^T a_{q_{t-1} q_t} b_{q_t}(o_t),$$

which factors according to the Markov and independence assumptions. Common notations in the literature include uppercase letters for random variables (e.g., $Q_t$ for the state at time $t$) and lowercase for realizations (e.g., $q_t$), with the model $\lambda$ encapsulating all probabilistic dependencies. While the standard formulation assumes discrete emissions, extensions to continuous observations replace the discrete $b_j(k)$ with continuous probability density functions, such as finite mixtures of Gaussians, to handle real-valued data like acoustic features in speech recognition. The Viterbi algorithm finds the most likely state sequence $Q^* = \arg\max_Q P(Q \mid O, \lambda)$ for decoding in HMMs.

Historical Development

Origins and Invention

The Viterbi algorithm was invented by Andrew J. Viterbi in 1967 while he was a faculty member in the School of Engineering and Applied Science at the University of California, Los Angeles (UCLA). Originally developed as a method for maximum-likelihood decoding of convolutional codes transmitted over noisy digital communication channels, it addressed the need for computationally efficient error correction in bandwidth-limited systems. Viterbi, an Italian-American electrical engineer, formulated the algorithm during his research on convolutional coding, drawing on principles of dynamic programming to find the most probable sequence of code symbols given a received signal corrupted by noise.

The algorithm's foundational ideas were detailed in Viterbi's seminal paper, "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm," published in the IEEE Transactions on Information Theory in April 1967. In this work, Viterbi not only introduced the decoding procedure but also derived asymptotic error bounds for convolutional codes, demonstrating that the algorithm achieves near-optimal performance as signal-to-noise ratios improve. The motivation stemmed from pressing challenges in space communications during the 1960s, where missions to planets like Venus and Mars required robust error-correcting codes to combat high noise levels in deep-space channels, yet traditional sequential decoding methods demanded excessive computational resources impractical for real-time ground station processing. This need was particularly acute for NASA's early planetary explorations, which relied on convolutional encoding but lacked efficient decoders until Viterbi's innovation.

Early recognition of the algorithm's potential came swiftly within the coding and communications community. By the late 1960s, prototypes based on the Viterbi decoder were being developed under government contracts, enabling practical implementations for satellite and deep-space communication. In the 1970s, NASA adopted Viterbi decoding for key missions, including the Voyager spacecraft launched in 1977, which used a rate-1/2, constraint-length-7 convolutional code decoded via the algorithm to achieve reliable data recovery from billions of miles away. This integration extended to international standards, with the Consultative Committee for Space Data Systems (CCSDS) incorporating Viterbi-based convolutional coding into its recommendations for deep-space telemetry during the 1980s, building on NASA's prior implementations.

Key Milestones

In the 1970s, the Viterbi algorithm gained traction in digital communications following its formalization through the trellis structure introduced by G. David Forney in 1973, which provided a graphical representation that simplified implementation and analysis for decoding. This advancement enabled efficient hardware realizations and contributed to its adoption in early satellite and spacecraft systems for space and military applications, marking its transition from theory to practical use in noisy channels.

During the 1980s, the algorithm expanded into speech recognition, notably integrated into IBM's Tangora system, a speaker-dependent isolated-utterance recognizer that scaled to 20,000-word vocabularies using hidden Markov models (HMMs) for real-time processing. Early applications also emerged in bioinformatics for sequence analysis, leveraging HMMs to model probabilistic alignments of biological sequences. Additionally, ideas for parallelizing the algorithm to suit hardware constraints were explored, as in J.K. Wolf's 1978 work on efficient decoding architectures, paving the way for high-throughput implementations.

In the 1990s and 2000s, the Viterbi algorithm became standardized in GSM mobile networks for channel decoding of convolutional codes, underpinning error correction in second-generation cellular systems and enabling reliable voice and data transmission worldwide. It also found application in GPS signal processing, where it decodes the convolutional encoding of navigation messages to improve accuracy in low-signal environments. Open-source tools further democratized its use, such as the Hidden Markov Model Toolkit (HTK) released in 1995, which incorporated Viterbi decoding for HMM training and sequence inference in speech and beyond.

Post-2010 developments have extended the algorithm to quantum computing for error correction, including quantum variants applied to quantum low-density parity-check (qLDPC) codes as surveyed in 2015, enhancing fault-tolerant processing. Hybrid approaches integrating neural networks with Viterbi decoding have also proliferated in AI, such as convolutional neural network-HMM systems for improved sequence recognition since the early 2010s.

Core Algorithm

Description

The Viterbi algorithm employs a trellis as its graphical foundation, representing the hidden Markov model (HMM) over time steps $t = 1$ to $T$ on the horizontal axis and the $N$ possible hidden states on the vertical axis, with edges between states at consecutive time steps weighted by the product of transition probabilities $a_{ij}$ and emission probabilities $b_j(o_t)$. The algorithm proceeds via dynamic programming to compute the most likely state sequence, denoted as the Viterbi path, that maximizes the joint probability of the observed sequence and the hidden states given the HMM parameters.

The process begins with initialization at time $t = 1$: for each state $i = 1$ to $N$, set the Viterbi probability $\delta_1(i) = \pi_i b_i(o_1)$, where $\pi_i$ is the initial state probability, and initialize the backpointer $\psi_1(i) = 0$. This step establishes the probability of starting in each state and emitting the first observation $o_1$.

In the recursion phase, for each time step $t = 2$ to $T$ and each state $j = 1$ to $N$,

$$\delta_t(j) = \max_{i=1}^N \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t),$$

with the corresponding backpointer

$$\psi_t(j) = \arg\max_{i=1}^N \left[ \delta_{t-1}(i)\, a_{ij} \right].$$

These recursions propagate the maximum probability paths forward through the trellis, selecting at each node $j$ the predecessor state $i$ that yields the highest probability up to time $t$, scaled by the emission probability for $o_t$. To mitigate numerical underflow from repeated multiplications of small probabilities, a common variant computes in the log-probability domain, replacing products with sums and using $\log \delta_t(j) = \max_i \left[ \log \delta_{t-1}(i) + \log a_{ij} \right] + \log b_j(o_t)$.

At termination, after processing all $T$ observations, the maximum path probability is $P^* = \max_{i=1}^N \delta_T(i)$, and the final state is $q_T^* = \arg\max_{i=1}^N \delta_T(i)$. The optimal state sequence, or Viterbi path, is then recovered via backtracking: for $t = T-1$ down to $1$, set $q_t^* = \psi_{t+1}(q_{t+1}^*)$. This yields the complete sequence $q_1^*, q_2^*, \dots, q_T^*$ that maximizes the probability. The algorithm's optimality follows from the dynamic programming principle applied to the acyclic trellis graph: the maximum-probability path to any node at time $t$ is the maximum over all incoming paths from time $t-1$, ensuring global optimality without exhaustive search.

Pseudocode

The Viterbi algorithm for hidden Markov models (HMMs) can be implemented using dynamic programming to compute the most likely state sequence given an observation sequence. The algorithm maintains a trellis of probabilities and backpointers to track the optimal path. The inputs to the algorithm are an observation sequence $O = o_1, o_2, \dots, o_T$, where each $o_t$ is a discrete observation symbol, and the HMM model parameters $\lambda = (A, B, \pi)$, consisting of the state transition probability matrix $A = \{a_{ij}\}$ (where $a_{ij} = P(q_{t+1}=j \mid q_t=i)$), the observation emission probability matrix $B = \{b_j(k)\}$ (where $b_j(k) = P(o_t = v_k \mid q_t = j)$ for observation symbols $v_k$), and the initial state probability distribution $\pi = \{\pi_i\}$ (where $\pi_i = P(q_1 = i)$). The output is the most likely state sequence $Q = q_1, q_2, \dots, q_T$ that maximizes $P(Q \mid O, \lambda)$. This formulation assumes a discrete-emission HMM, as originally applied in contexts like speech recognition.

The following pseudocode outlines the core procedure, assuming $N$ hidden states and using a 2D array $V[1..T][1..N]$ to store the Viterbi probabilities (the probability of the most likely path ending in state $i$ at time $t$) and a corresponding array backpointer[1..T][1..N] to record the previous state for path reconstruction. Initialization sets the probabilities for the first observation, recursion computes paths for subsequent observations by maximizing over previous states, termination identifies the best ending state, and backtracking reconstructs the full path.

function Viterbi(O, λ = (A, B, π)):
    T ← length(O)
    N ← number of states
    // Initialization
    for i = 1 to N:
        V[1][i] ← π_i * b_i(O_1)
        backpointer[1][i] ← 0    // No previous state
    // Recursion
    for t = 2 to T:
        for j = 1 to N:
            temp ← -∞
            argmax_i ← 0
            for i = 1 to N:
                prob ← V[t-1][i] * a_{i j}
                if prob > temp:
                    temp ← prob
                    argmax_i ← i
            V[t][j] ← temp * b_j(O_t)
            backpointer[t][j] ← argmax_i
    // Termination
    bestpathprob ← max_{i=1 to N} V[T][i]
    bestpathendstate ← argmax_{i=1 to N} V[T][i]
    // Path backtracking
    Q ← array of length T
    Q[T] ← bestpathendstate
    for t = T-1 downto 1:
        Q[t] ← backpointer[t+1][Q[t+1]]
    return Q


In practice, direct multiplication of probabilities over long sequences can lead to floating-point underflow, as values approach zero. To mitigate this, implementations often use log-scaling, replacing products with sums of logarithms (e.g., $\log V[t][j] = \max_i \left( \log V[t-1][i] + \log a_{ij} \right) + \log b_j(o_t)$) and initializing with $-\infty$ for impossible paths; this preserves the maximization while avoiding numerical instability. The above assumes discrete emissions, where $B$ provides probabilities for a finite set of observation symbols, though extensions exist for continuous densities via Gaussian mixtures or other parameterizations.
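A log-domain version of the recursion can be sketched as follows; the parameter layout is assumed to match the earlier pseudocode, and zero probabilities are mapped to -inf so impossible paths never win the maximization.

import math

def viterbi_log(obs, states, init, trans, emit):
    """Log-domain Viterbi: sums of log-probabilities replace products."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    V = [{s: log(init[s]) + log(emit[s][obs[0]]) for s in states}]
    back = [{}]
    for o in obs[1:]:
        V.append({})
        back.append({})
        for j in states:
            best = max(states, key=lambda i: V[-2][i] + log(trans[i][j]))
            V[-1][j] = V[-2][best] + log(trans[best][j]) + log(emit[j][o])
            back[-1][j] = best

    # Backtrack from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))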

Worked Examples

Convolutional Code Decoding

Convolutional codes are linear time-invariant error-correcting codes generated by a finite-state shift register, where the output is a modulo-2 combination of the input bits and the contents of the register, defined by generator polynomials. A simple rate-1/2 code with constraint length 3 (memory of 2 bits) uses generator polynomials $g_1(D) = 1 + D^2$ and $g_2(D) = 1 + D + D^2$, producing two output bits for each input bit through modulo-2 addition over the shift register contents. This code has a 4-state trellis, with states representing the content of the two memory elements: 00, 01, 10, and 11.

The transmission occurs over a binary symmetric channel (BSC) with crossover probability $p$, where each transmitted bit is independently flipped with probability $p < 0.5$, resulting in the received sequence being a noisy version of the transmitted codeword with possible bit errors. The Viterbi algorithm decodes by finding the most likely transmitted codeword given the received bits, using branch metrics based on Hamming distance for hard-decision decoding in the BSC model.

Consider an example with the 4-state trellis for the rate-1/2 code. The input sequence $u = 1010$ (with terminating zero) is encoded to the codeword $v = 11\,01\,00\,01$. The received sequence is $r = 11\,01\,00\,11$, which differs from $v$ in one bit position (the last pair has a single flip from 01 to 11), corresponding to one error in the BSC. The trellis branches are labeled with the input bit and the corresponding output pair; for instance, transitions from each state split into two branches (for input 0 or 1), with outputs determined by the generator polynomials. The following table summarizes the branch labels (state = $s_1 s_2$, outputs $v_1 = u \oplus s_2$, $v_2 = u \oplus s_1 \oplus s_2$; next state = $u\, s_1$):
Current State   Input u   Output   Next State
00              0         00       00
00              1         11       10
01              0         11       00
01              1         00       10
10              0         01       01
10              1         10       11
11              0         10       01
11              1         01       11
The Viterbi algorithm proceeds in three phases: initialization, recursion, and backtracking, as detailed in the core algorithm description. For this example, branch metrics are the Hamming distances between the received pair at each time step and the expected output on each branch (0 if matching, 1 for a one-bit difference, 2 for two). Path metrics $\delta$ are the cumulative minimum distances to each state.

At time $t=1$, received pair $r_1 = 11$. Assuming a start from state 00:
  • Input 0 to state 00: expected 00, metric 2; $\delta(00) = 2$
  • Input 1 to state 10: expected 11, metric 0; $\delta(10) = 0$

Other states have infinite metrics initially. Survivor pointers point to the initializing paths.
At time $t=2$, received pair $r_2 = 01$:
  • To state 00: from 00 (input 0, expected 00 vs 01, metric 1), total 2 + 1 = 3; no other predecessor. $\delta(00) = 3$, pointer from 00.
  • To state 01: from 10 (input 0, expected 01 vs 01, metric 0), total 0 + 0 = 0. $\delta(01) = 0$, pointer from 10.
  • To state 10: from 00 (input 1, expected 11 vs 01, metric 1), total 2 + 1 = 3. $\delta(10) = 3$, pointer from 00.
  • To state 11: from 10 (input 1, expected 10 vs 01, metric 2), total 0 + 2 = 2. $\delta(11) = 2$, pointer from 10.
At time $t=3$, received $r_3 = 00$, the path metrics accumulate similarly, favoring low-error branches; the surviving metrics become $\delta(00) = 2$, $\delta(01) = 3$, $\delta(10) = 0$, and $\delta(11) = 3$. At time $t=4$, received $r_4 = 11$, the transmitted termination branch (from state 10 with input 0, expected 01 vs 11) contributes a metric of 1. The survivor paths merge as unlikely paths are pruned; the single channel error at $t=4$ only raises the best cumulative metric to 1, so the survivor path to the best final state still follows the original sequence's route through states 00 → 10 → 01 → 10 → 01. Backtracking from the final state with the minimum $\delta$ (state 01, with total distance 1) traces the pointers backward, recovering the input bits along the survivor path: u = 1010, successfully correcting the single error.

The trellis diagram consists of four levels (one per time step), with nodes for each state connected by branches labeled with input/output pairs. The survivor paths are marked, showing merging where the error path is discarded, and the correct path dominates by time $t=4$; visually, it forms a diamond-like structure with pruning lines crossing out non-survivors.

This code provides a bit-error-rate (BER) improvement over uncoded transmission on the BSC; for small $p$, the uncoded BER is approximately $p$, while the coded BER is bounded by roughly $P_b \approx (2^k - 1)\, p^{d_{free}/2}$, where $d_{free} = 5$ is the free distance of this code, yielding significant gain (e.g., about 4-5 dB at BER = $10^{-5}$ for moderate $p$).
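The hand computation above can be replicated with a short hard-decision decoder. This is a sketch specific to the rate-1/2, constraint-length-3 code with generators $g_1 = 1 + D^2$ and $g_2 = 1 + D + D^2$; for brevity it stores the decoded bits of each survivor directly instead of backpointers.

def branch(state, u):
    """Return (output pair, next state) for input bit u from state (s1, s2)."""
    s1, s2 = state
    v1 = u ^ s2
    v2 = u ^ s1 ^ s2
    return (v1, v2), (u, s1)

received = [(1, 1), (0, 1), (0, 0), (1, 1)]   # r = 11 01 00 11

INF = float("inf")
metric = {(0, 0): 0, (0, 1): INF, (1, 0): INF, (1, 1): INF}
history = {s: [] for s in metric}             # decoded bits along each survivor

for r in received:
    new_metric = {s: INF for s in metric}
    new_history = {s: [] for s in metric}
    for state in metric:
        if metric[state] == INF:
            continue
        for u in (0, 1):
            out, nxt = branch(state, u)
            d = (out[0] ^ r[0]) + (out[1] ^ r[1])      # Hamming distance
            if metric[state] + d < new_metric[nxt]:
                new_metric[nxt] = metric[state] + d
                new_history[nxt] = history[state] + [u]
    metric, history = new_metric, new_history

best = min(metric, key=metric.get)
print(metric[best], history[best])   # 1 [1, 0, 1, 0]: one channel error corrected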

Sequence Alignment

In bioinformatics, pairwise sequence alignment can be formulated as finding the most probable path in a pair hidden Markov model (HMM), where the Viterbi algorithm efficiently computes the optimal alignment by maximizing the joint probability (or score) of the sequences and the hidden state path. The HMM setup for alignment defines three states: Match (M), where symbols from both sequences are emitted and aligned; Insert (I), where a symbol from the second sequence is emitted (gap in the first); and Delete (D), where a symbol from the first sequence is emitted (gap in the second). Transitions between states incorporate scoring: for instance, matching identical symbols in the M state yields +1, mismatches -1, while opening or extending gaps in I or D states incurs -2. Emissions in M are joint probabilities (or scores) for paired symbols, in I for the second sequence's symbol, and in D for the first sequence's symbol, often derived from substitution matrices, or from a simple identity matrix for DNA.

To illustrate, consider aligning the DNA sequences X = AGCT and Y = AGCATT using this setup, with observations from Y's symbols and a simple identity-based scoring scheme (+1 match, -1 mismatch, -2 gap). The Viterbi algorithm constructs a trellis with three states (M, I, D) across positions in X and Y, initialized at the start with gap penalties (e.g., $v_M(0,0) = 0$, $v_I(0,0) = v_D(0,0) = -\infty$, with initial gaps handled via transitions). Recursion proceeds by maximizing the score at each position (i, j): for state M at (i, j), take the maximum over previous states k of $v_k(i-1, j-1)$ plus the transition score into M plus the emission score for the pair $(x_i, y_j)$; similarly for I (maximum over predecessors, plus the emission score for $y_j$) and D (plus the emission score for $x_i$), where emissions are the substitution or gap scores.

The optimal path through the trellis for this example yields the aligned sequences A G C - T - and A G C A T T, with a total score of 0 (four matches at +1 each and two gaps at -2 each, with no mismatches). Backtracking from the maximum final score traces the state sequence M M M I M I, which indicates matches for the first three positions (A-A, G-G, C-C), an insert in Y (gap in X) for the fourth ( - / A ), a match for the fifth (T / T), and an insert in Y (gap in X) for the sixth ( - / T ). This path explicitly outputs the gapped alignments and the operations performed. The Viterbi algorithm thus serves as a probabilistic generalization of the Needleman-Wunsch dynamic programming method for gapped alignments, where scores can be interpreted as log-probabilities in the pair HMM, enabling extensions that incorporate evolutionary models via emission and transition parameters.
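In the score-based special case, the three-state recursion collapses to the familiar Needleman-Wunsch dynamic program. The sketch below uses the +1/-1/-2 scores from the example and recovers one optimal gapped alignment; a full pair-HMM implementation would instead work with log transition and emission probabilities.

def align(x, y, match=1, mismatch=-1, gap=-2):
    n, m = len(x), len(y)
    # score[i][j] = best score aligning x[:i] with y[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # match/mismatch (M)
                              score[i - 1][j] + gap,      # gap in y (D)
                              score[i][j - 1] + gap)      # gap in x (I)
    # Backtrack to recover the gapped alignment.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if x[i - 1] == y[j - 1] else mismatch):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), score[n][m]

print(align("AGCT", "AGCATT"))   # one optimal alignment with score 0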

Applications

Error-Correcting Codes

The Viterbi algorithm serves as the primary method for maximum-likelihood sequence detection (MLSD) in decoding convolutional codes and trellis-coded modulation (TCM) schemes within digital communication systems. In convolutional coding, it efficiently navigates the trellis structure to identify the most probable transmitted sequence given noisy received signals, leveraging dynamic programming to minimize computational cost. For TCM, introduced by Ungerboeck, the algorithm extends MLSD to joint optimization of coding and modulation, achieving bandwidth-efficient error correction by mapping convolutional code outputs to expanded signal constellations without increasing spectral occupancy.

Integration of Viterbi decoding appears in key communication standards, including the Global System for Mobile Communications (GSM) and Enhanced Data rates for GSM Evolution (EDGE), where it decodes convolutional codes in the full-rate speech channel to protect voice data against channel errors. In Wi-Fi protocols under IEEE 802.11a/g/n, Viterbi decoding handles rate-compatible punctured convolutional codes (e.g., rate 1/2 with constraint length 7) for data and control channels, enabling robust high-speed wireless links. Additionally, in turbo code architectures, Viterbi-based decoders (typically the soft-output variant) decode the constituent convolutional component codes during iterative processing, contributing to near-Shannon-limit performance in hybrid setups.

Hardware implementations of Viterbi decoders in ASICs and FPGAs support real-time operation in modern systems, with designs achieving throughputs exceeding 100 Mbps for constraint-length-7 codes, as demonstrated in LTE-compatible processors. These realizations often incorporate radix-2 or higher parallelism in the add-compare-select operations to meet latency constraints while consuming low power, typically under 100 mW for mobile applications.

Despite its efficiency, the Viterbi algorithm faces computational challenges for high-rate or long-constraint-length codes due to the exponential growth in trellis states, leading to high memory and processing demands. Mitigation strategies include survivor path pruning to limit retained paths and list-output Viterbi decoding, which generates multiple candidate sequences for subsequent correction, reducing complexity by up to 50% in bandwidth-limited scenarios without significant performance loss.

Performance benchmarks highlight the algorithm's impact, with a rate-1/2, constraint-length-7 convolutional code (often denoted as (2,1,7)) providing approximately 5 dB of coding gain at a bit error rate (BER) of $10^{-5}$ over uncoded transmission in additive white Gaussian noise channels, using soft-decision Viterbi decoding. This gain establishes its suitability for reliable transmission in noisy environments, though it plateaus at very low BER due to inherent code limitations.

Speech and Natural Language Processing

In speech recognition, the Viterbi algorithm plays a central role in hidden Markov models (HMMs) that represent phonemes or words as sequences of states, where acoustic features serve as observations and the algorithm identifies the most likely state path through likelihood scores. For instance, systems like CMU Sphinx employ Viterbi decoding to align audio inputs with phonetic models, enabling efficient search over large vocabularies in continuous speech. Similarly, the Kaldi toolkit integrates Viterbi-based search within its HMM-GMM framework to optimize recognition paths, supporting speaker-independent recognition.

In natural language processing, the Viterbi algorithm facilitates sequence labeling tasks by finding the highest-probability tag sequence given word observations in HMM-based models. For part-of-speech tagging, states correspond to grammatical tags (e.g., noun, verb), and emissions model word-tag compatibilities, with Viterbi decoding the optimal path to assign tags efficiently. It extends to named-entity recognition, where states represent entity types (e.g., person, location) and Viterbi resolves ambiguities in contextual labeling. In machine translation, early statistical systems used Viterbi approximations for decoding word alignments and phrase sequences under probabilistic constraints.

The Viterbi algorithm integrates with training methods like the Baum-Welch algorithm, an expectation-maximization technique that estimates HMM parameters (transition and emission probabilities) from unlabeled data, after which Viterbi performs decoding on the refined model. In conditional random fields (CRFs), a discriminative extension of HMMs for NLP, Viterbi computes the maximum a posteriori label sequence over feature-based potentials, avoiding the label bias issues of maximum entropy Markov models. Modern adaptations incorporate Viterbi decoding into end-to-end neural architectures, such as connectionist temporal classification (CTC) variants post-2014, where it extracts the best alignment path from neural network outputs, enhancing efficiency in hybrid CTC-attention models for low-resource languages.
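As an illustration of tag decoding, the following toy sketch treats three tags as hidden states and a three-word sentence as observations; the probabilities are invented for the example, not taken from any trained model, and the recursion is the same max-product used in the pseudocode above.

tags = ["DET", "NOUN", "VERB"]
init = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {
    "DET":  {"DET": 0.05, "NOUN": 0.9,  "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3,  "VERB": 0.6},
    "VERB": {"DET": 0.5,  "NOUN": 0.3,  "VERB": 0.2},
}
emit = {
    "DET":  {"the": 0.9, "dog": 0.0, "barks": 0.0},
    "NOUN": {"the": 0.0, "dog": 0.6, "barks": 0.1},
    "VERB": {"the": 0.0, "dog": 0.0, "barks": 0.7},
}
words = ["the", "dog", "barks"]

# Max-product recursion with backpointers.
V = [{t: init[t] * emit[t][words[0]] for t in tags}]
back = [{}]
for w in words[1:]:
    V.append({}); back.append({})
    for t in tags:
        prev = max(tags, key=lambda r: V[-2][r] * trans[r][t])
        V[-1][t] = V[-2][prev] * trans[prev][t] * emit[t][w]
        back[-1][t] = prev
seq = [max(tags, key=lambda t: V[-1][t])]
for i in range(len(words) - 1, 0, -1):
    seq.append(back[i][seq[-1]])
print(list(reversed(seq)))   # ['DET', 'NOUN', 'VERB']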

Extensions and Variants

Soft-Output Variant

The soft-output variant of the Viterbi algorithm, known as the soft-output Viterbi algorithm (SOVA), extends the standard hard-decision Viterbi decoder by providing not only the most likely sequence but also reliability measures for each decoded bit, enabling the exchange of extrinsic information in iterative decoding schemes. This is particularly valuable in concatenated coding systems, where the hard Viterbi output alone limits performance by discarding probabilistic information from the received observations, whereas soft outputs allow subsequent decoders to refine estimates through iterations. The core modification to the algorithm involves extending the forward to track the two best paths reaching each state, rather than just the single best path, to facilitate reliability assessment during . In the traceback phase, for each decoded bit along the surviving path, the algorithm identifies the first position where an alternative path (differing in that bit) merges with the best path; the path metric difference Δ is then computed as Δ = δ_best - δ_alternative, where δ_best and δ_alternative are the cumulative metrics of the best and competing paths, respectively, serving as a measure of decision . At termination, the log-likelihood ratio for each bit is approximated as L = log(P(bit=0|O)/P(bit=1|O)) ≈ (1/2) min(Δ over conflicting paths), where O denotes the observations; this provides a soft value whose sign indicates the hard decision and magnitude reflects reliability. Compared to the full BCJR algorithm, which computes exact maximum (MAP) symbol probabilities using forward-backward recursions, SOVA offers a lower-complexity of these probabilities, achieving similar in iterative contexts with computational demands approximately twice that of the standard Viterbi algorithm O(T N^2), comparable to BCJR, where T is the sequence length and N the number of states. SOVA found early application in , as proposed by Berrou et al. in 1993, where it serves as a component decoder to generate soft outputs for iterative exchange between parallel convolutional decoders, approaching Shannon-limit . It has also been employed in low-density parity-check (LDPC) decoding for concatenated systems and is integrated into (UMTS) and (LTE) standards under specifications for efficient turbo decoding in mobile communications.

Parallel and Approximate Versions

The Viterbi algorithm's sequential nature limits its scalability for high-throughput applications, prompting the development of parallel variants that distribute computations across multiple units. One early approach involves block-based parallel processing, where the trellis is divided into independent segments, with survivor paths synchronized at block boundaries to maintain optimality. This method enables concurrent execution of add-compare-select (ACS) operations across blocks, achieving throughput proportional to the number of parallel units while preserving the maximum-likelihood decoding property. A seminal hardware-oriented parallelization uses systolic array architectures, which map the ACS operations onto a linear or two-dimensional array of processing elements that propagate data in a pipelined manner, minimizing inter-processor communication and enabling real-time decoding for convolutional codes in the 1980s. These designs, such as those employing locally connected VLSI structures, reduced latency for constraint lengths up to 7 by exploiting the algorithm's regularity, with implementations demonstrating real-time decoding on early custom chips.

Approximate methods address the exponential complexity of the full Viterbi search, particularly for large state spaces or long sequences, by pruning or reducing the trellis exploration while aiming for near-optimal performance. Beam search, a common pruning strategy, retains only the top-K most probable paths at each time step, discarding lower-scoring branches to limit the active state set and reduce the computational load from O(T · 2^K) to O(T · M^2), where T is the sequence length, K the constraint length (with 2^K states), and M the beam width. In reduced-state sequence detection, techniques like the M-algorithm merge states or use decision feedback to construct a trellis with fewer nodes, such as partitioning constellations into subsets for partial-response channels, achieving up to 50% state reduction with minimal error increase in high-order modulation scenarios. For further complexity mitigation in resource-constrained environments, the list Viterbi algorithm extends the search to output the K best surviving paths rather than a single path, incurring complexity of O(T · K · N), where N is the number of states, enabling applications requiring multiple hypotheses such as error correction in concatenated codes. This variant finds utility in massive MIMO systems, where it supports low-latency detection in multi-user scenarios by listing candidate sequences for subsequent processing, with implementations showing feasible operation for up to 64 antennas under practical list sizes of K = 8-16. Trade-offs in approximate methods, such as beam pruning, often still yield near-maximum-likelihood performance; for instance, in continuous speech recognition tasks, a beam width of 200-500 paths can reduce runtime by 20-40% compared to full search while maintaining word error rates within 1-2% of optimal on benchmark corpora like WSJ.

Recent advances leverage modern hardware for parallel implementations, including GPU and FPGA accelerators post-2010, which exploit massive parallelism for block-processed trellises. GPU-based decoders, such as those using bitslicing to vectorize ACS operations across thousands of threads, achieve throughputs over 1 Gbps for rate-1/2 codes with constraint length 7, outperforming CPU baselines by 10-50x in video decoding pipelines. FPGA variants employ unrolled systolic-like arrays with dynamic reconfiguration, supporting adaptive constraint lengths and delivering real-time performance for channel decoding at latencies under 1 μs.
These hardware mappings highlight the algorithm's adaptability, balancing precision with scalability in emerging data-intensive domains.
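To make the beam pruning described above concrete, the following sketch keeps only the beam_width highest-scoring states at each step; with beam_width equal to the number of states it reduces to the exact algorithm. The parameter layout follows the earlier HMM examples and is assumed for illustration.

import math

def beam_viterbi(obs, states, init, trans, emit, beam_width=2):
    """Beam-pruned Viterbi: retain only the beam_width best states per step."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # active: state -> (log score, path so far)
    active = {s: (log(init[s]) + log(emit[s][obs[0]]), [s]) for s in states}
    active = dict(sorted(active.items(), key=lambda kv: kv[1][0],
                         reverse=True)[:beam_width])

    for o in obs[1:]:
        candidates = {}
        for r, (score, path) in active.items():        # only surviving states
            for s in states:
                new = score + log(trans[r][s]) + log(emit[s][o])
                if s not in candidates or new > candidates[s][0]:
                    candidates[s] = (new, path + [s])
        active = dict(sorted(candidates.items(), key=lambda kv: kv[1][0],
                             reverse=True)[:beam_width])

    best_score, best_path = max(active.values(), key=lambda v: v[0])
    return best_path, best_score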
