Hubbry Logo
Substitution matrixSubstitution matrixMain
Open search
Substitution matrix
Community hub
Substitution matrix
logo
7 pages, 0 posts
0 subscribers
Be the first to start a discussion here.
Be the first to start a discussion here.
Substitution matrix
Substitution matrix
from Wikipedia

In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.[1]

Background

[edit]

In the process of evolution, from one generation to the next the amino acid sequences of an organism's proteins are gradually altered through the action of DNA mutations. For example, the sequence

ALEIRYLRD

could mutate into the sequence

ALEINYLRD

in one step, and possibly

AQEINYQRD

over a longer period of evolutionary time. Each amino acid is more or less likely to mutate into various other amino acids. For instance, a hydrophilic residue such as arginine is more likely to be replaced by another hydrophilic residue such as glutamine, than it is to be mutated into a hydrophobic residue such as leucine. (Here, a residue refers to an amino acid stripped of a hydrogen and/or a hydroxyl group and inserted in the polymeric chain of a protein.) This is primarily due to redundancy in the genetic code, which translates similar codons into similar amino acids. Furthermore, mutating an amino acid to a residue with significantly different properties could affect the folding and/or activity of the protein. This type of disruptive substitution is likely to be removed from populations by the action of purifying selection because the substitution has a higher likelihood of rendering a protein nonfunctional.[2]

If we have two amino acid sequences in front of us, we should be able to say something about how likely they are to be derived from a common ancestor, or homologous. If we can line up the two sequences using a sequence alignment algorithm such that the mutations required to transform a hypothetical ancestor sequence into both of the current sequences would be evolutionarily plausible, then we'd like to assign a high score to the comparison of the sequences.

To this end, we will construct a 20x20 matrix where the th entry is equal to the probability of the th amino acid being transformed into the th amino acid in a certain amount of evolutionary time. There are many different ways to construct such a matrix, called a substitution matrix. Here are the most commonly used ones:

Identity matrix

[edit]

The simplest possible substitution matrix would be one in which each amino acid is considered maximally similar to itself, but not able to transform into any other amino acid. This matrix would look like

This identity matrix will succeed in the alignment of very similar amino acid sequences but will be miserable at aligning two distantly related sequences. We need to figure out all the probabilities in a more rigorous fashion. It turns out that an empirical examination of previously aligned sequences works best.

Log-odds matrices

[edit]

We express the probabilities of transformation in what are called log-odds scores. The scores matrix S is defined as

where is the probability that amino acid transforms into amino acid , and , are the frequencies of amino acids i and j. The base of the logarithm is not important, and the same substitution matrix is often expressed in different bases.

Example amino-acid matrices

[edit]

PAM

[edit]

One of the first amino acid substitution matrices, the PAM (Point Accepted Mutation) matrix was developed by Margaret Dayhoff in the 1970s. This matrix is calculated by observing the differences in closely related proteins. Because the use of very closely related homologs, the observed mutations are not expected to significantly change the common functions of the proteins. Thus the observed substitutions (by point mutations) are considered to be accepted by natural selection.

One PAM unit is defined as 1% of the amino acid positions that have been changed. To create a PAM1 substitution matrix, a group of very closely related sequences with mutation frequencies corresponding to one PAM unit is chosen. Based on collected mutational data from this group of sequences, a substitution matrix can be derived. This PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed.

The PAM1 matrix is used as the basis for calculating other matrices by assuming that repeated mutations would follow the same pattern as those in the PAM1 matrix, and multiple substitutions can occur at the same site. With this assumption, the PAM2 matrix can estimated by squaring the probabilities. Using this logic, Dayhoff derived matrices as high as PAM250. Usually the PAM 30 and the PAM70 are used.

BLOSUM

[edit]

Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences. Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales. The BLOSUM (BLOck SUbstitution Matrix) series of matrices rectifies this problem. Henikoff & Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins. The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments. These conserved sequences are assumed to be of functional importance within related proteins and will therefore have lower substitution rates than less conserved regions. To reduce bias from closely related sequences on substitution rates, segments in a block with a sequence identity above a certain threshold were clustered, reducing the weight of each such cluster (Henikoff and Henikoff). For the BLOSUM62 matrix, this threshold was set at 62%. Pairs frequencies were then counted between clusters, hence pairs were only counted between segments less than 62% identical. One would use a higher numbered BLOSUM matrix for aligning two closely related sequences and a lower number for more divergent sequences.

It turns out that the BLOSUM62 matrix does an excellent job detecting similarities in distant sequences, and this is the matrix used by default in most recent alignment applications such as BLAST.

It also turns out the BLOSUM computer code written by Henikoff and Henikoff does not exactly match the description in their paper. Surprisingly, this commonly used "wrong" version has better search performance.[3]

Differences between PAM and BLOSUM

[edit]
  1. PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the branches of a phylogenetic tree: maximum parismony), whereas the BLOSUM matrices are based on an implicit model of evolution.
  2. The PAM matrices are based on mutations observed throughout a global alignment, this includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps.
  3. The method used to count the replacements is different: unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same.
  4. Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than BLOSUM50.

Newer matrices

[edit]

A number of newer substitution matrices have been proposed to deal with inadequacies in earlier designs.

  • JTT (1992). Published in the same year as BLOSOM, it also performs clustering and uses an implicit model. This may help reduce the systematic error from maximum parismony (MP), but also wastes sequence information.[4]
  • VTML (2001), a PAM-like matrix based on the alignments in the SYSTERS database, iteratively improved using a maximum likelihood estimator starting from the 1970s Dayhoff PAM model.[5]
  • WAG (Wheelan And Goldman, 2001) uses a maximum likelihood estimating procedure instead of any form of MP over a "BRKALN" dataset. The substitution scores are calculated based on the likelihood of a change considering multiple tree topologies derived using neighbor-joining. The scores correspond to an substitution model which includes also amino-acid stationary frequencies and a scaling factor in the similarity scoring. There are two versions of the matrix: WAG matrix based on the assumption of the same amino-acid stationary frequencies across all the compared protein and WAG* matrix with different frequencies for each of included protein families.[4]
  • PMB (Probability Matrix from Blocks, 2003), a set of "true" substitution frequencies estimated from the observed frequencies of BLOSUM, taking into account the possibility of a later substitution masking a previous one. It thus creates a evolutionary model where the distances have theoretical meaning (BLOSUM does not have this feature, unlike PAM, WAG, and most other later matrices, and hence is not recommended for phylogeny by IQ-TREE).[6]
  • LG (2008), which uses a larger dataset (Pfam-based) than WAG. An extension of the WAG algorithm is used, with a new PhyML (WAG+Γ4) model taking into account of sites with different evolutionary rates.[7]
  • Qmaker and nQmaker (2021, 2022), programs with the ability to estimate time-reversible and nonreversible matrices from very large datasets quickly. Each provide a general matrix and 5 specialized matrices, for a total of 12 precalculated substitution matrices.[8][9]
  • Matrices using a selection of proteins based on structual relatedness, as proposed by Benner et al. (1994), Fan (2004), and Steven et al. (2004).[5]
  • Matrices using structual alignments of proteins instead of simple sequence alignment (6 separate publications).[5]
  • Matrices using known physiochemical parameters of amino acid residues (5 separate publications).[5]

For a list of more models (including irreversible i.e. asymmetric and/or specialized ones), see the documentation for recent bioinformatic software including IQ-Tree,[10] PhyML,[11] and RAxML.[12]

Specialized substitution matrices and their extensions

[edit]

The real substitution rates in a protein depends not only on the identity of the amino acid, but also on the specific structural or sequence context it is in. Many specialized matrices have been developed for these contexts, such as in transmembrane alpha helices,[13] for combinations of secondary structure states and solvent accessibility states,[14][15][16] or for local sequence-structure contexts.[17] These context-specific substitution matrices lead to generally improved alignment quality at some cost of speed but are not yet widely used.

Since the 2000s, an increasing amount of matrices are defined for subsets of proteins not optimally aligned by traditional "general-purpose" matrices. These include:[5]

  • PfSSM (2008), CBM and CCF (2008) for Plasmodium proteins, which have a different amino acid evolutionary bias due to the low GC content of the genome.
  • Matrices for transmembrane proteins. JTT transmembrane (1994) is the first of the class. Later work include:
    • For alpha-helical transmembrane proteins, PHAT (2000) and SLIM (2001).
    • For beta-barrel transmembrane proteins, bbTM (2008).
  • Matrices for a specific protein family, including GPCRtm (2015) for the transmembrane (mostly helical) regions of GPCRs.
  • Matrices for proteins with a specific role, including Hubsm (2017) for "hub proteins" in protein‐protein interaction networks.
  • Matrices for intrinsically disordered proteins, including DUNMat (2002), MidicMat (2009), Disorder (2010), and EDSSMat (2019).


Recently, sequence context-specific amino acid similarities have been derived that do not need substitution matrices but that rely on a library of sequence contexts instead. Using this idea, a context-specific extension of the popular BLAST program has been demonstrated to achieve a twofold sensitivity improvement for remotely related sequences over BLAST at similar speeds (CS-BLAST).

Nucleotide matrices

[edit]

With nucleotides only having four possible values (in most bioinformatic sequences), the emphasis lies not in setting fixed values in the matrix, but in designing parameterized models that fit the observed evolution of the input sequence as it's being aligned. See Models of DNA evolution.

Terminology

[edit]

Although "transition matrix" is often used interchangeably with "substitution matrix" in fields other than bioinformatics, the former term is problematic in bioinformatics. With regards to nucleotide substitutions, "transition" is also used to indicate those substitutions that are between the two-ring purines (A → G and G → A) or are between the one-ring pyrimidines (C → T and T → C). Because these substitutions do not require a change in the number of rings, they occur more frequently than the other substitutions. "Transversion" is the term used to indicate the slower-rate substitutions that change a purine to a pyrimidine or vice versa (A ↔ C, A ↔ T, G ↔ C, and G ↔ T).

See also

[edit]

References

[edit]

Further reading

[edit]
[edit]
Revisions and contributorsEdit on WikipediaRead on Wikipedia
from Grokipedia
A substitution matrix, also known as a scoring matrix, is a table used in bioinformatics to assign scores to alignments of biological sequences, such as the 20×20 table for residues in protein sequences or 4×4 for , quantifying the evolutionary likelihood of one residue substituting for another based on observed frequencies. These matrices model substitution rates over evolutionary time, with positive scores for conservative changes that preserve or function, zero for substitutions as likely as chance, and negative scores for disruptive ones, thereby enabling the detection of distant homologies in sequence comparisons. The foundational substitution matrices were the PAM (Point Accepted Mutation) series, introduced by Margaret Dayhoff and colleagues in 1978, which were constructed from 1,572 observed mutations in 71 closely related protein families to estimate evolutionary distances at 1% change (PAM1), with higher-numbered matrices like PAM250 extrapolated for greater divergence. In 1992, Steven Henikoff and Jorja Henikoff developed the (Blocks Substitution Matrix) family, deriving scores from conserved ungapped blocks in the BLOCKS database of distantly related proteins, with becoming the default for many alignment tools due to its balance for moderate evolutionary distances. These empirical matrices, often in log-odds format to reflect probabilistic substitutions relative to chance, form the basis for dynamic programming algorithms in pairwise and multiple sequence alignments. Substitution matrices are integral to bioinformatics applications such as database searching (e.g., BLAST), phylogenetic inference, and , where selecting an appropriate matrix—PAM for close relatives or for distant ones—significantly impacts alignment accuracy and sensitivity. Specialized variants, like those adjusted for compositional biases in transmembrane or disordered proteins, have since extended their utility to niche datasets, underscoring their role in capturing diverse evolutionary patterns across proteomes.

Background

Definition and Purpose

A substitution matrix is a square array of scores that quantifies the favorability or likelihood of one biological residue, such as an or , replacing another during evolutionary processes in aligned sequences. These scores are derived from observed substitution frequencies in related sequences, reflecting the relative ease of based on chemical, physical, and biological of the residues. The primary purpose of substitution matrices is to score alignments in bioinformatics algorithms, such as BLAST and Smith-Waterman, by assigning positive values to conservative substitutions that preserve function and negative values to unlikely ones, thus distinguishing evolutionarily related sequences from random similarities. This enables the detection of distant homologs by modeling evolutionary relatedness rather than exact matches. A common implementation is the log-odds matrix, which compares observed substitution probabilities to those expected under random chance. Substitution matrices improve the accuracy of pairwise and multiple sequence alignments, which form the basis for phylogenetic analysis by informing substitution models in tree construction. They also support by enhancing homology detection and template alignment in comparative modeling approaches. The concept originated in the for modeling protein through empirical substitution data.

Historical Development

The development of substitution matrices began in the with pioneering work by Margaret Dayhoff and her colleagues at the National Biomedical Research Foundation, who analyzed protein mutation probabilities to create the first quantitative models for substitutions. Dayhoff's approach focused on estimating evolutionary changes through observed mutations in closely related protein sequences, laying the groundwork for matrices that could score alignments based on biological realism rather than arbitrary penalties. This effort culminated in the publication of the Atlas of Protein Sequence and Structure in 1978, a comprehensive that included the initial (PAM) matrices derived from 1,572 mutations across 71 protein families, marking a key milestone in bioinformatics by standardizing protein evolution modeling. In the 1980s and , advancements shifted toward more robust matrices suited for diverse evolutionary distances and alignment types. A significant innovation came in 1992 when Steven and Jorja Henikoff introduced the BLOcks SUbstitution Matrix () series, derived from conserved blocks in protein families to capture local alignment patterns, contrasting with the global alignments emphasized in earlier models. This methodology improved sensitivity for detecting distant homologs, and by the late , BLOSUM62 emerged as the default scoring matrix in the Basic Local Alignment Search Tool (BLAST), enhancing the accuracy of database searches for protein similarities. The 2000s saw expansions into context-specific substitution matrices tailored to structural or functional features, such as secondary structures, to refine alignment scores in specialized scenarios. Concurrently, nucleotide substitution matrices gained prominence in molecular evolution studies, with models like the general time-reversible (GTR) matrix enabling better estimation of substitution rates across DNA sequences for phylogenetic inference. Post-2010 trends integrated structural data from the (PDB) with techniques to develop structure-aware matrices that account for three-dimensional conformations. Recent developments from 2020 to 2025, inspired by deep learning advancements like for , have further advanced this by incorporating predicted folds into substitution scoring for improved evolutionary modeling. Notable examples include the 3Di matrix, introduced in 2025, which uses structural alignments for to resolve deep evolutionary relationships previously intractable with sequence-based methods alone.

Basic Substitution Matrices

Identity Matrix

The identity matrix represents the most basic substitution matrix in bioinformatics, assigning positive scores to identical residues and zero or negative scores to non-identical ones, thereby prioritizing exact matches without considering any biological context for changes. It is structured as a diagonal matrix, where the main diagonal elements are positive (commonly +1) and all off-diagonal elements are zero or negative (e.g., 0 or -1). For sequence alignment, the is a 20×20 with +1 scores on the diagonal for the 20 standard and 0 elsewhere, ensuring that only exact matches contribute positively to the alignment score. Similarly, for sequences, it takes the form of a 4×4 matrix over the {A, T, G, C}, with the same diagonal pattern. An example of this is shown below, where matches score +1 and mismatches score 0:
ATGC
A1000
T0100
G0010
C0001
This matrix finds application in exact string matching tasks, as a starting point for developing more sophisticated alignment methods, and in tools like nucleotide BLAST, which employs an identity-based scoring system with a default match score of +1 and mismatch penalty of -3 for rapid searches of closely related sequences. It serves as a baseline in initial database queries where high similarity is expected, facilitating quick identification of near-identical hits. A key limitation of the identity matrix is its inability to account for evolutionary substitutions, resulting in poor performance for aligning divergent sequences where conservative changes between related residues go unrecognized and are penalized equally to unrelated mismatches. In contrast to log-odds matrices designed for evolved sequences, it assumes no probability of meaningful substitutions beyond identity.

Simple Scoring Schemes

Simple scoring schemes in sequence alignment predate empirical substitution matrices and rely on fixed, rule-based assignments rather than observed evolutionary frequencies. One of the most basic approaches is the match-mismatch scoring system, where identical residues receive a fixed positive score, such as +5, and any non-identical substitution incurs a uniform negative penalty, such as -4, regardless of the specific residues involved. This scheme treats all matches equally and all mismatches equivalently, simplifying the alignment process without considering biological context. Another rudimentary method involves chemical similarity schemes, which assign scores based on ad-hoc assessments of physicochemical properties like polarity, , or composition. For instance, Grantham's 1974 distance formula calculates differences between using three properties—side-chain composition, polarity, and —to derive a dissimilarity score, with lower values indicating more similar residues suitable for conservative substitutions. Similarly, Miyata et al. in 1979 proposed a matrix distinguishing two substitution types: conservative changes preserving and polarity, scored more favorably, and radical changes disrupting these properties, scored lower. These schemes aim to reflect biochemical compatibility but rely on groupings rather than data-driven probabilities. Such simple scoring approaches were prevalent in early computational alignments during the 1960s and 1970s, before the advent of empirical matrices. The Needleman-Wunsch algorithm, introduced in 1970, employed a basic match score of +1 for identical and effectively 0 for mismatches, focusing on maximizing the number of matches in global alignments of proteins like and hemoglobins. This era's methods, including manual and initial automated alignments, often incorporated physicochemical heuristics to guide substitutions, as limited sequence data precluded statistical modeling. The can be viewed as a special case of match-mismatch scoring where mismatches receive a score of 0. These schemes offer advantages in computational simplicity and speed, making them suitable for initial implementations on limited hardware of the time, and they provide an intuitive framework for penalizing dissimilar alignments. However, they suffer significant drawbacks by ignoring varying evolutionary substitution rates—such as the higher likelihood of transitions over transversions in or conservative changes in proteins—leading to suboptimal detection of distant homologs compared to empirical matrices like PAM or . For , a typical simple scoring table applies a uniform mismatch penalty, as shown below:
ACGT
A+1-4-4-4
C-4+1-4-4
G-4-4+1-4
T-4-4-4+1
This example assigns +1 to and -4 to all mismatches, optimizing for high-similarity alignments like closely related genes.

Construction of Log-Odds Matrices

Theoretical Principles

Substitution matrices are constructed from observed frequencies of or substitutions in closely related, aligned biological sequences, capturing the evolutionary acceptability of such changes by quantifying how often specific pairs align compared to random expectation. These frequencies reflect patterns of , where substitutions that preserve protein function or structure occur more frequently than those that disrupt it. The scoring of substitutions relies on comparing target frequencies—derived from aligned sequences—to background frequencies, which represent the expected occurrence of residues in unrelated sequences. This approach leverages between residue pairs, measuring the dependency or shared information that indicates evolutionary relatedness beyond chance. Substitutions are thus scored relative to random chance, with positive scores for likely evolutionary alignments and negative for improbable ones. Underlying these matrices are evolutionary models that assume a Markov process for mutations, where the probability of a substitution depends only on the current state and accumulates over time without memory of prior changes. This framework distinguishes conservative substitutions, which involve chemically similar residues and occur more frequently to maintain functional constraints, from radical ones that alter properties and are rarer. The general scoring function introduces a log-odds ratio, expressed as S(i,j)=log(fijfifj)S(i,j) = \log \left( \frac{f_{ij}}{f_i \cdot f_j} \right), where fijf_{ij} is the joint frequency of residues ii and jj in alignments, and fif_i, fjf_j are their marginal frequencies; this provides a probabilistic measure of substitution likelihood. In sequence alignment algorithms, these scores integrate into dynamic programming frameworks, such as the Needleman-Wunsch method, to evaluate pairwise matches against gap penalties and optimize overall alignment scores for detecting homology. Empirical implementations like PAM and matrices apply this theoretical foundation to real data for practical use in bioinformatics tools.

Mathematical Derivation

The construction of log-odds substitution matrices begins with defining key probabilities derived from empirical data. The target frequency MijM_{ij} represents the observed probability that residues ii and jj are aligned in a set of biologically related sequences, estimated by counting aligned pairs across multiple alignments and normalizing by the total number of aligned pairs. The background frequency QiQ_i (or pip_i) is the marginal probability of residue ii occurring in unrelated or random sequences, typically computed as the frequency across a large reference set of sequences, such as Qi=jMijQ_i = \sum_j M_{ij}. The core log-odds score S(i,j)S(i,j) quantifies the relative likelihood of observing the substitution ii to jj under an evolutionary model versus random chance. It is derived as the logarithm of the : S(i,j)=klog(MijQiQj),S(i,j) = k \cdot \log \left( \frac{M_{ij}}{Q_i Q_j} \right), where the denominator QiQjQ_i Q_j assumes between residues under the null (random) model, and kk is a scaling constant. This formula arises from the in statistical alignment: the probability of the data given related s (P(datarelated)=MijP(\text{data} | \text{related}) = M_{ij}) divided by the probability under (P(datarandom)=QiQjP(\text{data} | \text{random}) = Q_i Q_j), with the log ensuring additivity of scores over sequence positions. Variants include the natural logarithm (ln\ln) for computational convenience, base-10 logarithm scaled by 10 (as in early models to yield integer-like values around 0 to 20), or base-2 logarithm for scores in bits (information-theoretic units). For practical use, scaling to full bits (k=1/ln21.442k = 1 / \ln 2 \approx 1.442) or half-bits (k=2/ln22.885k = 2 / \ln 2 \approx 2.885) followed by rounding to integers ensures efficient storage and negative expected scores for random alignments (typically around -0.5 to -4 bits per position). Normalization ensures the matrix's utility in alignment algorithms. (S(i,j)=S(j,i)S(i,j) = S(j,i)) is achieved because MijM_{ij} is undirected in pairwise counts from alignments, though asymmetric forms exist for directed evolutionary models. Diagonals S(i,i)S(i,i) are positive since (Mii>Qi2M_{ii} > Q_i^2) are more probable than random, reflecting conservation. For low-frequency substitutions where Mij0M_{ij} \approx 0, pseudocounts (small uniform additions, e.g., α=0.1\alpha = 0.1 to counts) are incorporated to prevent undefined logs and smooth estimates: adjusted Mij=(countij+α)/(total pairs+20α)M_{ij}' = (count_{ij} + \alpha) / (\text{total pairs} + 20\alpha) for a 20-residue , reducing in sparse data. The derivation relies on several assumptions. Positions in sequences evolve independently, allowing additive log scores without inter-position dependencies. Background frequencies QiQ_i are stationary, remaining constant over evolutionary time despite overall composition shifts. Time-scaling accounts for evolutionary distance: target frequencies MijM_{ij} are specific to a divergence level (e.g., 1% accepted ), and for greater distances, a base probability matrix is raised to a power tt (e.g., via matrix exponentiation) before computing logs, modeling progressive . These assumptions enable the matrix to approximate a continuous-time Markov process for substitutions. Computationally, the matrix is built in four main steps. First, and align closely related sequences (e.g., proteins sharing >85% identity) to capture accepted . Second, aligned pairs: for each column pair in the alignments, increment countijcount_{ij}. Third, estimate frequencies: Mij=countij/k,lcountklM_{ij} = count_{ij} / \sum_{k,l} count_{kl} and Qi=jMijQ_i = \sum_j M_{ij}, optionally applying pseudocounts. Fourth, compute scores: S(i,j)=klog(Mij/(QiQj))S(i,j) = k \cdot \log(M_{ij} / (Q_i Q_j)) and round to integers. As an illustrative example, consider a toy alignment dataset with a reduced {A, B} and 100 aligned pairs yielding counts: 60 AA, 10 AB, 10 BA, 20 BB. Then, M=(0.60.10.10.2),QA=0.7,QB=0.3.M = \begin{pmatrix} 0.6 & 0.1 \\ 0.1 & 0.2 \end{pmatrix}, \quad Q_A = 0.7, \quad Q_B = 0.3. Using natural log scaling (k=1k=1), S(A,A)=ln(0.6/(0.70.7))0.20,S(A,B)=ln(0.1/(0.70.3))0.74.S(A,A) = \ln(0.6 / (0.7 \cdot 0.7)) \approx 0.20, \quad S(A,B) = \ln(0.1 / (0.7 \cdot 0.3)) \approx -0.74. This shows positive matches and negative mismatches. Pseudocode for the derivation (in Python-like syntax) is as follows:

# Step 1-2: Align and count (simplified) counts = initialize_matrix(alphabet_size, zero=True) for alignment in related_alignments: for pos1, pos2 in pairwise_positions(alignment): i, j = residue_at(pos1), residue_at(pos2) counts[i][j] += 1 counts[j][i] += 1 # For symmetry # Step 3: Estimate frequencies with pseudocounts (alpha=0.0001) total_pairs = sum_all(counts) alpha = 0.0001 * len([alphabet](/page/Alphabet)) for i in [alphabet](/page/Alphabet): for j in [alphabet](/page/Alphabet): M[i][j] = (counts[i][j] + alpha) / (total_pairs + len([alphabet](/page/Alphabet))**2 * alpha) Q[i] = sum_j M[i][j] # Step 4: Compute log-odds (k=2 / ln(2) for half-bits) import math k = 2 / math.log(2) for i in alphabet: for j in alphabet: if M[i][j] > 0: S[i][j] = round(k * math.log(M[i][j] / (Q[i] * Q[j]))) else: S[i][j] = round(k * math.log(1e-10)) # Floor for zeros

# Step 1-2: Align and count (simplified) counts = initialize_matrix(alphabet_size, zero=True) for alignment in related_alignments: for pos1, pos2 in pairwise_positions(alignment): i, j = residue_at(pos1), residue_at(pos2) counts[i][j] += 1 counts[j][i] += 1 # For symmetry # Step 3: Estimate frequencies with pseudocounts (alpha=0.0001) total_pairs = sum_all(counts) alpha = 0.0001 * len([alphabet](/page/Alphabet)) for i in [alphabet](/page/Alphabet): for j in [alphabet](/page/Alphabet): M[i][j] = (counts[i][j] + alpha) / (total_pairs + len([alphabet](/page/Alphabet))**2 * alpha) Q[i] = sum_j M[i][j] # Step 4: Compute log-odds (k=2 / ln(2) for half-bits) import math k = 2 / math.log(2) for i in alphabet: for j in alphabet: if M[i][j] > 0: S[i][j] = round(k * math.log(M[i][j] / (Q[i] * Q[j]))) else: S[i][j] = round(k * math.log(1e-10)) # Floor for zeros

This general approach underpins amino acid matrices like the PAM series.

Amino Acid Substitution Matrices

PAM Matrices

Point Accepted Mutation (PAM) matrices, the first empirical substitution matrices for proteins, were developed by Margaret Dayhoff and colleagues in 1978 using data from 71 closely related protein families spanning 34 superfamilies, with a total of 1,572 observed accepted . These matrices model evolutionary changes based on the concept of "accepted ," where 1 PAM (PAM1) represents an average evolutionary distance at which 1% of have changed. The construction begins with a mutation probability matrix (M) derived from alignments of closely related sequences, where entries MijM_{ij} indicate the probability that amino acid ii mutates to jj over 1 PAM unit, accounting for relative mutabilities (e.g., asparagine at 134, tryptophan at 18) and amino acid frequencies (e.g., glycine at 0.089, tryptophan at 0.010). For greater evolutionary distances, the PAMn matrix is obtained by raising the PAM1 matrix to the power nn via matrix multiplication, allowing extrapolation to scenarios like PAM250, where sequences have diverged by 250% (retaining about 20% identity). This probability matrix is then converted to a log-odds scoring matrix, where scores are 10log10(Mij/fi)10 \log_{10} (M_{ij} / f_i), with fif_i as the background frequency of ii, rounded to integers for practical use in alignments. PAM matrices are 20×20 symmetric tables that assign higher positive scores to conservative substitutions reflecting chemical or structural similarity, such as Ile to Val, while penalizing dissimilar changes. They were instrumental in early bioinformatics tools, including the program, which employed PAM matrices for rapid sequence similarity searches. Common variants include PAM120, suitable for general database searches with sequences around 40% identical, and PAM30 for detecting close homologs with higher identity levels. For illustration, a snippet of the PAM250 log-odds matrix highlights these patterns:
W (Trp)G (Gly)I (Ile)V (Val)
W (Trp)+11-4-3-3
G (Gly)-4+7-4-3
I (Ile)-3-4+5+4
V (Val)-3-3+4+4
These values, scaled by a factor of 10 from base-10 logarithms, emphasize self-matches like Trp-Trp at +11 and penalize unlikely changes like Trp-Gly at -4. Despite their foundational role, PAM matrices have limitations, including reliance on a relatively small of 71 families, which may not capture broader evolutionary diversity, and an assumption of uniform mutation rates across lineages without accounting for superimposed mutations in distant comparisons.

BLOSUM Matrices

The BLOSUM (BLOcks SUbstitution Matrix) family of substitution matrices was developed by Steven and Jorja Henikoff in 1992 as an empirical approach to scoring substitutions in protein alignments. Unlike earlier models that relied on evolutionary extrapolations from closely related sequences, BLOSUM matrices are derived directly from observed substitutions in conserved protein blocks extracted from the BLOCKS database, which contains aligned segments of related proteins without gaps. This method uses approximately 2,000 such blocks from over 500 groups of proteins, providing a robust empirical basis for scoring. To construct a BLOSUM matrix, sequences within each block are clustered using a at specified percent identity thresholds to reduce from overrepresented sequences; for example, BLOSUM62 clusters sequences sharing 62% or greater identity, while BLOSUM80 uses a stricter 80% threshold for closely related sequences. Observed frequencies of pairs in these clustered alignments are then used to compute log-odds scores, defined as sij=2log2(fijpipj)s_{ij} = 2 \log_2 \left( \frac{f_{ij}}{p_i p_j} \right), where fijf_{ij} is the observed frequency of substitution between ii and jj, and pip_i and pjp_j are their background frequencies, without any evolutionary extrapolation. This results in a 20×20 scaled in half-bit units, making it particularly effective for local alignments in tools like BLAST. In contrast to PAM matrices, which are based on global alignments of closely related homologs from smaller datasets, BLOSUM draws from diverse, local block alignments. Key features of BLOSUM matrices include positive scores for biologically likely substitutions and negative scores for unlikely ones, reflecting relative frequencies; for instance, the substitution of (D) for (E), which are chemically similar, scores +2, while leucine (L) to (I) also scores +2 due to their hydrophobic similarity. BLOSUM62 has become the default matrix in NCBI's BLAST program for protein searches, balancing sensitivity for distant relationships. Variants are tailored to evolutionary distances: BLOSUM45 is used for detecting more divergent proteins, while BLOSUM90 suits highly similar sequences by emphasizing subtle differences. An excerpt from the BLOSUM62 matrix illustrates these scores (rows and columns ordered alphabetically by single-letter codes: A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V):
ARNDCQEGHILKMFPSTWYV
A4-1-2-20-1-10-2-1-1-1-1-2-110-3-20
D-2-216-302-1-1-3-4-1-3-3-10-1-4-3-3
E-1002-425-20-3-31-2-3-10-1-3-2-2
I-1-3-3-3-1-3-3-4-342-310-3-2-1-3-13
L-1-2-3-4-1-2-3-4-324-220-3-2-1-2-11
The full matrix includes diagonal identity scores up to +11 (e.g., for ). Advantages of matrices stem from their large, diverse dataset of thousands of conserved blocks spanning various protein families, enabling better generalization across unrelated proteins compared to smaller, evolutionarily focused datasets in other matrices. This empirical grounding from real alignments enhances performance in detecting remote homologs in local search scenarios.

Differences Between PAM and BLOSUM

The Point Accepted Mutation (PAM) and Blocks Substitution Matrix () represent two foundational approaches to substitution scoring, differing fundamentally in their data foundations and methodological assumptions. PAM matrices are constructed from a small, manually curated set of 71 closely related protein families exhibiting at least 85% sequence identity, capturing 1,572 observed mutations to model evolutionary changes in closely related sequences. In contrast, BLOSUM matrices derive from a larger, automated compilation of approximately 2,000 conserved, ungapped alignment blocks from over 500 diverse protein families in the BLOCKS database, enabling broader representation of substitutions across varying evolutionary distances. A key methodological distinction lies in how these matrices handle evolutionary scaling. PAM matrices start with a base matrix (PAM1) derived from closely related sequences and extrapolate to greater distances through matrix powering—a process of repeated to simulate multiple substitutions over time, yielding variants like PAM250 for ~20% identity alignments. BLOSUM matrices, however, avoid such ; they are computed directly from clustered sequences at specific identity thresholds (e.g., BLOSUM62 from sequences clustered at ≤62% identity), producing empirically observed log-odds scores without relying on a model of evolution. This direct derivation makes BLOSUM more attuned to real-world alignment patterns in diverse datasets. Performance differences emerge prominently in practical applications, particularly for searches and alignments. BLOSUM matrices, especially BLOSUM62, demonstrate superior sensitivity in detecting distant homologs in the "twilight zone" of 20-35% sequence identity, with studies showing up to 20-30% improvements in alignment accuracy and false positive reduction compared to equivalent PAM matrices like PAM120. PAM matrices, by virtue of their evolutionary modeling focus, excel in phylogenetic reconstruction and global alignments of closely related sequences but underperform in local alignments of divergent proteins due to over-extrapolation artifacts. Additionally, PAM score distributions tend to be more negative overall, reflecting a toward penalizing mismatches in close relatives, whereas scores are tuned for local alignments, incorporating fewer gap penalties and emphasizing conserved blocks to better handle insertions/deletions. Selection guidelines favor BLOSUM62 as the default for general-purpose protein sequence comparisons, such as in BLAST searches, due to its balanced performance across evolutionary distances. PAM250, however, is preferred for evolutionary modeling and phylogeny inference where precise divergence estimation is needed. These choices align with their design intents: PAM for theoretical evolutionary simulation and for empirical search optimization. To illustrate score similarities and differences, the following table compares select entries from PAM250 and (both in log-odds units, scaled to approximate half-bits where applicable). Note the close alignment on highly conserved substitutions like Cys-Cys, reflecting shared reliance on observed frequencies, while divergences appear in less conserved pairs.
SubstitutionPAM250 ScoreBLOSUM62 Score
Cys-Cys99
Ala-Ala14
Trp-Trp1111
Ala-Ser11
Cys-Trp-8-2

Recent Developments

Since 2020, advancements in substitution matrices have increasingly incorporated structural data, , and domain-specific adaptations to enhance accuracy in and evolutionary analysis, building briefly on foundational and PAM approaches. A notable structure-aware development is the 3Di substitution matrix, introduced in 2024, which leverages three-dimensional coordinates from the (PDB) to model substitutions based on spatial proximities in protein structures. This matrix facilitates structural by enabling phylogenetic inference directly from 3D alignments, addressing limitations in sequence-based methods for deeply divergent proteins. Applications include resolving challenging evolutionary relationships in protein families where sequence similarity is low, providing a foundation for revisiting complex phylogenetic problems. Machine learning integrations have advanced with AlphaFold2-based substitution matrices, such as those developed in 2024 for assessing pathogenic impacts of motif-specific substitutions in short linear motifs (SLiMs). These matrices use AlphaFold2-predicted structures combined with FoldX-derived stability changes (ΔΔG) to generate motif-domain binding scores, filtering variants from ClinVar and gnomAD databases. For instance, the MotSASi framework achieves 97% accuracy in predicting deleterious single substitutions, outperforming methods like AlphaMissense with an F1-score of 0.98 and Matthews of 0.78 across 2,335 variants, expanding high-confidence SLiM coverage by 22-fold. Reduced-alphabet substitution matrices, also emerging in 2025, group the 20 standard into smaller classes (typically 10-15) based on physicochemical properties and evolutionary conservation to trace ancient coding alphabets in modern proteins. These models depart from fixed alphabets by incorporating (aaRS) informed substitutions, enabling reconstruction of phylogenies from ancient protein datasets and revealing timelines more aligned with Earth's geological . Such approaches provide insights into early evolution without assuming a universal 20-amino-acid framework. Specialized variants include depth-dependent matrices distinguishing substitutions in buried versus surface residues, as explored in studies using structural abundance scores. These matrices predict cellular abundance changes from residue substitutions with high accuracy (Spearman correlation up to 0.73) when conditioned on status, aiding deleterious prediction in varied protein environments. For proteins with biased compositions, such as proteins, updated matrices account for transmembrane-specific frequencies, improving ortholog discrimination and clustering in AT/GC-rich genomes, as reviewed in with ongoing refinements. Examples of practical impact include the 2021 ProtSub matrix, which incorporates coevolutionary pairs from alignments and spatial filters (≤4.5 Å) to align sequences with s, yielding significant gains in homology detection for twilight-zone sequences across 4,184 CATH families—enhancing congruence with structure matches in over 2,000 cases. Overall, these innovations have improved alignment accuracy for distant homologs by up to 20% in benchmark tests. Looking ahead, future trends point to AI-driven dynamic matrices tailored for personalized genomics, integrating molecular dynamics simulations and deep learning to adapt substitutions to individual variant contexts, as seen in emerging frameworks like Dynamicasome for precision medicine applications.

Nucleotide Substitution Matrices

Standard Models

Standard nucleotide substitution matrices are simpler than their protein counterparts due to the limited alphabet of four bases (A, C, G, T), often serving as baselines for DNA or RNA sequence alignments in phylogenetic analyses. The most basic form is the identity matrix, a 4×4 diagonal matrix where exact matches receive a positive score (typically +1) and all mismatches are scored as 0 or a uniform negative value, emphasizing identical nucleotides while penalizing any substitution equally. This matrix is particularly useful for closely related sequences where conservation is high, and it is implemented by default in tools like Clustal Omega for nucleotide alignments. To account for evolutionary patterns, more refined match-mismatch matrices distinguish between transitions (substitutions between purines A↔G or pyrimidines C↔T, which occur more frequently) and transversions (cross-substitutions like A↔C, which are rarer and often more disruptive). In these models, transitions receive higher (less penalizing) scores than transversions—for instance, a common scheme assigns +5 to matches, +1 to transitions, and -4 to transversions—to better reflect observed mutation biases in DNA evolution. Such matrices improve alignment accuracy for moderately divergent sequences by reducing noise from unlikely changes. Empirical nucleotide substitution matrices extend these ideas by deriving scores from real data, such as aligned genomes or conserved genes, often incorporating models like HKY85, which allows unequal base frequencies and a (κ) for the transition/ rate ratio. The HKY85 model, originally developed for analysis, is integrated into phylogenetic software to generate context-aware matrices that capture heterogeneous substitution patterns across taxa. These matrices are constructed using log-odds principles adapted from protein alignments, where observed substitution frequencies from multiple sequence alignments of conserved regions are compared against background frequencies (e.g., via Bayesian integrals or pseudocounts) to yield scores for the 4×4 matrix, benefiting from the smaller for computational efficiency. For illustration, a representative 4×4 empirical-style substitution matrix might appear as follows, with positive scores for matches and transitions, and negatives for transversions (values rounded for clarity; actual scores vary by dataset):
AGCT
A+1+0.5-1-1
G+0.5+1-1-1
C-1-1+1+0.5
T-1-1+0.5+1
This example prioritizes purine/pyrimidine conservation while penalizing cross-type changes, derived from frequency counts in aligned sequences. These standard models are widely applied in tools like for DNA phylogeny reconstruction, where they facilitate tree building from multiple alignments by scoring evolutionary relatedness; their relative uniformity compared to matrices stems from the fewer possible states, limiting the need for extensive clustering or divergence-specific variants.

Specialized Variants

Specialized substitution matrices extend standard models by incorporating biological constraints such as transition-transversion biases, codon-level selection, RNA secondary structures, and context-specific patterns observed in recent viral epidemics. These variants address limitations in generic models by tailoring substitution probabilities to particular genomic contexts, improving accuracy in phylogenetic inference and evolutionary analysis. Transition-transversion matrices explicitly weight purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) transitions higher than purine-to-pyrimidine transversions, reflecting observed mutational biases in DNA evolution. This is achieved through parameters like kappa (κ), which scales the rate of transitions relative to transversions in extensions of the Jukes-Cantor model, such as the Hasegawa-Kishino-Yano (HKY85) model. In HKY85, the instantaneous rate matrix Q incorporates base frequencies π and κ, where off-diagonal entries for transitions are multiplied by κ, while transversions receive a base rate, capturing the twofold higher likelihood of transitions in many lineages. For example, assuming equal base frequencies (π_A = π_C = π_G = π_T = 0.25) and κ = 2, a simplified 4x4 rate matrix (scaled such that rows sum to zero) penalizes transversions as follows:

Q = [ [-1.0, 0.5, 0.25, 0.25], [0.5, -1.0, 0.25, 0.25], [0.25, 0.25, -1.0, 0.5], [0.25, 0.25, 0.5, -1.0] ]

Q = [ [-1.0, 0.5, 0.25, 0.25], [0.5, -1.0, 0.25, 0.25], [0.25, 0.25, -1.0, 0.5], [0.25, 0.25, 0.5, -1.0] ]

Here, transitions (e.g., A to G) have rate 0.5, while transversions (e.g., A to C) have rate 0.25, demonstrating the penalty. These matrices differ from substitution matrices in their smaller 4-letter , leading to simpler parameterizations despite similar log-odds construction principles. Codon-based matrices operate on a 64x64 (or 61x61, excluding stop codons) framework to model substitutions at the triplet level, distinguishing synonymous changes (which preserve ) from nonsynonymous ones (which alter them). The Goldman-Yang model (GY94) exemplifies this by integrating transition-transversion bias, codon usage frequencies, and physicochemical distances to constrain nonsynonymous rates, enabling estimation of the dN/dS ratio (ω), where ω > 1 indicates positive selection and ω < 1 purifying selection. This approach outperforms nucleotide-level models for detecting selective pressures in protein-coding genes, as it accounts for degenerate codon positions. RNA-specific matrices, such as RIBOSUM, adapt to structured non-protein-coding RNAs by incorporating base- interactions from secondary structures, using an expanded alphabet that includes paired symbols (e.g., Watson-Crick pairs like A-U). Derived empirically from aligned sequences with known structures, RIBOSUM matrices estimate substitution rates across single-stranded and paired regions, where compensatory changes in stems (e.g., G-C to A-U) are favored to maintain stability. These matrices, available at varying levels (e.g., RIBOSUM-45 for closely related sequences), enhance alignment accuracy and phylogenetic reconstruction for RNAs like tRNAs and rRNAs. In the 2020s, specialized models have emerged for viral genomes, particularly , integrating observed mutation spectra like C-to-U overrepresentation due to host editing. Time-irreversible substitution matrices for extend standard frameworks with direction-specific rates, fitting the virus's ∼1.1 × 10^{-3} substitutions/site/year rate and aiding variant tracking across millions of sequences. Codon models have similarly been applied to , revealing ω values near 1 in spike genes, indicating near-neutral drift with episodic positive selection at key sites. These variants find key applications in tracking , where codon-based dN/dS analyses detect adaptive mutations in pathogens like and , outperforming models by resolving synonymous constraints. For non-coding RNAs, RIBOSUM matrices improve secondary structure prediction and alignment in functional elements like ribozymes, offering advantages over standard models by preserving structural covariation and reducing misalignment in paired regions.

Terminology

Key Concepts

A substitution score, denoted as S(i,j)S(i,j), represents the numerical value assigned to the alignment of two residues ii and jj, reflecting the relative favorability of that substitution based on evolutionary observations or physicochemical properties. These scores are typically derived from empirical data and form the core of pairwise algorithms, where higher positive values indicate more likely or conservative alignments, while negative values penalize unlikely pairings. In protein , substitutions are classified as conservative when an is replaced by another with similar biochemical properties, such as (Asp) to (Glu), both negatively charged residues that preserve protein function and structure. Conversely, radical substitutions involve dissimilar residues, like Asp to (Val), where an acidic residue changes to a hydrophobic one, often disrupting stability or interactions and occurring less frequently in . The twilight zone refers to the range of sequence identity below 30%, typically 20-30%, where standard alignment methods struggle to detect homology due to insufficient signal from conserved residues, making substitution matrices essential for inferring distant relationships. Substitution matrices integrate with gap penalties in sequence alignment scoring by contributing match/mismatch scores to the total alignment score, while affine gap costs—consisting of an opening penalty and an extension penalty—account for insertions or deletions; the overall score is the sum of substitution scores minus these gap costs, optimized via dynamic programming to balance aligned residues against indels. This interaction ensures that alignments favor biologically plausible evolutionary events without over-penalizing structural variations. The Dayhoff mutation metric, foundational to early substitution models, quantifies evolutionary change as the observed frequency of accepted point mutations in closely related protein sequences, serving as a basis for estimating substitution probabilities. Block clustering is a technique used in deriving matrices like , where sequence segments from conserved protein blocks are grouped at specified identity thresholds (e.g., 62% for BLOSUM62) to reduce bias from closely related sequences and capture substitution patterns across diverse evolutionary distances. Evolutionary distance units, such as PAM (percent accepted ) units, measure as the expected number of accepted mutations per 100 residues, with 1 PAM unit corresponding to 1% change, allowing matrices to be scaled for different levels. Substitution scores in these matrices often take a log-odds form to compare observed alignments against random expectations.

Notation Conventions

In the literature on substitution matrices, the score assigned to the alignment of residues ii and jj is commonly denoted as S(i,j)S(i,j) or Score(i,j)\text{Score}(i,j), reflecting the log-odds ratio of observed substitution frequencies to expected random frequencies. The probability matrix, which captures the likelihood of one residue changing to another over evolutionary time, is typically represented as MM, with entries Mi,jM_{i,j}. Background frequencies of individual residues are standardly symbolized as QiQ_i or fif_i, denoting the probability of occurrence in a reference . These symbols originate from foundational models in protein and are widely adopted in bioinformatics tools and analyses. Substitution matrices are constructed as square N×NN \times N arrays, where N=20N=20 for the 20 standard and N=4N=4 for (A, C, G, T/U). Rows and columns follow a conventional ordering, most often alphabetical for amino acids: A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, ensuring consistent indexing across implementations. This structure facilitates direct lookup of scores during sequence comparison. Scaling conventions express scores in information-theoretic units, with natural log-odds often converted to bits via base-2 logarithms (log2\log_2) for interpretability; a factor of 2log22 \log_2 is frequently applied to produce half-bit scores, enabling positive values suitable for computation. In practical tools like BLAST, matrix entries are rounded to the nearest to optimize storage and performance while preserving relative scoring. Software and database implementations standardize matrix names for : BLOSUM variants are denoted as "blosumN" (e.g., "blosum62" for the matrix derived from blocks with ≤62% identity), while PAM matrices use "pamN" (e.g., "pam250" for 250 accepted mutations per 100 residues). These conventions appear in command-line options and configuration files for alignment programs. For instance, the substitution score between (Leu or L) and (Ile or I) in the BLOSUM62 matrix is notated as S(Leu,Ile)=+2S(\text{Leu}, \text{Ile}) = +2 or S(L,I)=+2S(\text{L}, \text{I}) = +2, illustrating how three-letter or one-letter codes index the matrix entries. Such notation supports precise referencing in alignment algorithms for scoring pairwise matches.

References

Add your contribution
Related Hubs
User Avatar
No comments yet.