Recent from talks
Nothing was collected or created yet.
Substitution matrix
View on WikipediaThis article includes a list of general references, but it lacks sufficient corresponding inline citations. (October 2009) |
In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.[1]
Background
[edit]In the process of evolution, from one generation to the next the amino acid sequences of an organism's proteins are gradually altered through the action of DNA mutations. For example, the sequence
ALEIRYLRD
could mutate into the sequence
ALEINYLRD
in one step, and possibly
AQEINYQRD
over a longer period of evolutionary time. Each amino acid is more or less likely to mutate into various other amino acids. For instance, a hydrophilic residue such as arginine is more likely to be replaced by another hydrophilic residue such as glutamine, than it is to be mutated into a hydrophobic residue such as leucine. (Here, a residue refers to an amino acid stripped of a hydrogen and/or a hydroxyl group and inserted in the polymeric chain of a protein.) This is primarily due to redundancy in the genetic code, which translates similar codons into similar amino acids. Furthermore, mutating an amino acid to a residue with significantly different properties could affect the folding and/or activity of the protein. This type of disruptive substitution is likely to be removed from populations by the action of purifying selection because the substitution has a higher likelihood of rendering a protein nonfunctional.[2]
If we have two amino acid sequences in front of us, we should be able to say something about how likely they are to be derived from a common ancestor, or homologous. If we can line up the two sequences using a sequence alignment algorithm such that the mutations required to transform a hypothetical ancestor sequence into both of the current sequences would be evolutionarily plausible, then we'd like to assign a high score to the comparison of the sequences.
To this end, we will construct a 20x20 matrix where the th entry is equal to the probability of the th amino acid being transformed into the th amino acid in a certain amount of evolutionary time. There are many different ways to construct such a matrix, called a substitution matrix. Here are the most commonly used ones:
Identity matrix
[edit]The simplest possible substitution matrix would be one in which each amino acid is considered maximally similar to itself, but not able to transform into any other amino acid. This matrix would look like
This identity matrix will succeed in the alignment of very similar amino acid sequences but will be miserable at aligning two distantly related sequences. We need to figure out all the probabilities in a more rigorous fashion. It turns out that an empirical examination of previously aligned sequences works best.
Log-odds matrices
[edit]We express the probabilities of transformation in what are called log-odds scores. The scores matrix S is defined as
where is the probability that amino acid transforms into amino acid , and , are the frequencies of amino acids i and j. The base of the logarithm is not important, and the same substitution matrix is often expressed in different bases.
Example amino-acid matrices
[edit]
PAM
[edit]One of the first amino acid substitution matrices, the PAM (Point Accepted Mutation) matrix was developed by Margaret Dayhoff in the 1970s. This matrix is calculated by observing the differences in closely related proteins. Because the use of very closely related homologs, the observed mutations are not expected to significantly change the common functions of the proteins. Thus the observed substitutions (by point mutations) are considered to be accepted by natural selection.
One PAM unit is defined as 1% of the amino acid positions that have been changed. To create a PAM1 substitution matrix, a group of very closely related sequences with mutation frequencies corresponding to one PAM unit is chosen. Based on collected mutational data from this group of sequences, a substitution matrix can be derived. This PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed.
The PAM1 matrix is used as the basis for calculating other matrices by assuming that repeated mutations would follow the same pattern as those in the PAM1 matrix, and multiple substitutions can occur at the same site. With this assumption, the PAM2 matrix can estimated by squaring the probabilities. Using this logic, Dayhoff derived matrices as high as PAM250. Usually the PAM 30 and the PAM70 are used.
BLOSUM
[edit]Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences. Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales. The BLOSUM (BLOck SUbstitution Matrix) series of matrices rectifies this problem. Henikoff & Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins. The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments. These conserved sequences are assumed to be of functional importance within related proteins and will therefore have lower substitution rates than less conserved regions. To reduce bias from closely related sequences on substitution rates, segments in a block with a sequence identity above a certain threshold were clustered, reducing the weight of each such cluster (Henikoff and Henikoff). For the BLOSUM62 matrix, this threshold was set at 62%. Pairs frequencies were then counted between clusters, hence pairs were only counted between segments less than 62% identical. One would use a higher numbered BLOSUM matrix for aligning two closely related sequences and a lower number for more divergent sequences.
It turns out that the BLOSUM62 matrix does an excellent job detecting similarities in distant sequences, and this is the matrix used by default in most recent alignment applications such as BLAST.
It also turns out the BLOSUM computer code written by Henikoff and Henikoff does not exactly match the description in their paper. Surprisingly, this commonly used "wrong" version has better search performance.[3]
Differences between PAM and BLOSUM
[edit]- PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the branches of a phylogenetic tree: maximum parismony), whereas the BLOSUM matrices are based on an implicit model of evolution.
- The PAM matrices are based on mutations observed throughout a global alignment, this includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps.
- The method used to count the replacements is different: unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same.
- Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than BLOSUM50.
Newer matrices
[edit]A number of newer substitution matrices have been proposed to deal with inadequacies in earlier designs.
- JTT (1992). Published in the same year as BLOSOM, it also performs clustering and uses an implicit model. This may help reduce the systematic error from maximum parismony (MP), but also wastes sequence information.[4]
- VTML (2001), a PAM-like matrix based on the alignments in the SYSTERS database, iteratively improved using a maximum likelihood estimator starting from the 1970s Dayhoff PAM model.[5]
- WAG (Wheelan And Goldman, 2001) uses a maximum likelihood estimating procedure instead of any form of MP over a "BRKALN" dataset. The substitution scores are calculated based on the likelihood of a change considering multiple tree topologies derived using neighbor-joining. The scores correspond to an substitution model which includes also amino-acid stationary frequencies and a scaling factor in the similarity scoring. There are two versions of the matrix: WAG matrix based on the assumption of the same amino-acid stationary frequencies across all the compared protein and WAG* matrix with different frequencies for each of included protein families.[4]
- PMB (Probability Matrix from Blocks, 2003), a set of "true" substitution frequencies estimated from the observed frequencies of BLOSUM, taking into account the possibility of a later substitution masking a previous one. It thus creates a evolutionary model where the distances have theoretical meaning (BLOSUM does not have this feature, unlike PAM, WAG, and most other later matrices, and hence is not recommended for phylogeny by IQ-TREE).[6]
- LG (2008), which uses a larger dataset (Pfam-based) than WAG. An extension of the WAG algorithm is used, with a new PhyML (WAG+Γ4) model taking into account of sites with different evolutionary rates.[7]
- Qmaker and nQmaker (2021, 2022), programs with the ability to estimate time-reversible and nonreversible matrices from very large datasets quickly. Each provide a general matrix and 5 specialized matrices, for a total of 12 precalculated substitution matrices.[8][9]
- Matrices using a selection of proteins based on structual relatedness, as proposed by Benner et al. (1994), Fan (2004), and Steven et al. (2004).[5]
- Matrices using structual alignments of proteins instead of simple sequence alignment (6 separate publications).[5]
- Matrices using known physiochemical parameters of amino acid residues (5 separate publications).[5]
For a list of more models (including irreversible i.e. asymmetric and/or specialized ones), see the documentation for recent bioinformatic software including IQ-Tree,[10] PhyML,[11] and RAxML.[12]
Specialized substitution matrices and their extensions
[edit]The real substitution rates in a protein depends not only on the identity of the amino acid, but also on the specific structural or sequence context it is in. Many specialized matrices have been developed for these contexts, such as in transmembrane alpha helices,[13] for combinations of secondary structure states and solvent accessibility states,[14][15][16] or for local sequence-structure contexts.[17] These context-specific substitution matrices lead to generally improved alignment quality at some cost of speed but are not yet widely used.
Since the 2000s, an increasing amount of matrices are defined for subsets of proteins not optimally aligned by traditional "general-purpose" matrices. These include:[5]
- PfSSM (2008), CBM and CCF (2008) for Plasmodium proteins, which have a different amino acid evolutionary bias due to the low GC content of the genome.
- Matrices for transmembrane proteins. JTT transmembrane (1994) is the first of the class. Later work include:
- For alpha-helical transmembrane proteins, PHAT (2000) and SLIM (2001).
- For beta-barrel transmembrane proteins, bbTM (2008).
- Matrices for a specific protein family, including GPCRtm (2015) for the transmembrane (mostly helical) regions of GPCRs.
- Matrices for proteins with a specific role, including Hubsm (2017) for "hub proteins" in protein‐protein interaction networks.
- Matrices for intrinsically disordered proteins, including DUNMat (2002), MidicMat (2009), Disorder (2010), and EDSSMat (2019).
Recently, sequence context-specific amino acid similarities have been derived that do not need substitution matrices but that rely on a library of sequence contexts instead. Using this idea, a context-specific extension of the popular BLAST program has been demonstrated to achieve a twofold sensitivity improvement for remotely related sequences over BLAST at similar speeds (CS-BLAST).
Nucleotide matrices
[edit]With nucleotides only having four possible values (in most bioinformatic sequences), the emphasis lies not in setting fixed values in the matrix, but in designing parameterized models that fit the observed evolution of the input sequence as it's being aligned. See Models of DNA evolution.
Terminology
[edit]Although "transition matrix" is often used interchangeably with "substitution matrix" in fields other than bioinformatics, the former term is problematic in bioinformatics. With regards to nucleotide substitutions, "transition" is also used to indicate those substitutions that are between the two-ring purines (A → G and G → A) or are between the one-ring pyrimidines (C → T and T → C). Because these substitutions do not require a change in the number of rings, they occur more frequently than the other substitutions. "Transversion" is the term used to indicate the slower-rate substitutions that change a purine to a pyrimidine or vice versa (A ↔ C, A ↔ T, G ↔ C, and G ↔ T).
See also
[edit]References
[edit]- ^ Zvelebil, Marketa J. (2008). Understanding bioinformatics. New York: Garland Science. pp. 117–127, 747. ISBN 978-0-8153-4024-9.
- ^ Xiong, Jin (2006). Essential Bioinformatics. Cambridge: Cambridge University Press. doi:10.1017/cbo9780511806087.004. ISBN 978-0-511-80608-7.
- ^ Styczynski, Mark P; Jensen, Kyle L; Rigoutsos, Isidore; Stephanopoulos, Gregory (March 2008). "BLOSUM62 miscalculations improve search performance". Nature Biotechnology. 26 (3): 274–275. doi:10.1038/nbt0308-274. PMID 18327232. S2CID 205266180.
- ^ a b Whelan, Simon; Goldman, Nick (1 May 2001). "A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach". Molecular Biology and Evolution. 18 (5): 691–699. doi:10.1093/oxfordjournals.molbev.a003851. ISSN 0737-4038. PMID 11319253.
- ^ a b c d e Trivedi, R; Nagarajaram, HA (November 2020). "Substitution scoring matrices for proteins - An overview". Protein Science. 29 (11): 2150–2163. doi:10.1002/pro.3954. PMC 7586916. PMID 32954566.
- ^ Veerassamy, Shalini; Smith, Andrew; Tillier, Elisabeth R. M. (December 2003). "A Transition Probability Model for Amino Acid Substitutions from Blocks". Journal of Computational Biology. 10 (6): 997–1010. doi:10.1089/106652703322756195. PMID 14980022.
- ^ Le, S. Q.; Gascuel, O. (3 April 2008). "An Improved General Amino Acid Replacement Matrix". Molecular Biology and Evolution. 25 (7): 1307–1320. doi:10.1093/molbev/msn067. PMID 18367465.
- ^ Minh, Bui Quang; Dang, Cuong Cao; Vinh, Le Sy; Lanfear, Robert (11 August 2021). "QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution". Systematic Biology. 70 (5): 1046–1060. doi:10.1093/sysbio/syab010. PMC 8357343. PMID 33616668.
- ^ Dang, Cuong Cao; Minh, Bui Quang; McShea, Hanon; Masel, Joanna; James, Jennifer Eleanor; Vinh, Le Sy; Lanfear, Robert (10 August 2022). "nQMaker: Estimating Time Nonreversible Amino Acid Substitution Models". Systematic Biology. 71 (5): 1110–1123. doi:10.1093/sysbio/syac007. PMC 9366462. PMID 35139203.
- ^ "Substitution Models". iqtree.github.io.
- ^ "phyml/doc/phyml-manual.pdf at master · stephaneguindon/phyml" (PDF). GitHub.
- ^ Stamatakis, Alexandros (July 20, 2016). "The RAxML v8.2.X Manual" (PDF).
- ^ Müller, T; Rahmann, S; Rehmsmeier, M (2001). "Non-symmetric score matrices and the detection of homologous transmembrane proteins". Bioinformatics. 17 (Suppl 1): S182–9. doi:10.1093/bioinformatics/17.suppl_1.s182. PMID 11473008.
- ^ Rice, DW; Eisenberg, D (1997). "A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence". Journal of Molecular Biology. 267 (4): 1026–38. CiteSeerX 10.1.1.44.1143. doi:10.1006/jmbi.1997.0924. PMID 9135128.
- ^ Gong, Sungsam; Blundell, Tom L. (2008). Levitt, Michael (ed.). "Discarding functional residues from the substitution table improves predictions of active sites within three-dimensional structures". PLOS Computational Biology. 4 (10) e1000179. Bibcode:2008PLSCB...4E0179G. doi:10.1371/journal.pcbi.1000179. PMC 2527532. PMID 18833291.
- ^ Goonesekere, NC; Lee, B (2008). "Context-specific amino acid substitution matrices and their use in the detection of protein homologs". Proteins. 71 (2): 910–9. doi:10.1002/prot.21775. PMID 18004781. S2CID 27443393.
- ^ Huang, YM; Bystroff, C (2006). "Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions". Bioinformatics. 22 (4): 413–22. doi:10.1093/bioinformatics/bti828. PMID 16352653.
Further reading
[edit]- Altschul, SF (1991). "Amino acid substitution matrices from an information theoretic perspective". Journal of Molecular Biology. 219 (3): 555–65. doi:10.1016/0022-2836(91)90193-A. PMC 7130686. PMID 2051488.
- Dayhoff, M. O.; Schwartz, R. M.; Orcutt, B. C. (1978). "A model of evolutionary change in proteins". Atlas of Protein Sequence and Structure. 5 (3): 345–352.
- Eddy, SR (2004). "Where did the BLOSUM62 alignment score matrix come from?". Nature Biotechnology. 22 (8): 1035–6. doi:10.1038/nbt0804-1035. PMID 15286655. S2CID 205269887.
- Henikoff, S; Henikoff, JG (1992). "Amino acid substitution matrices from protein blocks". Proceedings of the National Academy of Sciences of the United States of America. 89 (22): 10915–9. Bibcode:1992PNAS...8910915H. doi:10.1073/pnas.89.22.10915. PMC 50453. PMID 1438297.
External links
[edit]Substitution matrix
View on GrokipediaBackground
Definition and Purpose
A substitution matrix is a square array of scores that quantifies the favorability or likelihood of one biological residue, such as an amino acid or nucleotide, replacing another during evolutionary processes in aligned sequences.[3] These scores are derived from observed substitution frequencies in related sequences, reflecting the relative ease of mutations based on chemical, physical, and biological properties of the residues.[1] The primary purpose of substitution matrices is to score alignments in bioinformatics algorithms, such as BLAST and Smith-Waterman, by assigning positive values to conservative substitutions that preserve function and negative values to unlikely ones, thus distinguishing evolutionarily related sequences from random similarities. This enables the detection of distant homologs by modeling evolutionary relatedness rather than exact matches.[1] A common implementation is the log-odds matrix, which compares observed substitution probabilities to those expected under random chance.[1] Substitution matrices improve the accuracy of pairwise and multiple sequence alignments, which form the basis for phylogenetic analysis by informing substitution models in tree construction.[4] They also support protein structure prediction by enhancing homology detection and template alignment in comparative modeling approaches.[1] The concept originated in the 1970s for modeling protein evolution through empirical substitution data.Historical Development
The development of substitution matrices began in the 1970s with pioneering work by Margaret Dayhoff and her colleagues at the National Biomedical Research Foundation, who analyzed protein mutation probabilities to create the first quantitative models for amino acid substitutions.[1] Dayhoff's approach focused on estimating evolutionary changes through observed mutations in closely related protein sequences, laying the groundwork for matrices that could score alignments based on biological realism rather than arbitrary penalties.[5] This effort culminated in the publication of the Atlas of Protein Sequence and Structure in 1978, a comprehensive compendium that included the initial Point Accepted Mutation (PAM) matrices derived from 1,572 mutations across 71 protein families, marking a key milestone in bioinformatics by standardizing protein evolution modeling.[6] In the 1980s and 1990s, advancements shifted toward more robust matrices suited for diverse evolutionary distances and alignment types. A significant innovation came in 1992 when Steven and Jorja Henikoff introduced the BLOcks SUbstitution Matrix (BLOSUM) series, derived from conserved blocks in protein families to capture local alignment patterns, contrasting with the global alignments emphasized in earlier models.[7] This methodology improved sensitivity for detecting distant homologs, and by the late 1990s, BLOSUM62 emerged as the default scoring matrix in the Basic Local Alignment Search Tool (BLAST), enhancing the accuracy of database searches for protein similarities. The 2000s saw expansions into context-specific substitution matrices tailored to structural or functional features, such as secondary structures, to refine alignment scores in specialized scenarios.[8] Concurrently, nucleotide substitution matrices gained prominence in molecular evolution studies, with models like the general time-reversible (GTR) matrix enabling better estimation of substitution rates across DNA sequences for phylogenetic inference.[9] Post-2010 trends integrated structural data from the Protein Data Bank (PDB) with machine learning techniques to develop structure-aware matrices that account for three-dimensional conformations.[10] Recent developments from 2020 to 2025, inspired by deep learning advancements like AlphaFold for protein structure prediction, have further advanced this by incorporating predicted folds into substitution scoring for improved evolutionary modeling.[11] Notable examples include the 3Di matrix, introduced in 2025, which uses structural alignments for phylogenetics to resolve deep evolutionary relationships previously intractable with sequence-based methods alone.[12]Basic Substitution Matrices
Identity Matrix
The identity matrix represents the most basic substitution matrix in bioinformatics, assigning positive scores to identical residues and zero or negative scores to non-identical ones, thereby prioritizing exact matches without considering any biological context for changes.[13] It is structured as a diagonal matrix, where the main diagonal elements are positive (commonly +1) and all off-diagonal elements are zero or negative (e.g., 0 or -1).[14] For amino acid sequence alignment, the identity matrix is a 20×20 symmetric matrix with +1 scores on the diagonal for the 20 standard amino acids and 0 elsewhere, ensuring that only exact matches contribute positively to the alignment score.[15] Similarly, for nucleotide sequences, it takes the form of a 4×4 matrix over the alphabet {A, T, G, C}, with the same diagonal pattern. An example of this nucleotide identity matrix is shown below, where matches score +1 and mismatches score 0:| A | T | G | C | |
|---|---|---|---|---|
| A | 1 | 0 | 0 | 0 |
| T | 0 | 1 | 0 | 0 |
| G | 0 | 0 | 1 | 0 |
| C | 0 | 0 | 0 | 1 |
Simple Scoring Schemes
Simple scoring schemes in sequence alignment predate empirical substitution matrices and rely on fixed, rule-based assignments rather than observed evolutionary frequencies. One of the most basic approaches is the match-mismatch scoring system, where identical residues receive a fixed positive score, such as +5, and any non-identical substitution incurs a uniform negative penalty, such as -4, regardless of the specific residues involved.[17] This scheme treats all matches equally and all mismatches equivalently, simplifying the alignment process without considering biological context.[18] Another rudimentary method involves chemical similarity schemes, which assign scores based on ad-hoc assessments of physicochemical properties like polarity, volume, or composition. For instance, Grantham's 1974 distance formula calculates differences between amino acids using three properties—side-chain composition, polarity, and volume—to derive a dissimilarity score, with lower values indicating more similar residues suitable for conservative substitutions.[19] Similarly, Miyata et al. in 1979 proposed a matrix distinguishing two substitution types: conservative changes preserving volume and polarity, scored more favorably, and radical changes disrupting these properties, scored lower. These schemes aim to reflect biochemical compatibility but rely on heuristic groupings rather than data-driven probabilities. Such simple scoring approaches were prevalent in early computational alignments during the 1960s and 1970s, before the advent of empirical matrices. The Needleman-Wunsch algorithm, introduced in 1970, employed a basic match score of +1 for identical amino acids and effectively 0 for mismatches, focusing on maximizing the number of matches in global alignments of proteins like cytochrome c and hemoglobins.[20] This era's methods, including manual and initial automated alignments, often incorporated physicochemical heuristics to guide substitutions, as limited sequence data precluded statistical modeling. The identity matrix can be viewed as a special case of match-mismatch scoring where mismatches receive a score of 0. These schemes offer advantages in computational simplicity and speed, making them suitable for initial implementations on limited hardware of the time, and they provide an intuitive framework for penalizing dissimilar alignments. However, they suffer significant drawbacks by ignoring varying evolutionary substitution rates—such as the higher likelihood of transitions over transversions in nucleotides or conservative changes in proteins—leading to suboptimal detection of distant homologs compared to empirical matrices like PAM or BLOSUM.[1] For nucleotides, a typical simple scoring table applies a uniform mismatch penalty, as shown below:| A | C | G | T | |
|---|---|---|---|---|
| A | +1 | -4 | -4 | -4 |
| C | -4 | +1 | -4 | -4 |
| G | -4 | -4 | +1 | -4 |
| T | -4 | -4 | -4 | +1 |
Construction of Log-Odds Matrices
Theoretical Principles
Substitution matrices are constructed from observed frequencies of amino acid or nucleotide substitutions in closely related, aligned biological sequences, capturing the evolutionary acceptability of such changes by quantifying how often specific pairs align compared to random expectation.[21] These frequencies reflect patterns of molecular evolution, where substitutions that preserve protein function or structure occur more frequently than those that disrupt it.[1] The scoring of substitutions relies on comparing target frequencies—derived from aligned sequences—to background frequencies, which represent the expected occurrence of residues in unrelated sequences. This approach leverages mutual information between residue pairs, measuring the dependency or shared information that indicates evolutionary relatedness beyond chance.[21] Substitutions are thus scored relative to random chance, with positive scores for likely evolutionary alignments and negative for improbable ones.[1] Underlying these matrices are evolutionary models that assume a Markov process for mutations, where the probability of a substitution depends only on the current state and accumulates over time without memory of prior changes.[1] This framework distinguishes conservative substitutions, which involve chemically similar residues and occur more frequently to maintain functional constraints, from radical ones that alter properties and are rarer.[21] The general scoring function introduces a log-odds ratio, expressed as , where is the joint frequency of residues and in alignments, and , are their marginal frequencies; this provides a probabilistic measure of substitution likelihood.[21] In sequence alignment algorithms, these scores integrate into dynamic programming frameworks, such as the Needleman-Wunsch method, to evaluate pairwise matches against gap penalties and optimize overall alignment scores for detecting homology.[1] Empirical implementations like PAM and BLOSUM matrices apply this theoretical foundation to real data for practical use in bioinformatics tools.[21]Mathematical Derivation
The construction of log-odds substitution matrices begins with defining key probabilities derived from empirical data. The target frequency represents the observed probability that residues and are aligned in a set of biologically related sequences, estimated by counting aligned pairs across multiple alignments and normalizing by the total number of aligned pairs.[22] The background frequency (or ) is the marginal probability of residue occurring in unrelated or random sequences, typically computed as the average frequency across a large reference set of sequences, such as .[22] The core log-odds score quantifies the relative likelihood of observing the substitution to under an evolutionary model versus random chance. It is derived as the logarithm of the odds ratio: where the denominator assumes independence between residues under the null (random) model, and is a scaling constant.[22] This formula arises from the likelihood ratio test in statistical alignment: the probability of the data given related sequences () divided by the probability under independence (), with the log ensuring additivity of scores over sequence positions.[23] Variants include the natural logarithm () for computational convenience, base-10 logarithm scaled by 10 (as in early models to yield integer-like values around 0 to 20), or base-2 logarithm for scores in bits (information-theoretic units).[6] For practical use, scaling to full bits () or half-bits () followed by rounding to integers ensures efficient storage and negative expected scores for random alignments (typically around -0.5 to -4 bits per position).[22] Normalization ensures the matrix's utility in alignment algorithms. Symmetry () is achieved because is undirected in pairwise counts from alignments, though asymmetric forms exist for directed evolutionary models. Diagonals are positive since matches () are more probable than random, reflecting conservation. For low-frequency substitutions where , pseudocounts (small uniform additions, e.g., to counts) are incorporated to prevent undefined logs and smooth estimates: adjusted for a 20-residue alphabet, reducing overfitting in sparse data.[23] The derivation relies on several assumptions. Positions in sequences evolve independently, allowing additive log scores without inter-position dependencies. Background frequencies are stationary, remaining constant over evolutionary time despite overall composition shifts. Time-scaling accounts for evolutionary distance: target frequencies are specific to a divergence level (e.g., 1% accepted mutations), and for greater distances, a base mutation probability matrix is raised to a power (e.g., via matrix exponentiation) before computing logs, modeling progressive divergence.[22] These assumptions enable the matrix to approximate a continuous-time Markov process for substitutions.[6] Computationally, the matrix is built in four main steps. First, curate and align closely related sequences (e.g., proteins sharing >85% identity) to capture accepted mutations. Second, count aligned pairs: for each column pair in the alignments, increment . Third, estimate frequencies: and , optionally applying pseudocounts. Fourth, compute scores: and round to integers.[22] As an illustrative example, consider a toy alignment dataset with a reduced alphabet {A, B} and 100 aligned pairs yielding counts: 60 AA, 10 AB, 10 BA, 20 BB. Then, Using natural log scaling (), This shows positive matches and negative mismatches.[23] Pseudocode for the derivation (in Python-like syntax) is as follows:# Step 1-2: Align and count (simplified)
counts = initialize_matrix(alphabet_size, zero=True)
for alignment in related_alignments:
for pos1, pos2 in pairwise_positions(alignment):
i, j = residue_at(pos1), residue_at(pos2)
counts[i][j] += 1
counts[j][i] += 1 # For symmetry
# Step 3: Estimate frequencies with pseudocounts (alpha=0.0001)
total_pairs = sum_all(counts)
alpha = 0.0001 * len([alphabet](/page/Alphabet))
for i in [alphabet](/page/Alphabet):
for j in [alphabet](/page/Alphabet):
M[i][j] = (counts[i][j] + alpha) / (total_pairs + len([alphabet](/page/Alphabet))**2 * alpha)
Q[i] = sum_j M[i][j]
# Step 4: Compute log-odds (k=2 / ln(2) for half-bits)
import math
k = 2 / math.log(2)
for i in alphabet:
for j in alphabet:
if M[i][j] > 0:
S[i][j] = round(k * math.log(M[i][j] / (Q[i] * Q[j])))
else:
S[i][j] = round(k * math.log(1e-10)) # Floor for zeros
# Step 1-2: Align and count (simplified)
counts = initialize_matrix(alphabet_size, zero=True)
for alignment in related_alignments:
for pos1, pos2 in pairwise_positions(alignment):
i, j = residue_at(pos1), residue_at(pos2)
counts[i][j] += 1
counts[j][i] += 1 # For symmetry
# Step 3: Estimate frequencies with pseudocounts (alpha=0.0001)
total_pairs = sum_all(counts)
alpha = 0.0001 * len([alphabet](/page/Alphabet))
for i in [alphabet](/page/Alphabet):
for j in [alphabet](/page/Alphabet):
M[i][j] = (counts[i][j] + alpha) / (total_pairs + len([alphabet](/page/Alphabet))**2 * alpha)
Q[i] = sum_j M[i][j]
# Step 4: Compute log-odds (k=2 / ln(2) for half-bits)
import math
k = 2 / math.log(2)
for i in alphabet:
for j in alphabet:
if M[i][j] > 0:
S[i][j] = round(k * math.log(M[i][j] / (Q[i] * Q[j])))
else:
S[i][j] = round(k * math.log(1e-10)) # Floor for zeros
Amino Acid Substitution Matrices
PAM Matrices
Point Accepted Mutation (PAM) matrices, the first empirical substitution matrices for proteins, were developed by Margaret Dayhoff and colleagues in 1978 using data from 71 closely related protein families spanning 34 superfamilies, with a total of 1,572 observed accepted point mutations.[24] These matrices model evolutionary changes based on the concept of "accepted point mutations," where 1 PAM (PAM1) represents an average evolutionary distance at which 1% of amino acids have changed.[24] The construction begins with a mutation probability matrix (M) derived from alignments of closely related sequences, where entries indicate the probability that amino acid mutates to over 1 PAM unit, accounting for relative mutabilities (e.g., asparagine at 134, tryptophan at 18) and amino acid frequencies (e.g., glycine at 0.089, tryptophan at 0.010).[24] For greater evolutionary distances, the PAMn matrix is obtained by raising the PAM1 matrix to the power via matrix multiplication, allowing extrapolation to scenarios like PAM250, where sequences have diverged by 250% (retaining about 20% identity).[24] This probability matrix is then converted to a log-odds scoring matrix, where scores are , with as the background frequency of , rounded to integers for practical use in alignments.[24] PAM matrices are 20×20 symmetric tables that assign higher positive scores to conservative substitutions reflecting chemical or structural similarity, such as Ile to Val, while penalizing dissimilar changes.[24] They were instrumental in early bioinformatics tools, including the FASTA program, which employed PAM matrices for rapid sequence similarity searches. Common variants include PAM120, suitable for general database searches with sequences around 40% identical, and PAM30 for detecting close homologs with higher identity levels.[24] For illustration, a snippet of the PAM250 log-odds matrix highlights these patterns:| W (Trp) | G (Gly) | I (Ile) | V (Val) | |
|---|---|---|---|---|
| W (Trp) | +11 | -4 | -3 | -3 |
| G (Gly) | -4 | +7 | -4 | -3 |
| I (Ile) | -3 | -4 | +5 | +4 |
| V (Val) | -3 | -3 | +4 | +4 |
BLOSUM Matrices
The BLOSUM (BLOcks SUbstitution Matrix) family of substitution matrices was developed by Steven and Jorja Henikoff in 1992 as an empirical approach to scoring amino acid substitutions in protein sequence alignments. Unlike earlier models that relied on evolutionary extrapolations from closely related sequences, BLOSUM matrices are derived directly from observed substitutions in conserved protein blocks extracted from the BLOCKS database, which contains aligned segments of related proteins without gaps. This method uses approximately 2,000 such blocks from over 500 groups of proteins, providing a robust empirical basis for scoring.[25] To construct a BLOSUM matrix, sequences within each block are clustered using a greedy algorithm at specified percent identity thresholds to reduce bias from overrepresented sequences; for example, BLOSUM62 clusters sequences sharing 62% or greater identity, while BLOSUM80 uses a stricter 80% threshold for closely related sequences. Observed frequencies of amino acid pairs in these clustered alignments are then used to compute log-odds scores, defined as , where is the observed frequency of substitution between amino acids and , and and are their background frequencies, without any evolutionary extrapolation. This results in a 20×20 symmetric matrix scaled in half-bit units, making it particularly effective for local alignments in tools like BLAST. In contrast to PAM matrices, which are based on global alignments of closely related homologs from smaller datasets, BLOSUM draws from diverse, local block alignments.[25] Key features of BLOSUM matrices include positive scores for biologically likely substitutions and negative scores for unlikely ones, reflecting relative frequencies; for instance, the substitution of aspartic acid (D) for glutamic acid (E), which are chemically similar, scores +2, while leucine (L) to isoleucine (I) also scores +2 due to their hydrophobic similarity. BLOSUM62 has become the default matrix in NCBI's BLAST program for protein searches, balancing sensitivity for distant relationships. Variants are tailored to evolutionary distances: BLOSUM45 is used for detecting more divergent proteins, while BLOSUM90 suits highly similar sequences by emphasizing subtle differences.[25][26] An excerpt from the BLOSUM62 matrix illustrates these scores (rows and columns ordered alphabetically by single-letter amino acid codes: A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V):| A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 4 | -1 | -2 | -2 | 0 | -1 | -1 | 0 | -2 | -1 | -1 | -1 | -1 | -2 | -1 | 1 | 0 | -3 | -2 | 0 |
| D | -2 | -2 | 1 | 6 | -3 | 0 | 2 | -1 | -1 | -3 | -4 | -1 | -3 | -3 | -1 | 0 | -1 | -4 | -3 | -3 |
| E | -1 | 0 | 0 | 2 | -4 | 2 | 5 | -2 | 0 | -3 | -3 | 1 | -2 | -3 | -1 | 0 | -1 | -3 | -2 | -2 |
| I | -1 | -3 | -3 | -3 | -1 | -3 | -3 | -4 | -3 | 4 | 2 | -3 | 1 | 0 | -3 | -2 | -1 | -3 | -1 | 3 |
| L | -1 | -2 | -3 | -4 | -1 | -2 | -3 | -4 | -3 | 2 | 4 | -2 | 2 | 0 | -3 | -2 | -1 | -2 | -1 | 1 |
Differences Between PAM and BLOSUM
The Point Accepted Mutation (PAM) and Blocks Substitution Matrix (BLOSUM) represent two foundational approaches to amino acid substitution scoring, differing fundamentally in their data foundations and methodological assumptions. PAM matrices are constructed from a small, manually curated set of 71 closely related protein families exhibiting at least 85% sequence identity, capturing 1,572 observed mutations to model evolutionary changes in closely related sequences.[28] In contrast, BLOSUM matrices derive from a larger, automated compilation of approximately 2,000 conserved, ungapped alignment blocks from over 500 diverse protein families in the BLOCKS database, enabling broader representation of substitutions across varying evolutionary distances.[28] A key methodological distinction lies in how these matrices handle evolutionary scaling. PAM matrices start with a base matrix (PAM1) derived from closely related sequences and extrapolate to greater distances through matrix powering—a process of repeated matrix multiplication to simulate multiple substitutions over time, yielding variants like PAM250 for ~20% identity alignments. BLOSUM matrices, however, avoid such extrapolation; they are computed directly from clustered sequences at specific identity thresholds (e.g., BLOSUM62 from sequences clustered at ≤62% identity), producing empirically observed log-odds scores without relying on a Markov chain model of evolution. This direct derivation makes BLOSUM more attuned to real-world alignment patterns in diverse datasets. Performance differences emerge prominently in practical applications, particularly for sequence database searches and alignments. BLOSUM matrices, especially BLOSUM62, demonstrate superior sensitivity in detecting distant homologs in the "twilight zone" of 20-35% sequence identity, with studies showing up to 20-30% improvements in alignment accuracy and false positive reduction compared to equivalent PAM matrices like PAM120. PAM matrices, by virtue of their evolutionary modeling focus, excel in phylogenetic reconstruction and global alignments of closely related sequences but underperform in local alignments of divergent proteins due to over-extrapolation artifacts.[28] Additionally, PAM score distributions tend to be more negative overall, reflecting a bias toward penalizing mismatches in close relatives, whereas BLOSUM scores are tuned for local alignments, incorporating fewer gap penalties and emphasizing conserved blocks to better handle insertions/deletions.[28] Selection guidelines favor BLOSUM62 as the default for general-purpose protein sequence comparisons, such as in BLAST searches, due to its balanced performance across evolutionary distances. PAM250, however, is preferred for evolutionary modeling and phylogeny inference where precise divergence estimation is needed. These choices align with their design intents: PAM for theoretical evolutionary simulation and BLOSUM for empirical search optimization. To illustrate score similarities and differences, the following table compares select entries from PAM250 and BLOSUM62 (both in log-odds units, scaled to approximate half-bits where applicable). Note the close alignment on highly conserved substitutions like Cys-Cys, reflecting shared reliance on observed frequencies, while divergences appear in less conserved pairs.| Substitution | PAM250 Score | BLOSUM62 Score |
|---|---|---|
| Cys-Cys | 9 | 9 |
| Ala-Ala | 1 | 4 |
| Trp-Trp | 11 | 11 |
| Ala-Ser | 1 | 1 |
| Cys-Trp | -8 | -2 |
Recent Developments
Since 2020, advancements in amino acid substitution matrices have increasingly incorporated structural data, machine learning, and domain-specific adaptations to enhance accuracy in sequence alignment and evolutionary analysis, building briefly on foundational BLOSUM and PAM approaches.[1] A notable structure-aware development is the 3Di substitution matrix, introduced in 2024, which leverages three-dimensional coordinates from the Protein Data Bank (PDB) to model amino acid substitutions based on spatial proximities in protein structures. This matrix facilitates structural phylogenetics by enabling phylogenetic inference directly from 3D alignments, addressing limitations in sequence-based methods for deeply divergent proteins. Applications include resolving challenging evolutionary relationships in protein families where sequence similarity is low, providing a foundation for revisiting complex phylogenetic problems.[29][12] Machine learning integrations have advanced with AlphaFold2-based substitution matrices, such as those developed in 2024 for assessing pathogenic impacts of motif-specific substitutions in short linear motifs (SLiMs). These matrices use AlphaFold2-predicted structures combined with FoldX-derived stability changes (ΔΔG) to generate motif-domain binding scores, filtering variants from ClinVar and gnomAD databases. For instance, the MotSASi framework achieves 97% accuracy in predicting deleterious single amino acid substitutions, outperforming methods like AlphaMissense with an F1-score of 0.98 and Matthews correlation coefficient of 0.78 across 2,335 variants, expanding high-confidence SLiM coverage by 22-fold.[30] Reduced-alphabet substitution matrices, also emerging in 2025, group the 20 standard amino acids into smaller classes (typically 10-15) based on physicochemical properties and evolutionary conservation to trace ancient coding alphabets in modern proteins. These models depart from fixed alphabets by incorporating aminoacyl-tRNA synthetase (aaRS) informed substitutions, enabling reconstruction of phylogenies from ancient protein datasets and revealing timelines more aligned with Earth's geological history. Such approaches provide insights into early genetic code evolution without assuming a universal 20-amino-acid framework.[31][32] Specialized variants include depth-dependent matrices distinguishing substitutions in buried versus surface residues, as explored in 2024 studies using structural abundance scores. These matrices predict cellular abundance changes from residue substitutions with high accuracy (Spearman correlation up to 0.73) when conditioned on burial status, aiding deleterious mutation prediction in varied protein environments. For proteins with biased compositions, such as membrane proteins, updated matrices account for transmembrane-specific frequencies, improving ortholog discrimination and clustering in AT/GC-rich genomes, as reviewed in 2020 with ongoing refinements.[33][34] Examples of practical impact include the 2021 ProtSub matrix, which incorporates coevolutionary pairs from Pfam alignments and spatial filters (≤4.5 Å) to align sequences with structures, yielding significant gains in homology detection for twilight-zone sequences across 4,184 CATH families—enhancing congruence with structure matches in over 2,000 cases. Overall, these innovations have improved alignment accuracy for distant homologs by up to 20% in benchmark tests.[35][10] Looking ahead, future trends point to AI-driven dynamic matrices tailored for personalized genomics, integrating molecular dynamics simulations and deep learning to adapt substitutions to individual variant contexts, as seen in emerging frameworks like Dynamicasome for precision medicine applications.[36]Nucleotide Substitution Matrices
Standard Models
Standard nucleotide substitution matrices are simpler than their protein counterparts due to the limited alphabet of four bases (A, C, G, T), often serving as baselines for DNA or RNA sequence alignments in phylogenetic analyses. The most basic form is the identity matrix, a 4×4 diagonal matrix where exact matches receive a positive score (typically +1) and all mismatches are scored as 0 or a uniform negative value, emphasizing identical nucleotides while penalizing any substitution equally. This matrix is particularly useful for closely related sequences where conservation is high, and it is implemented by default in tools like Clustal Omega for nucleotide alignments. To account for evolutionary patterns, more refined match-mismatch matrices distinguish between transitions (substitutions between purines A↔G or pyrimidines C↔T, which occur more frequently) and transversions (cross-substitutions like A↔C, which are rarer and often more disruptive). In these models, transitions receive higher (less penalizing) scores than transversions—for instance, a common scheme assigns +5 to matches, +1 to transitions, and -4 to transversions—to better reflect observed mutation biases in DNA evolution. Such matrices improve alignment accuracy for moderately divergent sequences by reducing noise from unlikely changes.[37][38] Empirical nucleotide substitution matrices extend these ideas by deriving scores from real data, such as aligned genomes or conserved genes, often incorporating models like HKY85, which allows unequal base frequencies and a parameter (κ) for the transition/transversion rate ratio. The HKY85 model, originally developed for mitochondrial DNA analysis, is integrated into phylogenetic software to generate context-aware matrices that capture heterogeneous substitution patterns across taxa.[39] These matrices are constructed using log-odds principles adapted from protein alignments, where observed substitution frequencies from multiple sequence alignments of conserved regions are compared against background nucleotide frequencies (e.g., via Bayesian integrals or pseudocounts) to yield scores for the 4×4 matrix, benefiting from the smaller alphabet for computational efficiency.[40] For illustration, a representative 4×4 empirical-style nucleotide substitution matrix might appear as follows, with positive scores for matches and transitions, and negatives for transversions (values rounded for clarity; actual scores vary by dataset):| A | G | C | T | |
|---|---|---|---|---|
| A | +1 | +0.5 | -1 | -1 |
| G | +0.5 | +1 | -1 | -1 |
| C | -1 | -1 | +1 | +0.5 |
| T | -1 | -1 | +0.5 | +1 |
Specialized Variants
Specialized nucleotide substitution matrices extend standard models by incorporating biological constraints such as transition-transversion biases, codon-level selection, RNA secondary structures, and context-specific mutation patterns observed in recent viral epidemics. These variants address limitations in generic models by tailoring substitution probabilities to particular genomic contexts, improving accuracy in phylogenetic inference and evolutionary analysis.[39] Transition-transversion matrices explicitly weight purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) transitions higher than purine-to-pyrimidine transversions, reflecting observed mutational biases in DNA evolution. This is achieved through parameters like kappa (κ), which scales the rate of transitions relative to transversions in extensions of the Jukes-Cantor model, such as the Hasegawa-Kishino-Yano (HKY85) model. In HKY85, the instantaneous rate matrix Q incorporates base frequencies π and κ, where off-diagonal entries for transitions are multiplied by κ, while transversions receive a base rate, capturing the twofold higher likelihood of transitions in many lineages.[43] For example, assuming equal base frequencies (π_A = π_C = π_G = π_T = 0.25) and κ = 2, a simplified 4x4 rate matrix (scaled such that rows sum to zero) penalizes transversions as follows:Q = [
[-1.0, 0.5, 0.25, 0.25],
[0.5, -1.0, 0.25, 0.25],
[0.25, 0.25, -1.0, 0.5],
[0.25, 0.25, 0.5, -1.0]
]
Q = [
[-1.0, 0.5, 0.25, 0.25],
[0.5, -1.0, 0.25, 0.25],
[0.25, 0.25, -1.0, 0.5],
[0.25, 0.25, 0.5, -1.0]
]
