BLOSUM
from Wikipedia

The BLOSUM62 matrix. The amino acids are grouped and coloured according to Margaret Dayhoff's classification scheme; positive and zero values are highlighted.

In bioinformatics, a BLOSUM (BLOcks SUbstitution Matrix) is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences and are based on local alignments. They were first introduced in a paper by Steven Henikoff and Jorja Henikoff.[1] The authors scanned the BLOCKS database for highly conserved regions of protein families (regions with no gaps in the sequence alignment) and counted the relative frequencies of amino acids and their substitution probabilities. They then calculated a log-odds score for each of the 210 possible pairs of the 20 standard amino acids (190 substitutions plus 20 identities). All BLOSUM matrices are based on observed alignments; unlike the PAM matrices, they are not extrapolated from comparisons of closely related proteins.
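The figure of 210 pairs can be checked with a few lines of Python. This is only a sanity check; the ordering of the one-letter codes in `AMINO_ACIDS` is an arbitrary choice made for this sketch.

```python
from itertools import combinations_with_replacement

# The 20 standard amino acids as one-letter codes (order is arbitrary here).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Unordered pairs, including each residue paired with itself.
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
identities = [p for p in pairs if p[0] == p[1]]
substitutions = [p for p in pairs if p[0] != p[1]]

print(len(pairs), len(identities), len(substitutions))
```

Because a substitution matrix is symmetric, only these unordered pairs need to be estimated; the full 20×20 table is obtained by mirroring.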

Biological background


The genetic instructions of every replicating cell in a living organism are contained within its DNA.[2] Throughout the cell's lifetime, this information is transcribed and replicated by cellular mechanisms to produce proteins or to provide instructions for daughter cells during cell division, and the possibility exists that the DNA may be altered during these processes.[2][3] This is known as a mutation. At the molecular level, there are regulatory systems that correct most — but not all — of these changes to the DNA before it is replicated.[3][4]

The functionality of a protein is highly dependent on its structure.[5] Changing a single amino acid in a protein may reduce its ability to carry out this function, or the mutation may even change the function that the protein carries out.[3] Changes like these may severely impact a crucial function in a cell, potentially causing the cell — and in extreme cases, the organism — to die.[6] Conversely, the change may allow the cell to continue functioning albeit differently, and the mutation can be passed on to the organism's offspring. If this change does not result in any significant physical disadvantage to the offspring, the possibility exists that this mutation will persist within the population. The possibility also exists that the change in function becomes advantageous.

The 20 amino acids translated by the genetic code vary greatly by the physical and chemical properties of their side chains.[5] However, these amino acids can be categorised into groups with similar physicochemical properties.[5] Substituting an amino acid with another from the same category is more likely to have a smaller impact on the structure and function of a protein than replacement with an amino acid from a different category.

Sequence alignment is a fundamental research method in modern biology. Protein sequences are most commonly aligned to find similarity between different sequences in order to infer function or establish evolutionary relationships. This helps researchers understand the origin and function of genes through homology and conservation. Substitution matrices are used by alignment algorithms to calculate the similarity of protein sequences; however, the utility of the Dayhoff PAM matrix has decreased over time because it requires sequences with more than 85% similarity. To fill this gap, Henikoff and Henikoff introduced the BLOSUM (BLOcks SUbstitution Matrix) family, which led to marked improvements in alignments and in searches using queries from each of the groups of related proteins.[1]

Terminology

BLOSUM
Blocks Substitution Matrix, a substitution matrix used for sequence alignment of proteins.
Scoring metrics (statistical versus biological)
When evaluating a sequence alignment, one would like to know how meaningful it is. This requires a scoring matrix: a table of values that describes the probability of a biologically meaningful amino-acid or nucleotide residue pair occurring in an alignment. Scores for each position are obtained from the frequencies of substitutions in blocks of local alignments of protein sequences.[7]
BLOSUM r
The matrix built from blocks in which sequences at least r% identical have been clustered, so that the sequences actually compared are less than r% identical.
  • E.g., BLOSUM62 is the matrix built using sequences with less than 62% similarity (sequences with ≥ 62% identity were clustered together).
  • Note: BLOSUM 62 is the default matrix for protein BLAST. Experimentation has shown that the BLOSUM-62 matrix is among the best for detecting most weak protein similarities.[1]

Several sets of BLOSUM matrices exist, built from different alignment databases and named with numbers. BLOSUM matrices with high numbers are designed for comparing closely related sequences, while those with low numbers are designed for comparing distantly related sequences. For example, BLOSUM80 is used for closely related alignments, and BLOSUM45 is used for more distantly related alignments. The matrices were created by merging (clustering) all sequences that were more similar than a given percentage into one single sequence and then comparing only those merged sequences (which were all more divergent than the given percentage), thus reducing the contribution of closely related sequences. The percentage used was appended to the name, giving, for example, BLOSUM80 when sequences that were more than 80% identical were clustered.

Construction of BLOSUM matrices


BLOSUM matrices are obtained by using blocks of similar amino acid sequences as data and then applying statistical methods to the data to obtain the similarity scores. The statistical method proceeds in the following steps:[8]

Eliminating Sequences


Eliminate the sequences that are more than r% identical. This can be done in two ways: either by removing sequences from the block, or by finding similar sequences and replacing them with a new sequence that represents the cluster. Elimination removes protein sequences that are more similar than the specified threshold.
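The clustering variant of this step can be sketched as follows. This is a minimal illustration, not the original Henikoff program: the function names, the toy block, and the choice of single-linkage grouping with a "greater than or equal to" comparison are assumptions made for the example.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two equal-length, ungapped segments."""
    if len(a) != len(b):
        raise ValueError("segments in a block must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_block(segments: list[str], threshold: float) -> list[list[int]]:
    """Group segment indices so that any two segments linked by a chain of
    pairwise identities >= threshold end up in the same cluster."""
    parent = list(range(len(segments)))

    def find(i: int) -> int:
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if percent_identity(segments[i], segments[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters: dict[int, list[int]] = {}
    for i in range(len(segments)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical block of four ungapped segments.
block = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEYGHLKV", "WWDEYPHLRV"]
print(cluster_block(block, 80))  # segments 0 and 2 are only 70% identical but are chained via 1
```

Each resulting cluster then stands in for a single sequence when pairs are counted, which is what keeps large groups of near-identical sequences from dominating the statistics.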

Calculating Frequency & Probability


The BLOCKS database stores the sequence alignments of the most conserved regions of protein families, and these alignments are used to derive the BLOSUM matrices. Only sequences with a percentage identity below the threshold are used. Within each block, the pairs of amino acids in each column of the multiple alignment are counted.
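Counting pairs column by column can be illustrated with a small sketch. The three-sequence block and the helper name `count_column_pairs` are made up for this example, and no clustering weights are applied here.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(block: list[str]) -> Counter:
    """Count unordered residue pairs within every column of an ungapped block."""
    counts: Counter = Counter()
    for column in zip(*block):                 # iterate over alignment columns
        for a, b in combinations(column, 2):   # every pair of sequences in the column
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical block: 3 sequences, 3 columns.
block = ["AVL", "AIL", "SIL"]
counts = count_column_pairs(block)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

A column with n sequences contributes n(n-1)/2 pairs, so this block yields 3 pairs per column and 9 pairs in total.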

Log odds ratio


The log-odds ratio compares how often each amino acid combination occurs in the observed data with how often the pair is expected to occur by chance. It is rounded off and used in the substitution matrix.

$$\text{LogOddRatio} = 2 \log_2 \left( \frac{P(O)}{P(E)} \right)$$

where $P(O)$ is the probability of observing the pair and $P(E)$ is the expected probability of such a pair occurring, given the background probabilities of each amino acid.

BLOSUM Matrices


The odds for relatedness are calculated from the log-odds ratios, which are then rounded off to give the BLOSUM substitution matrices.

Score of the BLOSUM matrices


A scoring matrix, or table of values, is required for evaluating the significance of a sequence alignment; it describes the probability of a biologically meaningful amino-acid or nucleotide residue pair occurring in an alignment. Typically, when two nucleotide sequences are compared, all that is scored is whether or not the two bases at a position are the same. All matches receive the same score and all mismatches receive the same score (typically +1 or +5 for matches, and -1 or -4 for mismatches).[9] Proteins are different. Substitution matrices for amino acids are more complicated and implicitly take into account everything that might affect the frequency with which any amino acid is substituted for another. The objective is to provide a relatively heavy penalty for aligning two residues if they have a low probability of being homologous (correctly aligned by evolutionary descent). Two major forces drive amino-acid substitution rates away from uniformity: different substitutions occur with different frequencies, and some substitutions are less functionally tolerated than others and are therefore selected against.[7]

Commonly used substitution matrices include the blocks substitution (BLOSUM) [1] and point accepted mutation (PAM) [10][11] matrices. Both are based on taking sets of high-confidence alignments of many homologous proteins and assessing the frequencies of all substitutions, but they are computed using different methods.[7]

Scores within a BLOSUM are log-odds scores that measure, in an alignment, the logarithm of the ratio between the likelihood of two amino acids being aligned for biological reasons and the likelihood of the same two amino acids being aligned by chance. The matrices are based on the minimum percentage identity of the aligned protein sequences used in calculating them.[12] Every possible identity or substitution is assigned a score based on its observed frequency in alignments of related proteins.[13] A positive score is given to the more likely substitutions, while a negative score is given to the less likely substitutions.

To calculate a BLOSUM matrix, the following equation is used:

$$S_{ij} = \frac{1}{\lambda} \log \left( \frac{p_{ij}}{q_i q_j} \right)$$

Here, $p_{ij}$ is the probability of two amino acids $i$ and $j$ replacing each other in a homologous sequence, and $q_i$ and $q_j$ are the background probabilities of finding the amino acids $i$ and $j$ in any protein sequence. The factor $\lambda$ is a scaling factor, set such that the matrix contains easily computable integer values.
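As a rough numerical illustration of this equation, the sketch below scores a single pair from its pair probability (written `p_ij` here) and the two background probabilities (`q_i`, `q_j`), using a natural logarithm and a scaling factor `lam`. The default `lam = ln(2)/2` is an assumption chosen so that scores come out in half-bit units; the probabilities are invented for the example.

```python
import math


def blosum_score(p_ij: float, q_i: float, q_j: float, lam: float = math.log(2) / 2) -> int:
    """Rounded log-odds score S_ij = (1/lam) * ln(p_ij / (q_i * q_j))."""
    return round(math.log(p_ij / (q_i * q_j)) / lam)


# Invented probabilities: both residues have background frequency 0.05,
# so the chance expectation for the pair is 0.05 * 0.05 = 0.0025.
enriched = blosum_score(0.01, 0.05, 0.05)       # seen 4x more often than chance
as_expected = blosum_score(0.0025, 0.05, 0.05)  # seen exactly as often as chance
depleted = blosum_score(0.000625, 0.05, 0.05)   # seen 4x less often than chance
print(enriched, as_expected, depleted)
```

Pairs seen more often than chance receive positive scores, pairs seen as often as chance score zero, and pairs seen less often receive negative scores.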

Variants


BLOSUM


BLOSUM80: more related proteins

BLOSUM62: midrange

BLOSUM45: distantly related proteins

The BLOSUM62 matrix with the amino acids in the table grouped according to the chemistry of the side chain. Each value in the matrix is calculated by dividing the frequency of occurrence of the amino acid pair in the BLOCKS database, clustered at the 62% level, by the probability that the same two amino acids might align by chance. The ratio is then converted to a logarithm and expressed as a log-odds score, as for PAM. BLOSUM matrices are usually scaled in half-bit units.[14] A score of zero indicates that a given pair of amino acids was found aligned in the database as often as expected by chance, a positive score indicates that the pair was found aligned more often than by chance, and a negative score indicates that it was found aligned less often than by chance.
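Because the scores are in half-bit units, a score can be converted back into an approximate observed-to-expected ratio. The helper below is a small sketch of that conversion; the function name is made up, and the three example scores are BLOSUM62 entries (alanine/alanine, tryptophan/tryptophan, aspartic acid/leucine). Since published scores are rounded to integers, the recovered ratios are only approximate.

```python
def odds_ratio_from_half_bits(score: int) -> float:
    """Observed/expected frequency ratio implied by a half-bit log-odds score."""
    return 2.0 ** (score / 2.0)


examples = {"A/A": 4, "W/W": 11, "D/L": -4}
for pair, score in examples.items():
    ratio = odds_ratio_from_half_bits(score)
    print(f"{pair}: score {score:+d} -> about {ratio:.2f} times the chance expectation")
```

Each additional score unit multiplies the implied ratio by the square root of 2, so a difference of two units corresponds to a factor of two.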

PMB


PMB (Probability Matrix from Blocks) of 2004 uses the additivity of evolutionary distances to improve on BLOSUM's analysis of the BLOCKS database. The up-to-date 2001 version of BLOCKS was used to generate a new set of BLOSUM matrices. The "observed substitution frequencies" found in these BLOSUM matrices are used to estimate actual substitution frequencies (with higher evolutionary distance, i.e. lower r, some later replacement can mask earlier replacements). PMB thus defines a true evolutionary model like PAM and JTT do. It is not a symmetric matrix.[15]

RBLOSUM


The original code written by Henikoff and Henikoff does not exactly act according to their paper's description[1] of the algorithm. The BLOSUM62 from that program has been used for many years as standard. Surprisingly, the miscalculated BLOSUM62 improves search performance compared to the 2008 corrected version of the same relative entropy (RBLOSUM64).[16]

A 2018 article claims that RBLOSUM is better than BLOSUM and CorBLOSUM.[17]

CorBLOSUM


A 2016 paper finds further errors in the original code not addressed by the 2008 RBLOSUM correction. The corrected version from this paper, CorBLOSUM, manages to be more effective than BLOSUM at similarity search in about 75% of cases.[18]

Some uses in bioinformatics


Research applications


BLOSUM scores were used to predict and understand the surface gene variants among hepatitis B virus carriers[19] and T-cell epitopes.[20]

Surface gene variants among hepatitis B virus carriers


DNA sequences of HBsAg were obtained from 180 patients, of whom 51 were chronic HBV carriers and 129 were newly diagnosed patients, and compared with consensus sequences built from 168 HBV sequences imported from GenBank. Literature review and BLOSUM scores were used to define potentially altered antigenicity.[19]

Reliable prediction of T-cell epitopes


A novel input representation has been developed consisting of a combination of sparse encoding, BLOSUM encoding, and input derived from hidden Markov models. This method was used to predict T-cell epitopes for the genome of hepatitis C virus, and its possible applications in guiding rational vaccine design were discussed.[20]

Use in BLAST


BLOSUM matrices are also used as a scoring matrix when comparing DNA sequences or protein sequences to judge the quality of the alignment. This form of scoring system is utilized by a wide range of alignment software including BLAST.[21]

Comparing PAM and BLOSUM


In addition to BLOSUM matrices, a previously developed scoring matrix known as PAM can be used. The two serve the same scoring purpose but use differing methodologies. BLOSUM looks directly at mutations in motifs of related sequences, while PAM matrices extrapolate evolutionary information based on closely related sequences.[1]

Because PAM and BLOSUM are different methods for expressing the same kind of scoring information, the two can be compared; however, because the scores are obtained in very different ways, a PAM100 is not equivalent to a BLOSUM100.[22]

| PAM | BLOSUM |
| --- | --- |
| PAM100 | BLOSUM90 |
| PAM120 | BLOSUM80 |
| PAM160 | BLOSUM62 |
| PAM200 | BLOSUM50 |
| PAM250 | BLOSUM45 |
The relationship between PAM and BLOSUM
| PAM | BLOSUM |
| --- | --- |
| To compare closely related sequences, PAM matrices with lower numbers are created. | To compare closely related sequences, BLOSUM matrices with higher numbers are created. |
| To compare distantly related proteins, PAM matrices with high numbers are created. | To compare distantly related proteins, BLOSUM matrices with low numbers are created. |
The differences between PAM and BLOSUM
| PAM | BLOSUM |
| --- | --- |
| Based on global alignments of closely related proteins. | Based on local alignments. |
| PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence, which corresponds to 99% sequence identity. | BLOSUM62 is a matrix calculated from comparisons of sequences with a pairwise identity of no more than 62%. |
| Other PAM matrices are extrapolated from PAM1. | Based on observed alignments; they are not extrapolated from comparisons of closely related proteins. |
| Higher numbers in the matrix naming scheme denote larger evolutionary distance. | Larger numbers in the matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance.[23] |

Availability


The "reference" version of BLOSUM is found in the NCBI toolkits. Both the older (deprecated) NCBI C Toolkit and the current NCBI C++ Toolkit provide the BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, and BLOSUM90 matrices. Both also offer APIs for making use of the matrices.[24][25]

The original source code for calculating BLOSUM is also found on the NCBI website, at https://ftp.ncbi.nih.gov/repository/blocks/unix/blosum/. This archive "blosum.tar.Z" represents the original miscalculated version with improved search performance from 1992.[16] The archive also contains pre-calculated BLOSUM outputs at the following similarity levels: "-2" (blosumn), 30, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 95, and 100.[26]

Software Packages


There are several software packages in different programming languages that allow easy use of Blosum matrices. Besides the aforementioned NCBI Toolkits, there are:

... and many more.

from Grokipedia
BLOSUM (BLOcks SUbstitution Matrix) is a family of empirically derived substitution matrices used in bioinformatics to score alignments of protein sequences by quantifying the likelihood of amino acid substitutions based on evolutionary conservation. Developed by Steven Henikoff and Jorja G. Henikoff in 1992, these matrices are constructed from observed substitution frequencies in highly conserved, ungapped blocks of aligned protein segments extracted from the BLOCKS database, which contained over 2,000 such blocks representing more than 500 families of related proteins. Unlike earlier models such as the PAM matrices, which extrapolate from closely related sequences, BLOSUM matrices are derived directly from alignments of distantly related proteins, making them particularly effective for detecting remote homologies.

The construction of BLOSUM matrices involves clustering sequences within each block at a specified identity threshold (e.g., 62% for BLOSUM62) to reduce bias from closely related sequences, followed by counting pairwise substitutions and computing log-odds scores that compare observed frequencies with those expected by chance. Positive scores indicate pairs observed more often than chance, such as identities and conservative substitutions, while negative scores reflect rare or non-conservative changes; for instance, the BLOSUM62 matrix assigns a score of 4 to an alanine-alanine match, 11 to a tryptophan-tryptophan match, and -4 to a dissimilar pair such as aspartic acid-leucine. This approach yields a series of matrices (e.g., BLOSUM30 for highly divergent sequences, BLOSUM80 for closer relatives), with lower numbers corresponding to greater evolutionary distances.

BLOSUM matrices are widely applied in sequence alignment algorithms, phylogenetic analysis, and database searching tools such as BLAST, where BLOSUM62 serves as the default scoring matrix for protein queries because of its balance between sensitivity and specificity in identifying moderately distant homologs. Their empirical basis in real protein blocks enhances performance over theoretical models for both global and local alignments in diverse biological contexts, including structural prediction and functional annotation. Ongoing refinements and specialized variants, such as tcrBLOSUM for T-cell receptor analysis, continue to improve their accuracy in modern bioinformatics pipelines as of 2025.

Biological and Conceptual Foundations

Protein Sequence Alignment Needs

Proteins evolve primarily through point mutations, insertions, and deletions (indels), which introduce variations in their sequences over time. Point mutations alter single amino acids, while indels add or remove segments, potentially reshaping protein structure and function. However, functional constraints (such as maintaining active sites, structural stability, or binding interfaces) limit the acceptable changes, leading to the preservation of specific blocks across evolutionary lineages. These conserved blocks serve as signatures of shared ancestry and functional importance, enabling researchers to detect evolutionary relationships even when overall similarity is low.

Aligning protein sequences is crucial for inferring evolutionary history and functional conservation, but it becomes particularly challenging for distantly related proteins, where identity often drops below 25%. In such cases, neutral mutations (those with little impact on fitness) accumulate through genetic drift, obscuring the signal from selectively constrained positions that are preserved because of their role in maintaining protein function. This complicates alignment: neutral drift produces divergent sequences that mask homologous regions, while purifying selection enforces conservation in critical areas. The neutralist-selectionist debate underscores this tension: neutral theory posits that most substitutions are non-adaptive and fixed by drift, whereas selectionist views emphasize adaptive pressures shaping functional sites, which influences how evolutionary rates are interpreted from alignments (e.g., via dN/dS ratios).

Multiple sequence alignments (MSAs) address these challenges by integrating sequences from related proteins within families, highlighting conserved domains that reflect evolutionary and functional constraints. In protein families, MSAs reveal motifs or blocks where residues are highly preserved, such as glycine or proline in structural turns or cysteine in disulfide bonds, indicating regions under strong selective pressure. Tools like the Conserved Domains Database (CDD) use MSAs to model these domains, converting them into position-specific score matrices for detecting homology and annotating functions across diverse sequences. By focusing on these conserved elements, MSAs facilitate the identification of distant homologs and the distinction between neutral variability and functionally vital conservation.

Role of Substitution Matrices

A substitution matrix is a scoring system that assigns numerical values to pairs of amino acids (or nucleotides) based on the observed likelihood of one replacing the other during evolution. These scores reflect the relative frequency of substitutions derived from alignments of related protein sequences, enabling the quantification of similarity beyond exact matches.

Substitution matrices fall into two primary categories: Dayhoff-style matrices, such as the Point Accepted Mutation (PAM) series, which are extrapolated from closely related sequences using a Markov model of evolution, and block-based matrices, such as BLOSUM (BLOcks SUbstitution Matrix), which are derived directly from conserved blocks in distantly related proteins without extrapolation. The PAM approach, pioneered by Margaret Dayhoff, models evolutionary change over time by counting accepted point mutations in phylogenetically close proteins, while BLOSUM matrices, developed by Steven and Jorja Henikoff, rely on empirical frequencies from local alignments to capture substitutions across a broader range of evolutionary distances. This distinction makes PAM matrices suited to analyses of closely related sequences and lets BLOSUM matrices perform better on more divergent ones.

The core purpose of substitution matrices in bioinformatics is to reward biologically plausible alignments and penalize improbable ones, thereby enhancing the detection of homologous proteins that share common ancestry despite sequence divergence. By incorporating evolutionary patterns, these matrices improve the sensitivity of alignment algorithms, facilitating accurate inference of protein function, structure, and evolutionary relationships. In practice, substitution matrices assign positive scores to conservative substitutions, such as aspartic acid (Asp) to glutamic acid (Glu), both negatively charged residues likely to preserve protein function, and negative scores to radical changes, such as tryptophan (Trp), a large aromatic residue, to glycine (Gly), a small non-polar one, which are evolutionarily rare and disruptive. This scoring scheme enables quantitative evaluation of alignment quality in widely used tools, including BLAST for rapid database searches and multiple sequence alignment programs, where higher total scores indicate more reliable homologies.
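To make the scoring idea concrete, the sketch below sums substitution scores over an ungapped alignment. Only six BLOSUM62 entries are included (a full matrix has 210); the dictionary name, the helper functions, and the short peptide strings are invented for illustration.

```python
# A handful of BLOSUM62 entries in half-bit units, keyed by alphabetically sorted pair.
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("D", "D"): 6,
    ("W", "W"): 11,
    ("D", "E"): 2,    # conservative: both negatively charged
    ("G", "W"): -2,   # radical: small residue vs. large aromatic residue
    ("D", "L"): -4,   # radical: charged vs. hydrophobic
}


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric score; raises KeyError for pairs outside the subset."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def score_ungapped(seq1: str, seq2: str) -> int:
    """Total score of two equal-length sequences aligned without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(score_ungapped("ADW", "AEW"))  # identities plus one conservative change
print(score_ungapped("ADW", "ALG"))  # one identity plus two radical changes
```

The first alignment scores far higher than the second even though both contain substitutions, which is the behaviour that lets alignment tools separate plausible homologs from chance matches.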

Historical Development and Terminology

Origins and Key Contributors

The BLOSUM (BLOcks SUbstitution Matrix) substitution matrices were developed in 1992 by Steven Henikoff and Jorja G. Henikoff, researchers at the Fred Hutchinson Cancer Research Center in Seattle, Washington. Their work addressed key shortcomings of prior substitution models, particularly the PAM matrices, which relied on extrapolation from alignments of closely related proteins and struggled to detect distant evolutionary relationships because of accumulated mutations. Instead, the Henikoffs pioneered a block-based method that analyzed conserved, gap-free segments of protein alignments drawn from the BLOCKS database, a resource they had assembled earlier, containing over 2,000 blocks from more than 500 protein families.

This innovation marked a shift from global sequence alignment strategies, which treated entire proteins uniformly, to a focus on local, highly conserved blocks that better capture substitutions across divergent homologs without the biases of extrapolation. The approach enabled log-odds matrices to be derived directly from observed frequencies in diverse, evolutionarily varied data, improving accuracy in similarity searches and alignments.

The seminal publication, "Amino acid substitution matrices from protein blocks," appeared in the Proceedings of the National Academy of Sciences in November 1992, establishing the BLOSUM framework and introducing multiple matrices tuned to different levels of divergence. Among these, BLOSUM62 quickly gained prominence for its effective balance of sensitivity to weak similarities and specificity against false positives, becoming the default matrix in tools like BLAST for protein database searches.

Core Terminology

In the context of BLOSUM matrices, the core terminology revolves around the concepts used to derive substitution scores from conserved protein alignments. These terms originate from the foundational work on protein blocks and are essential for understanding how empirical data informs the scoring system without relying on evolutionary models like those behind the PAM matrices.

A block is a contiguous, ungapped segment of aligned protein sequences derived from a highly conserved region, capturing local similarity among related proteins without insertions or deletions. Blocks are the basic units from which substitution patterns are observed in BLOSUM construction.

The observed frequency ($f_{ij}$) is the empirical count or relative frequency with which amino acids $i$ and $j$ appear aligned as a pair across the collected blocks, providing a direct measure of substitutions in conserved contexts. This frequency is weighted according to clustering to account for sequence redundancy.

The target frequency ($q_{ij}$) is the estimated probability that amino acid $i$ is aligned with amino acid $j$ in related sequences, derived from the observed frequencies in blocks. It emphasizes substitutions seen in distantly related sequences.

The background frequency ($p_i$) is the overall relative occurrence of amino acid $i$ across all positions in the protein blocks or in a broader protein database, serving as a baseline for distinguishing random alignments from evolutionarily significant ones.

The clustering threshold specifies the minimum percentage identity (e.g., 62%) at which similar sequences within a block are grouped together, reducing bias from overrepresented sequences and allowing a focus on diverse evolutionary signals; higher thresholds yield matrices suited to closer homologs.

These terms are tied to the BLOCKS database, a repository of ungapped multiple alignments of conserved protein regions, which was originally constructed from protein families documented in PROSITE. The BLOCKS database, developed by Henikoff and colleagues, supplied the blocks used in the original matrix derivation.

Construction Process

Sequence Clustering and Block Selection

The construction of BLOSUM matrices begins with the selection and preparation of protein sequence alignments from the BLOCKS database, a repository of conserved, ungapped alignment blocks derived from aligned protein families. These blocks represent regions of high similarity within related proteins, ensuring that the data captures evolutionary substitutions in conserved contexts without the complications of gaps. Originally compiled in the early 1990s, the BLOCKS database provided approximately 2,000 blocks from over 500 diverse protein groups, emphasizing alignments of distantly related sequences to reflect broader evolutionary patterns.

Blocks are selected on strict criteria to maintain quality and relevance: each must be at least 5 residues long and include aligned segments from two or more sequences of the same protein family. This minimum length ensures sufficient statistical power for substitution analysis while focusing on locally conserved motifs, such as those in active sites or structural domains. Blocks are drawn from a wide array of protein families across taxonomic and functional categories, which helps mitigate sampling bias by avoiding over-representation of any single lineage.

To eliminate redundancy and prevent over-representation of closely related sequences, single-linkage clustering is applied within each block at a specified sequence identity threshold, such as 62% for the BLOSUM62 matrix. In this process, sequences sharing identity at or above the threshold are grouped into clusters, and each cluster is treated as a single representative to down-weight highly similar entries. Each sequence in a cluster is weighted by the inverse of the cluster size (i.e., weight = 1 / number of sequences in the cluster), so that every cluster contributes the same total weight regardless of its size and the overall tally reflects phylogenetic diversity without bias toward prolific sequence clusters.

Observed Frequencies and Probabilities

In the construction of BLOSUM matrices, observed frequencies are derived by tallying the occurrences of aligned amino acid pairs across the selected protein blocks, with contributions weighted by cluster membership to mitigate bias from overrepresented related sequences. For each block, the count of a pair $(i, j)$ is the sum, over aligned positions, of the products of the cluster weights of the sequences containing $i$ and those containing $j$, so that closely related sequences contribute less to the overall tally. This weighted counting aggregates data from thousands of blocks in databases like BLOCKS, providing an empirical estimate of substitution patterns in conserved regions.

The observed frequency matrix $F$ is formed as

$$f_{ij} = \frac{\sum \text{weighted pairs } (i,j)}{\text{total weighted pairs across all blocks}},$$

where the numerator sums the weighted occurrences of each pair type over all blocks and the denominator normalizes by the aggregate weighted pair count. This yields a symmetric matrix ($f_{ij} = f_{ji}$) that reflects the undirected nature of amino acid substitutions in evolutionary alignments, with diagonal elements $f_{ii}$ capturing identity matches and off-diagonal elements capturing substitutions. The approach uses local alignments without gaps, focusing on high-confidence conserved segments to enhance reliability.

Target probabilities $q_{ij}$ are obtained directly from the observed frequencies as $q_{ij} = f_{ij}$, already normalized such that $\sum_{i,j} q_{ij} = 1$. Background probabilities for individual amino acids are then computed as the marginals $p_i = \sum_j q_{ij}$, representing the overall frequency of each residue in the aligned dataset and serving as the basis for the expected random alignments used in subsequent scoring.

Written out over blocks, the target probabilities are

$$q_{ij} = \frac{1}{M} \sum_{b \in \text{blocks}} w_b \, c_{ij,b},$$

where $M$ is the total weighted pair count across all blocks (the normalizing constant that makes the $q_{ij}$ sum to 1), $w_b$ is the block-specific or pair weight derived from clustering, and $c_{ij,b}$ is the raw count of aligned $i$-$j$ pairs in block $b$. This summation ensures that the probabilities capture the empirical distribution of substitutions while maintaining symmetry.
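The weighted counting described above might be sketched as follows. Several choices here are assumptions made for the illustration rather than details confirmed by the text: each sequence is weighted by 1 / cluster size, pairs of sequences from the same cluster are skipped, and each unordered pair is split evenly over $(i, j)$ and $(j, i)$ so that the result is symmetric and sums to 1 over all ordered pairs. The function name and toy block are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def weighted_target_frequencies(block, clusters):
    """Return (q, p): symmetric pair probabilities q[(a, b)] and marginals p[a]."""
    weight, cluster_of = {}, {}
    for c, members in enumerate(clusters):
        for idx in members:
            weight[idx] = 1.0 / len(members)   # each cluster carries total weight 1
            cluster_of[idx] = c

    f = defaultdict(float)
    for column in zip(*block):
        for i, j in combinations(range(len(block)), 2):
            if cluster_of[i] == cluster_of[j]:
                continue                        # assumption: no pairs within a cluster
            w = weight[i] * weight[j]
            a, b = column[i], column[j]
            f[(a, b)] += w / 2                  # split over both orderings;
            f[(b, a)] += w / 2                  # identities (a == b) receive the full w

    total = sum(f.values())
    q = {pair: value / total for pair, value in f.items()}
    p = defaultdict(float)
    for (a, _b), value in q.items():
        p[a] += value                           # p_i = sum_j q_ij
    return q, dict(p)


# Toy block: sequences 0 and 1 are identical and form one cluster.
block = ["AL", "AL", "SL"]
q, p = weighted_target_frequencies(block, clusters=[[0, 1], [2]])
print(q)
print(p)
```

With the two identical sequences sharing one cluster, the block behaves as if it held only two sequences, "AL" and "SL", which is the down-weighting effect the clustering step is meant to achieve.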

Log-Odds Ratio Derivation

The log-odds ratio in BLOSUM matrices transforms observed substitution probabilities into scores that quantify the likelihood of evolutionary relatedness relative to chance, thereby prioritizing biologically plausible alignments over random matches. This approach, rooted in information theory, measures the "surprise" or information content of an observed pair by taking the logarithm of the ratio between its observed frequency and its expected frequency under independence, with base-2 logarithms yielding scores in bit units. The log-odds score $s_{ij}$ between amino acids $i$ and $j$ is

$$s_{ij} = 2 \log_2 \left( \frac{q_{ij}}{p_i p_j} \right)$$

Here, $q_{ij}$ is the observed probability of the pair $(i, j)$ in aligned blocks, while $p_i p_j$ is the expected probability assuming independent occurrence based on the background frequencies $p_i$ and $p_j$. The factor of 2 scales the scores to half-bit units, and the result is rounded to the nearest integer for computational efficiency in integer-based alignment algorithms.

Diagonal elements $s_{ii}$, corresponding to identical amino acids, are typically positive because conserved residues occur more frequently than expected by chance, reflecting selective pressure for preservation. Off-diagonal scores are negative for substitutions rarer than random expectation, indicating unlikely changes, while scores near zero correspond to pairs that occur about as often as chance predicts. The resulting 20×20 matrix is symmetric, with $s_{ij} = s_{ji}$, because the underlying pair probabilities are symmetric, ensuring consistent scoring regardless of the order of the pair.
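A small sketch of this computation on a toy two-letter alphabet is given below. The probabilities are invented, the function name is hypothetical, and the input is assumed to be a symmetric table of ordered-pair probabilities that sums to 1, so that the background frequencies are simply its row sums.

```python
import math


def log_odds_matrix(q: dict) -> dict:
    """Half-bit scores s_ij = round(2 * log2(q_ij / (p_i * p_j))).

    q maps ordered pairs (a, b) to probabilities summing to 1, with q[a, b] == q[b, a].
    """
    p: dict = {}
    for (a, _b), value in q.items():
        p[a] = p.get(a, 0.0) + value            # background p_i = sum_j q_ij
    return {
        (a, b): round(2 * math.log2(value / (p[a] * p[b])))
        for (a, b), value in q.items()
    }


# Toy alphabet {X, Y}: identities are enriched, the X/Y substitution is depleted.
q = {("X", "X"): 0.4, ("X", "Y"): 0.1, ("Y", "X"): 0.1, ("Y", "Y"): 0.4}
s = log_odds_matrix(q)
print(s)
```

Both residues have background frequency 0.5 here, so any pair probability above 0.25 gives a positive score and any below 0.25 gives a negative one.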

Matrix Generation and Scoring

The BLOSUM matrix is constructed as a symmetric 20×20 table, with rows and columns corresponding to the 20 standard amino acids, where each entry $s_{ij}$ quantifies the log-odds ratio for aligning amino acid $i$ with amino acid $j$. The diagonal elements $s_{ii}$ are the scores for identical matches, which are typically positive and reflect how often each residue is conserved in protein blocks. The matrix is undirected, meaning $s_{ij} = s_{ji}$, so it can be used in sequence comparisons without bias toward either sequence. To enhance computational efficiency and interpretability, BLOSUM scores are scaled and rounded to the nearest integer in half-bit units, which is achieved by multiplying the natural-logarithm log-odds values by $\frac{2}{\ln 2}$ (approximately 2.885) before rounding. This scaling preserves the additive properties essential for dynamic programming algorithms in sequence alignment, where scores accumulate linearly along the alignment path, and each unit of score corresponds to an odds ratio of $\sqrt{2}$.