Sequence homology
from Wikipedia
Gene phylogeny as red and blue branches within grey species phylogeny. Top: An ancestral gene duplication produces two paralogs (histone H1.1 and 1.2). A speciation event produces orthologs in the two daughter species (human and chimpanzee). Bottom: in a separate species (E. coli), a gene has a similar function (histone-like nucleoid-structuring protein) but has a separate evolutionary origin and so is an analog.

Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event (paralogs), or else a horizontal (or lateral) gene transfer event (xenologs).[1]

Homology among DNA, RNA, or proteins is typically inferred from their nucleotide or amino acid sequence similarity. Significant similarity is strong evidence that two sequences are related by evolutionary changes from a common ancestral sequence. Alignments of multiple sequences are used to indicate which regions of each sequence are homologous.

Identity, similarity, and conservation

A sequence alignment of mammalian histone proteins. Sequences are the middle 120-180 amino acid residues of the proteins. Residues that are conserved across all sequences are highlighted in grey. The key below denotes conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ).[2]

The term "percent homology" is often used to mean "sequence similarity": the percentage of identical residues (percent identity), or the percentage of residues conserved with similar physicochemical properties (percent similarity), e.g. leucine and isoleucine, is used to "quantify the homology." Based on the definition of homology specified above, this terminology is incorrect, since sequence similarity is the observation and homology is the conclusion.[3] Sequences are either homologous or not.[3] It follows that the term "percent homology" is a misnomer.[4]

As with morphological and anatomical structures, sequence similarity might occur because of convergent evolution, or, as with shorter sequences, by chance, meaning that they are not homologous. Homologous sequence regions are also called conserved. This is not to be confused with conservation in amino acid sequences, where the amino acid at a specific position has been substituted with a different one that has functionally equivalent physicochemical properties.

Partial homology can occur where a segment of the compared sequences has a shared origin, while the rest does not. Such partial homology may result from a gene fusion event.
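The difference between percent identity and percent similarity can be made concrete with a minimal Python sketch. It assumes the two sequences are already aligned (equal length, gaps written as "-"), skips gapped columns, and uses a hypothetical grouping of amino acids by physicochemical property (`SIMILAR_GROUPS`) in place of a real substitution matrix. Neither number says anything by itself about whether the sequences are homologous.

```python
# Hypothetical similarity groups, chosen only for illustration; real tools
# score substitutions with matrices such as PAM or BLOSUM.
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("STNQ"), set("AG")]


def is_similar(a, b):
    """True if two residues are identical or fall in the same group."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def percent_identity_similarity(seq_a, seq_b):
    """Return (percent identity, percent similarity) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue (no gap).
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    identical = sum(a == b for a, b in columns)
    similar = sum(is_similar(a, b) for a, b in columns)
    return 100.0 * identical / len(columns), 100.0 * similar / len(columns)


# Toy aligned pair: 7 ungapped columns, 5 identical, 2 conservative changes.
identity, similarity = percent_identity_similarity("MKLV-EDA", "MKIVGEEA")
print(f"identity={identity:.1f}%  similarity={similarity:.1f}%")
```

Whether gapped columns count toward the denominator is a convention that differs between tools, so reported percentages are only comparable when the same convention is used.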

Beyond sequence similarity


Proteins are known to conserve their tertiary structure more strongly than their amino acid sequences. Two distantly related proteins can have minimal or even undetectable sequence similarity, yet have highly similar folds that can be compared via structural alignment. Examples of such proteins could formerly be discovered only by experimental structural determination methods. Modern protein structure prediction methods such as AlphaFold2 allow possible homologs to be identified without wet lab work.[5]

RNA is also known to conserve tertiary structure more strongly than primary structure. RNA secondary structure prediction was found to be helpful in human-to-mouse comparison.[6]

Orthology

Top: An ancestral gene duplicates to produce two paralogs (Genes A and B). A speciation event produces orthologs in the two daughter species. Bottom: in a separate species, an unrelated gene has a similar function (Gene C) but has a separate evolutionary origin and so is an analog.

Homologous sequences are orthologous if they are inferred to be descended from the same ancestral sequence separated by a speciation event: when a species diverges into two separate species, the copies of a single gene in the two resulting species are said to be orthologous. Orthologs, or orthologous genes, are genes in different species that originated by vertical descent from a single gene of the last common ancestor. The term "ortholog" was coined in 1970 by the molecular evolutionist Walter Fitch.[7]

For instance, the plant Flu regulatory protein is present both in Arabidopsis (a multicellular higher plant) and Chlamydomonas (a single-celled green alga). The Chlamydomonas version is more complex: it crosses the membrane twice rather than once, contains additional domains and undergoes alternative splicing. However, it can fully substitute for the much simpler Arabidopsis protein if transferred from the algal to the plant genome by means of genetic engineering. Significant sequence similarity and shared functional domains indicate that these two genes are orthologous genes,[8] inherited from the shared ancestor.

Orthology is strictly defined in terms of ancestry. Given that the exact ancestry of genes in different organisms is difficult to ascertain due to gene duplication and genome rearrangement events, the strongest evidence that two similar genes are orthologous is usually found by carrying out phylogenetic analysis of the gene lineage. Orthologs often, but not always, have the same function.[9]

Orthologous sequences provide useful information in taxonomic classification and phylogenetic studies of organisms. The pattern of genetic divergence can be used to trace the relatedness of organisms. Two organisms that are very closely related are likely to display very similar DNA sequences between two orthologs. Conversely, an organism that is further removed evolutionarily from another organism is likely to display a greater divergence in the sequence of the orthologs being studied.[citation needed]

Databases of orthologous genes and de novo orthology inference tools


Given their tremendous importance for biology and bioinformatics, orthologous genes have been organized in several specialized databases that provide tools to identify and analyze orthologous gene sequences. These resources employ approaches that can be generally classified into those that use heuristic analysis of all pairwise sequence comparisons, and those that use phylogenetic methods. Sequence comparison methods were first pioneered in the COGs database in 1997.[10] These methods have been extended and automated in twelve different databases, the most advanced being AYbRAH (Analyzing Yeasts by Reconstructing Ancestry of Homologs),[11] as well as in the databases listed below. Some tools, among them SonicParanoid and OrthoFinder, predict orthologs de novo from the input protein sequences and may not provide any database.

  • eggNOG[12][13]
  • GreenPhylDB[14][15] for plants
  • InParanoid[16][17] focuses on pairwise ortholog relationships
  • OHNOLOGS[18][19] is a repository of the genes retained from whole genome duplications in the vertebrate genomes including human and mouse.
  • OMA[20]
  • OrthoDB[21] appreciates that the orthology concept is relative to different speciation points by providing a hierarchy of orthologs along the species tree.
  • OrthoInspector[22] is a repository of orthologous genes for 4753 organisms covering the three domains of life
  • OrthologID[23][24]
  • OrthoMaM[25][26][27] for mammals
  • OrthoMCL[28][29]
  • Roundup[30]
  • SonicParanoid[31][32] is a graph based method that uses machine learning to reduce execution times and infer orthologs at the domain level.

Tree-based phylogenetic approaches aim to distinguish speciation from gene duplication events by comparing gene trees with species trees, as implemented in databases and software tools such as:

A third category of hybrid approaches uses both heuristic and phylogenetic methods to construct clusters and determine trees, for example:

Paralogy


Paralogous genes are genes that are related via duplication events in the last common ancestor (LCA) of the species being compared. They result from the mutation of duplicated genes during separate speciation events. When descendants from the LCA share mutated homologs of the original duplicated genes then those genes are considered paralogs.[1]

As an example, in the LCA, one gene (gene A) may get duplicated to make a separate similar gene (gene B); those two genes will continue to get passed to subsequent generations. During speciation, one environment will favor a mutation in gene A (gene A1), producing a new species with genes A1 and B. Then, in a separate speciation event, one environment will favor a mutation in gene B (gene B1), giving rise to a new species with genes A and B1. The descendants' genes A1 and B1 are paralogous to each other because they are homologs that are related via a duplication event in the last common ancestor of the two species.[1]

Additional classifications of paralogs include alloparalogs (out-paralogs) and symparalogs (in-paralogs). Alloparalogs are paralogs that evolved from gene duplications that preceded the given speciation event. In other words, alloparalogs are paralogs that evolved from duplication events that happened in the LCA of the organisms being compared. The example above is an example of alloparalogy. Symparalogs are paralogs that evolved from gene duplication of paralogous genes in subsequent speciation events. From the example above, if the descendant with genes A1 and B underwent another speciation event where gene A1 duplicated, the new species would have genes B, A1a, and A1b. In this example, genes A1a and A1b are symparalogs.[1]

Vertebrate Hox genes are organized in sets of paralogs. Each Hox cluster (HoxA, HoxB, etc.) is on a different chromosome. For instance, the human HoxA cluster is on chromosome 7. The mouse HoxA cluster shown here has 11 paralogous genes (2 are missing).[41]

Paralogous genes can shape the structure of whole genomes and thus explain genome evolution to a large extent. Examples include the Homeobox (Hox) genes in animals. These genes not only underwent gene duplications within chromosomes but also whole genome duplications. As a result, Hox genes in most vertebrates are clustered across multiple chromosomes with the HoxA-D clusters being the best studied.[41]

Another example is the globin genes, which encode myoglobin and hemoglobin and are considered to be ancient paralogs. Similarly, the four known classes of hemoglobins (hemoglobin A, hemoglobin A2, hemoglobin B, and hemoglobin F) are paralogs of each other. While each of these proteins serves the same basic function of oxygen transport, they have already diverged slightly in function: fetal hemoglobin (hemoglobin F) has a higher affinity for oxygen than adult hemoglobin. Function is not always conserved, however. Human angiogenin diverged from ribonuclease, for example, and while the two paralogs remain similar in tertiary structure, their functions within the cell are now quite different.[citation needed]

It is often asserted that orthologs are more functionally similar than paralogs of similar divergence, but several papers have challenged this notion.[42][43][44]

Regulation


Paralogs are often regulated differently, e.g. by having different tissue-specific expression patterns (see Hox genes). However, they can also be regulated differently on the protein level. For instance, Bacillus subtilis encodes two paralogues of glutamate dehydrogenase: GudB is constitutively transcribed whereas RocG is tightly regulated. In their active, oligomeric states, both enzymes show similar enzymatic rates. However, swaps of enzymes and promoters cause severe fitness losses, thus indicating promoter–enzyme coevolution. Characterization of the proteins shows that, compared to RocG, GudB's enzymatic activity is highly dependent on glutamate and pH.[45]

Paralogous chromosomal regions


Sometimes, large regions of chromosomes share gene content similar to other chromosomal regions within the same genome.[46] They are well characterised in the human genome, where they have been used as evidence to support the 2R hypothesis. Sets of duplicated, triplicated and quadruplicated genes, with the related genes on different chromosomes, are deduced to be remnants from genome or chromosomal duplications. A set of paralogy regions is together called a paralogon.[47] Well-studied sets of paralogy regions include regions of human chromosomes 2, 7, 12 and 17 containing Hox gene clusters, collagen genes, keratin genes and other duplicated genes,[48] regions of human chromosomes 4, 5, 8 and 10 containing neuropeptide receptor genes, NK class homeobox genes and many more gene families,[49][50][51] and parts of human chromosomes 13, 4, 5 and X containing the ParaHox genes and their neighbors.[52] The major histocompatibility complex (MHC) on human chromosome 6 has paralogy regions on chromosomes 1, 9 and 19.[53] Much of the human genome seems to be assignable to paralogy regions.[54]

Ohnology

A whole genome duplication event produces a genome with two ohnolog copies of each gene.
A speciation event produces orthologs of a gene in the two daughter species. A horizontal gene transfer event from one species to another adds a xenolog of the gene to its genome.
A speciation event produces orthologs of a gene in the two daughter species. Subsequent hybridisation of those species generates a hybrid genome with a homoeolog copy of each gene from both species.

Ohnologous genes are paralogous genes that have originated by a process of whole-genome duplication. The name was first given in honour of Susumu Ohno by Ken Wolfe.[55] Ohnologues are useful for evolutionary analysis because all ohnologues in a genome have been diverging for the same length of time (since their common origin in the whole genome duplication). Ohnologues are also known to show greater association with cancers, dominant genetic disorders, and pathogenic copy number variations.[56][57][58][59][60]

Xenology


Homologs resulting from horizontal gene transfer between two organisms are termed xenologs. Xenologs can have different functions if the new environment is vastly different for the horizontally moving gene. In general, though, xenologs typically have similar function in both organisms. The term was coined by Walter Fitch.[7]

Homoeology


Homoeologous (also spelled homeologous) chromosomes or parts of chromosomes are those brought together following inter-species hybridization and allopolyploidization to form a hybrid genome, and whose relationship was completely homologous in an ancestral species.[61] In allopolyploids, the homologous chromosomes within each parental sub-genome should pair faithfully during meiosis, leading to disomic inheritance; however in some allopolyploids, the homoeologous chromosomes of the parental genomes may be nearly as similar to one another as the homologous chromosomes, leading to tetrasomic inheritance (four chromosomes pairing at meiosis), intergenomic recombination, and reduced fertility.[citation needed]

Gametology


Gametology denotes the relationship between homologous genes on non-recombining, opposite sex chromosomes. The term was coined by García-Moreno and Mindell in 2000.[62] Gametologs result from the origination of genetic sex determination and barriers to recombination between sex chromosomes. Examples of gametologs include CHDW and CHDZ in birds.[62]

from Grokipedia
Sequence homology refers to the shared evolutionary ancestry between biological sequences, such as those of DNA, RNA, or proteins, where similarity arises from descent from a common ancestor rather than convergent evolution. This concept is distinct from mere sequence similarity, which is a quantifiable metric independent of evolutionary history, and it enables inferences about functional, structural, and phylogenetic relationships among biomolecules. In practice, sequence homology is inferred with computational tools such as BLAST, which align sequences and evaluate the statistical significance of each match, using metrics like E-values (typically <10⁻⁶ for confident protein homology) to distinguish true evolutionary relatedness from random matches.

Homologous sequences are categorized into types including orthologs, which evolve via speciation and often retain similar functions across species, and paralogs, which result from gene duplication events within a lineage and may acquire new roles through divergence. Less commonly, xenologs arise from horizontal gene transfer, complicating traditional homology assessments.

The study of sequence homology is pivotal in bioinformatics for genome annotation, where over 80% of metagenomic sequences are matched to known homologs to predict functions, and in evolutionary biology for reconstructing phylogenetic trees. Protein sequence homology searches are notably more sensitive than nucleotide comparisons, detecting relationships at identity levels as low as 20% when supported by structural or functional data, overturning outdated thresholds like the 30% identity rule. Applications extend to protein structure modeling and allergenicity prediction (e.g., requiring >35% identity over 80 amino acids), underscoring its role in advancing medical, agricultural, and biotechnological research.

Basic Principles

Sequence Identity and Similarity

Sequence identity is defined as the percentage of positions in an aligned pair of biological sequences where the residues or nucleotides are exactly the same. This metric quantifies the extent of exact matches between homologous sequences, serving as a fundamental measure in assessing their relatedness. For instance, in nucleotide sequences, identity counts matching bases (A with A, etc.), while in protein sequences, it counts identical amino acids at corresponding positions after alignment. The calculation of sequence identity relies on a prior alignment of the two sequences, expressed mathematically as:

$$\text{Identity} = \frac{\text{number of identical positions}}{\text{total number of aligned positions}} \times 100$$

This formula excludes gaps introduced during alignment unless specified otherwise, and the total number of aligned positions includes only those contributing to the alignment.

Sequence similarity extends beyond exact matches by incorporating conservative substitutions, where amino acids with similar physicochemical properties (e.g., one hydrophobic residue replaced by another) are scored positively. This broader measure is computed using substitution matrices that assign scores based on observed evolutionary substitutions, such as the Point Accepted Mutation (PAM) matrices derived from closely related protein alignments or the Blocks Substitution Matrix (BLOSUM) series derived from conserved blocks in distantly related proteins. PAM matrices, introduced by Dayhoff et al., model evolutionary changes over time with log-odds ratios reflecting the likelihood of substitutions relative to chance. BLOSUM matrices, developed by Henikoff and Henikoff, use local alignments to derive scores, with BLOSUM62 being widely adopted as a general-purpose default for protein similarity searches. Similarity scores are typically the sum of matrix values for aligned pairs, minus gap penalties, and are often normalized or converted to bit scores so that results from different scoring systems can be compared. A positive score indicates that the aligned residues are more similar than expected by chance.

These metrics depend on the alignment method used to generate the comparable positions. Global alignment, as implemented in the Needleman-Wunsch algorithm, optimizes similarity across the entire lengths of two sequences, suitable for comparing full-length homologous proteins like orthologs of similar size. In contrast, local alignment via the Smith-Waterman algorithm focuses on the highest-scoring subsequences, ideal for detecting conserved domains within longer, divergent sequences. Both approaches are commonly run with affine gap penalties to account for insertions or deletions, ensuring a robust basis for identity and similarity calculations.

As an illustrative example, consider the protein sequences of the human hemoglobin alpha (HBA) and beta (HBB) chains, which are homologous subunits forming the tetrameric oxygen carrier. Aligning their 141- and 146-amino-acid mature sequences, respectively, using a global method reveals positions of exact identity for the identity metric, and applying a matrix like BLOSUM62 scores conservative changes for similarity, highlighting regions of functional conservation despite overall divergence. Such calculations typically yield identity values of approximately 49% for these chains, with similarity scores reflecting additional shared properties. Conservation can be viewed as a qualitative extension of similarity, emphasizing persistently similar residues across multiple homologs.
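Global alignment as described above can be illustrated with a minimal Python sketch of the Needleman-Wunsch dynamic program. To stay short it makes two simplifying assumptions that differ from typical practice: a flat match/mismatch score instead of a substitution matrix such as BLOSUM62, and a linear gap penalty instead of an affine one. The function name and default scores are illustrative choices, not taken from any particular tool.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACGT", "ACT")
print(best)
print(top)
print(bottom)
```

Percent identity is then the fraction of columns in the returned pair of strings that hold the same residue, which is why the identity figure depends on the alignment and the scoring scheme that produced it.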

Conservation and Functional Implications

Conservation in sequence homology refers to the retention of specific residues or motifs across homologous sequences, primarily driven by purifying selection that eliminates deleterious mutations to maintain functional integrity. This process ensures that critical sequence elements remain largely unchanged despite evolutionary divergence, serving as indicators of shared ancestry and essential biological roles. Sequence identity and similarity provide the foundational metrics for detecting these conserved regions within multiple sequence alignments.

Conserved elements typically include active sites in enzymes, binding domains for substrates or cofactors, and regulatory sequences such as promoter motifs or splice sites. For instance, in zinc-finger proteins, a large family of DNA-binding transcription factors, the cysteine and histidine residues that coordinate zinc ions are highly conserved, enabling the characteristic fold and DNA recognition. These elements often correspond to regions under strong purifying selection, where even single-residue changes could disrupt molecular interactions or stability.

The functional implications of sequence conservation are profound, as it strongly correlates with indispensable roles in cellular processes; for example, catalytic residues in enzymes like serine proteases maintain near-identical configurations across distant homologs to preserve reaction specificity. Conservation highlights evolutionary pressures favoring functionality over variability, with highly conserved proteins such as histones (essential for DNA packaging) exhibiting substitution rates as low as 0.006 × 10⁻⁹ per site per year, in contrast to rapidly evolving fibrinopeptides involved in blood clotting, which accumulate changes at rates up to 4.3 × 10⁻⁹ per site per year due to weaker selective constraints. This variation underscores how purifying selection acts more stringently on sequences central to core cellular functions, providing evidence of homology while reflecting protein-specific evolutionary rates.

To quantify conservation, scores are derived from multiple sequence alignments using information-theoretic measures like Shannon entropy, which assesses positional variability by calculating the uncertainty in residue distributions across aligned sequences (lower entropy indicates higher conservation), or Jensen-Shannon divergence, which compares site-specific frequency profiles with a background model to identify functionally constrained positions. These metrics enable prioritization of residues likely under purifying selection, aiding in the annotation of homologous sequences and the prediction of functional sites.
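The entropy-based conservation score mentioned above can be sketched in a few lines of Python. The sketch assumes an already computed multiple sequence alignment given as equal-length strings, ignores gap characters, and applies no sequence weighting or background correction, all of which real scoring tools usually add. The toy alignment is invented for illustration.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    residues = [r for r in column if r != "-"]  # gaps are ignored here
    if not residues:
        return 0.0  # convention chosen for this sketch: all-gap column scores 0
    total = len(residues)
    counts = Counter(residues)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def alignment_entropies(alignment):
    """Return one entropy value per column; lower means more conserved."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy(col) for col in zip(*alignment)]


# Toy alignment: columns 1 and 3 are invariant, column 4 is fully variable.
msa = ["CKHA", "CRHG", "CKHS", "CRHT"]
entropies = alignment_entropies(msa)
print(entropies)
```

Columns with entropy near zero are candidates for functionally constrained positions, although with only a handful of closely related sequences an invariant column may simply reflect too little divergence time.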

Beyond Sequence Similarity

Detecting sequence homology becomes particularly challenging when sequence identity falls below approximately 25%, a range known as the "twilight zone" where fewer than 10% of alignments reliably indicate true homology due to extensive evolutionary divergence. In this regime, standard pairwise sequence comparisons often fail to distinguish homologous proteins from unrelated ones, necessitating alternative evidence such as structural or functional similarities to infer common ancestry.

The concept of distant homology, where evolutionary relationships persist despite low sequence similarity, was pioneered by Margaret Dayhoff in the 1970s through the development of Point Accepted Mutation (PAM) matrices, which model long-term substitution patterns to score alignments of diverged sequences. These matrices, derived from observed mutations in closely related proteins and extrapolated for greater evolutionary distances, enabled the identification of remote homologs by quantifying the likelihood of alignments under an evolutionary model.

To address limitations in sequence-based detection, profile-based search methods like PSI-BLAST (Position-Specific Iterated BLAST) construct position-specific scoring matrices (PSSMs) from initial alignments, iteratively refining profiles to detect distant homologs with improved sensitivity over standard BLAST. More recently, structure prediction tools such as AlphaFold have revolutionized homology inference by generating accurate 3D models from sequences alone, allowing structural comparisons to reveal relationships obscured by sequence divergence. These predicted structures can be compared to known folds, facilitating the annotation of remote homologs even in the absence of experimental data.

Structural homology provides robust evidence for evolutionary relatedness when sequences are too diverged, often detected through comparisons in databases like SCOP (Structural Classification of Proteins) and CATH (Class, Architecture, Topology, and Homologous superfamily), which hierarchically classify protein domains based on fold topology and shared ancestry. For instance, the Rossmann fold, a β-α-β motif common to nucleotide-binding dehydrogenases, exhibits structural conservation across diverse enzymes, enabling fold-based detection of homologs with sequence identities below 20%. Such fold similarities in SCOP and CATH often confirm homology where sequence data alone is inconclusive.

Functional homology further supports inferences of common origin, as diverged proteins may retain shared biochemical roles through conserved active sites despite minimal sequence overlap. In serine proteases, for example, distant homologs within the same clan preserve the catalytic triad (Ser-His-Asp) and the mechanism for peptide bond hydrolysis, allowing functional annotation and evolutionary tracing even at low sequence identities. This conservation of active-site geometry, verifiable through structural alignments, underscores how functional parallels complement sequence and structural evidence in establishing homology.

Orthology

Definition and Evolutionary Origin

Orthologs are defined as genes in different species that have evolved from a single ancestral gene through speciation events, resulting in vertical descent and typically preserving similar functions across lineages. This form of homology contrasts with other types by excluding gene duplication as the originating mechanism; instead, orthologs diverge solely due to the separation of lineages from a common ancestor, without intervening duplication within either lineage. The concept emphasizes that orthologous genes share a history of inheritance from the last common ancestor, often reflected in their sequence similarity, which serves as a primary indicator of this relationship.

The evolutionary origin of orthology traces back to speciation, where a gene present in the common ancestor splits into separate lineages as populations diverge, leading to gradual sequence changes while core functions are maintained to support organismal fitness. For instance, the alpha-globin gene (HBA1 in humans and its counterpart in mice) exemplifies this, as these orthologs arose from speciation events in mammals and retain roles in oxygen transport, with high sequence conservation underscoring their shared ancestry. Similarly, Hox genes, such as those in the HoxA cluster, are orthologous across vertebrates such as humans and mice; these transcription factors regulate body patterning during development, and their divergence after speciation has preserved collinear expression patterns essential for axial organization. Another classic case is the cytochrome c gene (CYCS in humans), whose orthologs are highly conserved across diverse eukaryotes due to its critical role in mitochondrial electron transport, with limited sequence variation over more than a billion years of evolution.

The term "ortholog" was coined by Walter M. Fitch in 1970 to distinguish this speciation-based homology from paralogy, enabling clearer phylogenetic analysis by identifying genes that track species divergence rather than lineage-specific duplications. This framework has been foundational in reconstructing evolutionary trees, as orthologous sequences allow inference of divergence times and relationships without confounding effects from duplications. By focusing on vertical descent, orthology provides a robust basis for understanding functional conservation across species.
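The speciation-versus-duplication criterion can be sketched in Python on a gene tree whose internal nodes are assumed to be already labeled with the event they represent (in practice those labels come from reconciling the gene tree with a species tree). Two genes are called orthologs if their most recent common ancestor node is a speciation, and paralogs if it is a duplication. The tuple encoding of the tree and the gene names are hypothetical, loosely following the histone H1.1/H1.2 example.

```python
def leaves(node):
    """Set of gene names under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relationship(tree, gene_x, gene_y):
    """Classify two genes by the event at their most recent common ancestor."""
    if gene_x == gene_y:
        raise ValueError("need two different genes")
    pair = {gene_x, gene_y}
    node = tree
    while not isinstance(node, str):
        event, left, right = node
        if event not in ("speciation", "duplication"):
            raise ValueError(f"unknown event label: {event}")
        in_left, in_right = pair & leaves(left), pair & leaves(right)
        if in_left and in_right:
            # The two genes separate here, so this node is their common ancestor.
            return "orthologs" if event == "speciation" else "paralogs"
        if len(in_left) == 2:
            node = left
        elif len(in_right) == 2:
            node = right
        else:
            raise ValueError("both genes must be leaves of the tree")
    raise ValueError("both genes must be leaves of the tree")


# An ancestral duplication (H1.1 / H1.2) followed by a human-chimpanzee split.
gene_tree = (
    "duplication",
    ("speciation", "human_H1.1", "chimp_H1.1"),
    ("speciation", "human_H1.2", "chimp_H1.2"),
)
print(relationship(gene_tree, "human_H1.1", "chimp_H1.1"))
print(relationship(gene_tree, "human_H1.1", "chimp_H1.2"))
```

The second call shows that genes in different species are not automatically orthologs: human H1.1 and chimpanzee H1.2 diverged at the duplication, so they are paralogs even though a speciation happened later.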

Detection Methods and Databases

Detection of orthologous sequences relies on computational methods that leverage sequence similarity, genomic context, and evolutionary history. One widely used approach is the reciprocal best BLAST hit (RBH) method, which identifies potential orthologs by performing pairwise sequence alignments and selecting pairs where each is the top match for the other across two genomes, often applying similarity thresholds such as an E-value below 10⁻⁵ to filter initial hits. Synteny analysis complements this by evaluating the conservation of gene order and chromosomal location between genomes, helping to distinguish orthologs from paralogs in regions of preserved genomic organization. Phylogenetic tree reconciliation provides a more rigorous framework, constructing gene phylogenies and comparing them to the species tree to infer duplication and speciation events, thereby confirming orthology through evolutionary consistency.

For de novo orthology inference without relying on pre-existing annotations, tools like OrthoMCL employ graph-based clustering algorithms, such as the Markov Cluster algorithm, to group homologous proteins into orthogroups based on all-against-all similarity searches. Similarly, QuartetS uses a quartet decomposition strategy to rapidly assess orthology by examining evolutionary signals in subsets of four taxa, offering scalability for large datasets of prokaryotic and eukaryotic genomes. Statistical models enhance these predictions by applying likelihood ratios to evaluate the probability of orthology versus paralogy, incorporating parameters like sequence divergence and branch lengths in phylogenetic contexts.

Key databases facilitate access to precomputed orthologs. Ensembl Compara generates orthology predictions through tree construction and reconciliation across over 200 species, with the 2025 release expanding coverage to include additional non-vertebrate lineages and integrating updated assemblies. OrthoDB provides hierarchical orthology annotations spanning eukaryotes, prokaryotes, and viruses, with the 2024 update incorporating over 20,000 genomes and emphasizing functional conservation across taxonomic levels. eggNOG offers orthologous groups refined by phylogenetic profiles, covering 12,535 organisms with functional annotations.

Despite these advances, challenges persist in orthology detection, particularly with incomplete genome assemblies that can lead to missed orthologs or erroneous assignments, and horizontal gene transfer events that blur boundaries by introducing non-orthologous similarities. Accuracy is typically assessed using sensitivity, which measures the proportion of true orthologs correctly identified, and specificity, the proportion of non-orthologs correctly excluded, with top methods achieving around 90% sensitivity and 95% specificity on benchmark datasets.

Recent developments aim to improve scalability and precision in phylogenomics. For instance, OrthoFinder's 2025 update employs advanced orthogroup inference with integrated gene tree rooting, enabling accurate orthology detection across thousands of genomes in hours while reducing biases in large-scale analyses.
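The reciprocal best hit idea described in this section can be sketched in Python on already parsed search results. The sketch assumes each search direction is given as a dictionary mapping (query, subject) to (bit score, E-value); the gene names, scores, and the `max_evalue` cutoff are made up for illustration, and no real BLAST output is parsed. Ties are broken by whichever hit is seen first, which a production pipeline would handle more carefully.

```python
def best_hit_per_query(hits, max_evalue):
    """Map each query to its highest-scoring subject that passes the E-value cutoff."""
    best = {}
    for (query, subject), (bit_score, evalue) in hits.items():
        if evalue > max_evalue:
            continue  # hit too weak to consider
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-5):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hit_per_query(a_vs_b, max_evalue)
    best_ba = best_hit_per_query(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical results: genome A searched against genome B, and the reverse.
a_vs_b = {
    ("a1", "b1"): (250.0, 1e-70),
    ("a1", "b2"): (90.0, 1e-20),
    ("a2", "b2"): (180.0, 1e-50),
    ("a3", "b1"): (60.0, 1e-8),
}
b_vs_a = {
    ("b1", "a1"): (248.0, 1e-69),
    ("b2", "a2"): (175.0, 1e-49),
    ("b2", "a1"): (88.0, 1e-19),
}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Gene a3 finds b1 as its best hit, but b1 prefers a1, so the pair is not reciprocal and is left out. This is the behavior that lets RBH filter some paralogs, though it can also miss true orthologs after lineage-specific duplication or gene loss.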

Paralogy

Definition and Mechanisms

Paralogs are homologous genes related by duplication within a single genome or species, which distinguishes them from orthologs, which arise through speciation events. The term "paralogy" was introduced by Walter M. Fitch in 1970 to describe the divergence of gene copies following duplication, in contrast to orthology. The definition emphasizes that paralogs share a common ancestor via duplication rather than through vertical descent from a speciation event.

Gene duplication, the primary mechanism generating paralogs, occurs through several processes that copy genetic material within the genome. Tandem duplications produce adjacent copies, often as a result of unequal crossing over during meiosis, and are common in rapidly evolving gene families. Segmental duplications involve larger chromosomal regions, sometimes spanning megabases, and contribute to genomic instability and copy number variation. Whole-genome duplications (WGD), or polyploidization events, duplicate the entire genome at once, leading to widespread paralog formation, particularly in plants and ancient vertebrate lineages.

High sequence similarity between paralogs is often taken as evidence of a relatively recent duplication, because divergence accumulates over time. In the human genome, approximately 70.5% of protein-coding genes have at least one paralog, reflecting the prevalence of duplication events across evolutionary history.

Following duplication, paralogs can meet one of several fates: nonfunctionalization, where one copy accumulates deleterious mutations and becomes a pseudogene; subfunctionalization, in which the ancestral functions are partitioned between the copies, reducing redundancy; or neofunctionalization, where one copy acquires a novel function while the other retains the original role. These outcomes are not mutually exclusive and depend on selective pressures, with nonfunctionalization being the most frequent immediate fate for many duplicates.

Prominent examples of paralogous families include the globin genes, which arose from ancient duplications, including segmental duplications, in early vertebrates and enabled specialized oxygen transport roles such as fetal versus adult hemoglobin. Similarly, the olfactory receptor gene family in humans expanded through repeated tandem and segmental duplications, resulting in over 400 functional paralogs that detect diverse odorants, though many copies have become pseudogenes over time.
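
The observation that recently duplicated paralogs remain highly similar is usually quantified as percent identity over an alignment. The following is a minimal sketch that assumes the two sequences are already aligned to equal length with `-` as the gap character; the function name is hypothetical, and the choice to exclude gapped columns from the denominator is only one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # gapped columns are skipped under this convention
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned pair: 4 identical residues out of 5 ungapped columns.
identity = percent_identity("ACGT-A", "ACCTTA")
print(identity)
```

The value is an observation about similarity only; whether two genes are paralogs, and how old the duplication is, remains an inference drawn from it.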

Gene Regulation in Paralogs

Paralogous genes often exhibit regulatory divergence following duplication, where changes in cis-regulatory elements such as promoters and enhancers alter transcription factor binding and, in turn, tissue-specific or spatiotemporal expression patterns. This divergence allows paralogs to partition ancestral functions or acquire novel regulatory roles, contributing to evolutionary innovation without disrupting essential gene networks.

Two primary mechanisms underpin this regulatory evolution: dosage balance and escape from adaptive conflict. The dosage balance hypothesis posits that after whole-genome duplication (WGD), paralogs involved in stoichiometric complexes are retained to maintain balanced levels of their gene products, with regulatory adjustments preventing dosage imbalances that could impair macromolecular assembly. This selective pressure favors coordinated expression divergence, in which paralogs evolve complementary regulatory patterns that preserve network stability. In contrast, escape from adaptive conflict occurs when a pre-duplication gene faces pleiotropic constraints because it performs multiple functions, none of them optimally; after duplication, each paralog can specialize, resolving the conflict through distinct regulatory adaptations that optimize individual roles. The two mechanisms often interact, with dosage constraints initially stabilizing duplicates before adaptive divergence refines their expression.

A prominent example of such regulatory divergence is seen in Hox gene paralogs, which arose from ancient duplications and display spatiotemporally distinct expression during development. In mammals, the four Hox clusters (HoxA-D) contain paralogous genes that have diverged in their anterior-posterior patterning roles; for instance, HoxC paralogs exhibit enhancer-driven differences that correlate with evolutionary shifts in limb and craniofacial development, where shared regulatory elements encode disparate specificities across paralogs. This specialization enables precise spatiotemporal control, such as sequential activation along the body axis, in contrast to more uniform ancestral expression. Comparative studies in fish such as medaka and zebrafish further show that Hox paralog group 3-6 genes diverge spatially in the hindbrain and pharyngeal arches from their mammalian orthologs, underscoring regulatory evolution after duplication.

Functionally, regulatory divergence in paralogs balances redundancy and specialization: initial overlap buffers against perturbations, while subsequent changes promote sub- or neofunctionalization. In gene knockout experiments, paralog compensation often reveals this balance, as the remaining duplicate is upregulated to maintain function, yet specialized paralogs show tissue-specific vulnerabilities when disrupted. This dynamic supports evolutionary robustness, with dosage-balanced paralogs in complexes such as transcription factors showing higher retention and more finely tuned expression to avoid imbalances.

Comparative epigenomic studies indicate that differential DNA methylation contributes to these regulatory patterns, particularly in paralogs from WGD events. In some tetraploid species, paralog pairs from distinct WGDs display asymmetric methylation at promoters and gene bodies that correlates with expression divergence; hypomethylated paralogs often show higher transcription in specific tissues, while hypermethylation silences others, facilitating subfunctionalization. Such epigenetic modifications, including histone variants, provide a heritable layer of regulation that evolves rapidly after duplication and influences paralog fate without changes to the DNA sequence.

Paralogous Regions and Ohnology

Paralogous chromosomal regions arise from large-scale segmental duplications, which create syntenic blocks of duplicated DNA sequence within the genome. These duplications often involve low-copy repeats (LCRs) with high sequence identity (>90%), which mediate non-allelic homologous recombination and contribute to genomic instability. A prominent example is the 22q11.2 region on human chromosome 22, which contains eight LCRs (LCR22-A to LCR22-H) that form paralogous segments prone to rearrangement, leading to disorders such as 22q11.2 deletion syndrome. Paralogous regions maintain collinearity, preserving gene order and facilitating evolutionary studies of duplication events.

Ohnology refers to the study of ohnologs, paralogous genes resulting from whole-genome duplications (WGDs). The term was coined by Kenneth H. Wolfe to honor Susumu Ohno's contributions to research on gene duplication. Ohnologs emerge after polyploidization events, in which entire genomes are duplicated, producing redundant copies that may diverge or be retained for functional innovation. In vertebrates, two ancient WGD events (2R) occurred in the early vertebrate lineage, producing ohnolog pairs across species; these events duplicated key developmental genes, contributing to the complexity of vertebrate physiology. Teleost fish experienced an additional WGD (3R) approximately 350 million years ago, generating extensive ohnolog families that expanded gene repertoires for adaptations such as diverse body plans and sensory systems.

Ohnologs play a pivotal role in evolutionary innovation by providing genetic material for subfunctionalization or neofunctionalization, for example in vertebrate signaling pathways. Approximately 30% of human protein-coding genes are ohnologs from the 2R events; these genes are dosage-balanced but frequently associated with disease because of their sensitivity to dosage perturbations. For example, ohnologs are enriched among genes linked to genetic disorders, with studies indicating their overrepresentation in pathogenic copy number variations.

Detection of paralogous regions and ohnologs relies on identifying synteny and collinearity, where conserved gene order and spacing indicate duplication history. Tools such as MCScanX facilitate this by scanning genomes for collinear blocks, aligning multiple sequences, and quantifying duplication events, which makes it possible to distinguish WGD-derived ohnologs from smaller-scale paralogs. This approach has been instrumental in mapping ohnolog families across genomes and revealing retention patterns after WGD.
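
The idea of scanning for collinear blocks can be sketched in a few lines of Python. This is a deliberately simplified greedy chaining of homologous gene pairs and is not the algorithm implemented in MCScanX; the function name, the `(pos_a, pos_b)` anchor representation (gene indices along two chromosomal regions), and the `max_gap` and `min_size` parameters are all assumptions made for illustration. It handles only same-orientation blocks and follows a single chain at a time.

```python
def collinear_blocks(anchors, max_gap=2, min_size=3):
    """Group homologous gene pairs into runs that advance in step along both regions.

    anchors  -- iterable of (pos_a, pos_b) gene-index pairs for homologous genes
    max_gap  -- maximum number of intervening genes allowed between consecutive anchors
    min_size -- minimum number of anchors for a run to be reported as a block
    """
    blocks = []
    current = []
    for pos_a, pos_b in sorted(set(anchors)):
        if current:
            prev_a, prev_b = current[-1]
            step_a = pos_a - prev_a
            step_b = pos_b - prev_b
            if 0 < step_a <= max_gap + 1 and 0 < step_b <= max_gap + 1:
                current.append((pos_a, pos_b))
                continue
            if len(current) >= min_size:
                blocks.append(current)  # close the finished run
        current = [(pos_a, pos_b)]
    if len(current) >= min_size:
        blocks.append(current)
    return blocks


# Toy anchors: one four-gene collinear run plus scattered matches.
toy_anchors = [(0, 10), (1, 11), (2, 12), (4, 14), (9, 3), (10, 30), (11, 31)]
toy_blocks = collinear_blocks(toy_anchors)
print(toy_blocks)
```

With the default settings only the run from `(0, 10)` to `(4, 14)` is reported; the pair of anchors at `(10, 30)` and `(11, 31)` is collinear but falls below `min_size`. Production tools score and chain anchors more carefully, including inverted blocks and interleaved duplications that this sketch would miss.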

Other Forms of Homology

Xenology

Xenology describes a form of sequence homology arising from horizontal gene transfer (HGT), in which homologous genes or gene fragments are exchanged between distantly related species. The resulting xenologs share a common ancestor but have diverged through non-vertical inheritance rather than through speciation or duplication. Unlike orthologs or paralogs, xenologs reflect lateral movement of genetic material across phylogenetic boundaries, which often complicates traditional homology classifications. The process is particularly prominent in prokaryotes, where HGT drives genome plasticity and evolutionary innovation.

The primary mechanisms of HGT leading to xenology in prokaryotes are conjugation, transduction, and transformation. Conjugation is the direct transfer of DNA, typically via plasmids, between bacterial cells in close contact, enabling the spread of mobile genetic elements. Transduction occurs when bacteriophages inadvertently package host DNA and deliver it to another bacterium during infection, while transformation is the uptake of naked DNA from the environment by competent cells. These mechanisms account for the prevalence of xenology, with estimates indicating that 1–20% of genes in bacterial genomes originate from HGT events; for instance, analyses of 24 complete prokaryotic genomes found that horizontally transferred genes made up 1.56% to 14.47% of each genome, with higher proportions in archaea and nonpathogenic bacteria.

Evolutionarily, xenology plays a key role in adaptation by allowing organisms to rapidly acquire beneficial traits from unrelated lineages, such as genes conferring resistance to antibiotics or toxins in changing environments. This lateral exchange accelerates evolutionary change compared with vertical descent alone, promoting survival in niches such as host-associated or stressful habitats. Notable examples include type IV secretion systems (T4SS) in bacteria, xenologous gene clusters transferred via HGT that facilitate conjugation and effector protein delivery, enhancing pathogenicity and genetic exchange across diverse taxa. In eukaryotes, xenology manifests through endosymbiotic gene transfer, in which bacterial genes from organelles such as mitochondria or chloroplasts are relocated to the host nucleus; in photosynthetic lineages, up to thousands of such genes have been integrated over evolutionary time.

Detecting xenologs is challenging because they resemble orthologs at the sequence level, so methods beyond alignment alone are needed to distinguish HGT from vertical descent. Initial identification often relies on sequence similarity to flag candidates, but confirmation hinges on phylogenetic incongruence, where the topology of the gene tree deviates significantly from the species tree, signaling a lateral transfer event. Additional approaches, such as parametric methods that assess atypical nucleotide composition or nonparametric reconciliation, help to resolve ambiguous cases, though false positives caused by incomplete lineage sorting can complicate inference.
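
Phylogenetic incongruence can be made concrete by comparing the clades implied by a gene tree with those of the species tree. The sketch below assumes rooted trees written as nested Python tuples with one gene per species and leaves labeled by species name; the function names and the toy topologies are hypothetical. A conflicting clade is only a signal worth investigating: incomplete lineage sorting, undetected paralogy, or tree reconstruction error can produce the same pattern.

```python
def clades(tree):
    """Return (leaf_set, clade_set) for a rooted tree given as nested tuples of leaf names."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves = leaves | child_leaves
        found |= child_clades
    found.add(leaves)  # the clade rooted at this internal node
    return leaves, found


def conflicting_clades(gene_tree, species_tree):
    """Clades present in the gene tree but absent from the species tree."""
    gene_leaves, gene_clades = clades(gene_tree)
    species_leaves, species_clades = clades(species_tree)
    if gene_leaves != species_leaves:
        raise ValueError("trees must contain the same set of leaves")
    return {clade for clade in gene_clades if clade not in species_clades}


# Toy example: in the gene tree, the E_coli copy groups with Bacillus.
species_tree = ((("E_coli", "Salmonella"), "Vibrio"), "Bacillus")
gene_tree = ((("E_coli", "Bacillus"), "Salmonella"), "Vibrio")

conflicts = conflicting_clades(gene_tree, species_tree)
for clade in sorted(conflicts, key=len):
    print(sorted(clade))
```

Here the grouping of `E_coli` with `Bacillus` contradicts the species tree, which is the kind of pattern that would prompt a closer look for a transfer between those lineages.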

Homoeology

Homoeologs are paralogous genes or chromosomes within a polyploid genome that originated from speciation events followed by allopolyploidization, and that retain sufficient similarity to allow partial synaptic pairing during meiosis despite their divergence. This distinguishes them from strict orthologs or simpler paralogs: they arise specifically from the merging of divergent genomes through hybridization and subsequent genome doubling, often in plants, where polyploidy is prevalent. In this context, homoeologs represent a specialized form of paralogy tied to whole-genome duplications (WGDs), and they fall under the broader framework of ohnology, which encompasses paralogs retained from WGD events across eukaryotes.

Homoeology is particularly common in plant polyploids, where it structures the genome into distinct subgenomes derived from ancestral species. For instance, hexaploid bread wheat (Triticum aestivum) consists of A, B, and D subgenomes from three diploid progenitors that diverged 2.5–6 million years ago, with the AABB tetraploid forming approximately 0.5–1 million years ago and the final AABBDD hexaploid approximately 8,000–10,000 years ago. Each gene or chromosome in one subgenome thus has two homoeologous counterparts in the others, contributing to the species' genetic redundancy and adaptability. Similarly, allopolyploid cottons such as upland cotton (Gossypium hirsutum) carry At and Dt subgenomes derived from an African-Asian diploid and an American diploid, respectively, with homoeologous genes that have undergone subfunctionalization since the polyploid formed 1–2 million years ago. In paleopolyploids such as soybean (Glycine max), homoeologous regions trace back to two ancient WGDs around 59 and 13 million years ago, resulting in 20 chromosome pairs with extensive homoeologous synteny that influences gene content and expression.

Evolutionarily, homoeologs are stabilized in polyploids through mechanisms that suppress homoeologous recombination, preventing deleterious chromosomal rearrangements and ensuring disomic inheritance. In wheat, the Ph1 locus on chromosome 5B plays a key role in this suppression by promoting preferential pairing of true homologs over homoeologs during meiosis. This regulation is crucial for meiotic stability but can be partially overridden, allowing limited homoeologous exchanges that contribute to genetic variation. Such dynamics also underpin hybrid vigor (heterosis) in polyploids, where nonadditive interactions between homoeologous transcripts, often involving epigenetic modifications and dosage effects, enhance traits such as stress tolerance beyond parental levels.

The agricultural significance of homoeology lies in its exploitation for crop improvement: polyploid crops such as wheat account for a substantial portion of global food production, and manipulating homoeologous genes can yield resilient varieties with superior yield or quality. In breeding programs, identifying and targeting homoeologous variants lets breeders exploit beneficial alleles while minimizing instability from recombination. Humans are not polyploid, but ohnologs derived from the ancient vertebrate WGDs (analogous to homoeologs) exhibit dosage sensitivity and contribute to disease associations such as segmental aneuploidies, in which altered copy number disrupts the balanced expression of duplicated genes.

Gametology

Gametologs are homologous genes located on sex chromosomes, such as the X and Y chromosomes in mammals or the Z and W chromosomes in birds, that have diverged following the suppression of recombination between these chromosomes. This divergence typically begins when a sex-determining gene, such as SRY on the mammalian Y, evolves on one chromosome, leading to the cessation of genetic exchange and the start of independent evolutionary trajectories for the paired genes. As a result, gametologs represent a specialized form of sequence homology shaped by sex-specific selective pressures, distinct from autosomal paralogs that arise from non-sex-linked duplications.

The primary mechanism driving gametolog divergence is the degeneration of the non-recombining sex chromosome, usually the Y or W, through processes such as Muller's ratchet, in which deleterious mutations accumulate because there is no recombination to purge them. In mammals, this leads to progressive gene loss on the Y chromosome, with only about 3% of ancestral genes retained. A classic example is the amelogenin genes AMELX on the X chromosome and AMELY on the Y chromosome in humans, which encode enamel proteins but have diverged in sequence since recombination was suppressed, with AMELY showing reduced functionality as a result of Y degeneration. Evolutionary patterns include accelerated evolution on the degenerating chromosome; for instance, Y-linked gametologs evolve up to four times faster than their X-linked counterparts because of relaxed purifying selection and exposure to male-specific mutation rates.

To counter dosage imbalances caused by Y gene loss, mammals employ X-chromosome inactivation, silencing one X in females via the XIST long noncoding RNA to equalize expression between XX females and XY males. In birds, where males are ZZ and females are ZW, gametologs on the Z and W chromosomes follow similar divergence patterns, though dosage compensation is generally incomplete and does not involve chromosome-wide inactivation. The chromo-helicase-DNA-binding (CHD) genes provide a well-studied avian example, with CHD1Z on the Z chromosome and CHD1W on the W chromosome having diverged approximately 123 million years ago, which enables their use in molecular sex identification and phylogenetic rooting.

Homology is partially retained in pseudoautosomal regions (PARs), short telomeric segments where recombination persists and facilitates proper meiotic pairing; in humans, PAR1 spans about 2.7 Mb and contains genes such as SHOX that escape degeneration.

Studies from the 2020s have revealed dynamic reversals in non-mammalian Y-linked gene evolution, including frequent sex chromosome turnovers that mitigate degeneration. In African clawed frogs (Xenopus), independently evolved Y-like chromosomes across species show rapid turnover, with some Y-linked genes regained via autosomal translocations, which prevents complete loss and highlights the adaptive flexibility of sex chromosome systems.
