Hubbry Logo
search
logo

Overlapping gene

logo
Community Hub0 Subscribers
Read side by side
from Wikipedia

An overlapping gene (or OLG)[1][2] is a gene whose expressible nucleotide sequence partially overlaps with the expressible nucleotide sequence of another gene.[3] In this way, a nucleotide sequence may make a contribution to the function of one or more gene products. Overlapping genes are present in and a fundamental feature of both cellular and viral genomes.[2] The current definition of an overlapping gene varies significantly between eukaryotes, prokaryotes, and viruses.[2] In prokaryotes and viruses overlap must be between coding sequences but not mRNA transcripts, and is defined when these coding sequences share a nucleotide on either the same or opposite strands. In eukaryotes, gene overlap is almost always defined as mRNA transcript overlap. Specifically, a gene overlap in eukaryotes is defined when at least one nucleotide is shared between the boundaries of the primary mRNA transcripts of two or more genes, such that a DNA base mutation at any point of the overlapping region would affect the transcripts of all genes involved. This definition includes 5′ and 3′ untranslated regions (UTRs) along with introns.

Overprinting refers to a type of overlap in which all or part of the sequence of one gene is read in an alternate reading frame from another gene at the same locus.[4] The alternative open reading frames (ORF) are thought to be created by critical nucleotide substitutions within an expressible pre-existing gene, which can be induced to express a novel protein while still preserving the function of the original gene.[5] Overprinting has been hypothesized as a mechanism for de novo emergence of new genes from existing sequences, either older genes or previously non-coding regions of the genome.[6] It is believed that most overlapping genes, or genes whose expressible nucleotide sequences partially overlap with each other, evolved in part due to this mechanism, suggesting that each overlap is composed of one ancestral gene and one novel gene.[7] Subsequently, overprinting is also believed to be a source of novel proteins, as de novo proteins coded by these novel genes usually lack remote homologs in databases.[8] Overprinted genes are particularly common features of the genomic organization of viruses, likely to greatly increase the number of potential expressible genes from a small set of viral genetic information.[9] It is likely that overprinting is responsible for the generation of numerous novel proteins by viruses over the course of their evolutionary history.

Classification

[edit]
Tandem out-of-phase overlap of the human mitochondrial genes ATP8 (+1 frame, in red) and ATP6 (+3 frame, in blue)[10]

Genes may overlap in a variety of ways and can be classified by their positions relative to each other.[3][11][12][13][14]

  • Unidirectional or tandem overlap: the 3' end of one gene overlaps with the 5' end of another gene on the same strand. This arrangement can be symbolized with the notation → → where arrows indicate the reading frame from start to end.
  • Convergent or end-on overlap: the 3' ends of the two genes overlap on opposite strands. This can be written as → ←.
  • Divergent or tail-on overlap: the 5' ends of the two genes overlap on opposite strands. This can be written as ← →.

Overlapping genes can also be classified by phases, which describe their relative reading frames:[3][11][12][13][14]

  • In-phase overlap occurs when the shared sequences use the same reading frame. This is also known as "phase 0". Unidirectional genes with phase 0 overlap are not considered distinct genes, but rather as alternative start sites of the same gene.
  • Out-of-phase overlaps occurs when the shared sequences use different reading frames. This can occur in "phase 1" or "phase 2", depending on whether the reading frames are offset by 1 or 2 nucleotides. Because a codon is three nucleotides long, an offset of three nucleotides is an in-phase, phase 0 frame.

Studies on overlapping genes suggest that their evolution can be summarized in two possible models.[4] In one model, the two proteins encoded by their respective overlapping genes evolve under similar selection pressures. The proteins and the overlap region are highly conserved when strong selection against amino acid change is favored. Overlapping genes are reasoned to evolve under strict constraints as a single nucleotide substitution is able to alter the structure and function of the two proteins simultaneously. A study on the hepatitis B virus (HBV), whose DNA genome contains numerous overlapping genes, showed the mean number of synonymous nucleotide substitutions per site in overlapping coding regions was significantly lower than that of non-overlapping regions.[15] The same study showed that it was possible for some of these overlapping regions and their proteins to diverge significantly from the original when there's weak selection against amino acid change. The spacer domain of the polymerase and the pre-S1 region of a surface protein of HBV, for example, had a percentage of conserved amino acids of 30% and 40%, respectively.[15] However, these overlap regions are known to be less important for replication compared to the overlap regions that were highly conserved among different HBV strains, which are absolutely essential for the process.

The second model suggests that the two proteins and their respective overlap genes evolve under opposite selection pressures: one frame experiences positive selection while the other is under purifying selection. In tombusviruses, the proteins p19 and p22 are encoded by overlapping genes that form a 549 nt coding region, and p19 is shown to be under positive selection while p22 is under purifying selection.[16] Additional examples are mentioned in studies involving overlapping genes of the Sendai virus,[17] potato leafroll virus,[18] and human parvovirus B19.[19] This phenomenon of overlapping genes experiencing different selection pressures is suggested to be a consequence of a high rate of nucleotide substitution with different effects on the two frames; the substitutions may be majorly non-synonymous for one frame while mostly being synonymous for the other frame.[4]

Evolution

[edit]

Overlapping genes are particularly common in rapidly evolving genomes, such as those of viruses, bacteria, and mitochondria. They may originate in three ways:[20]

  1. By extension of an existing open reading frame (ORF) downstream into a contiguous gene due to the loss of a stop codon;
  2. By extension of an existing ORF upstream into a contiguous gene due to loss of an initiation codon;
  3. By generation of a novel ORF within an existing one due to a point mutation.

The use of the same nucleotide sequence to encode multiple genes may provide evolutionary advantage due to reduction in genome size and due to the opportunity for transcriptional and translational co-regulation of the overlapping genes.[12][21][22][23] Gene overlaps introduce novel evolutionary constraints on the sequences of the overlap regions.[14][24]

Origins of new genes

[edit]
A cladogram indicating the likely evolutionary trajectory of the gene-dense pX region in human T-lymphotropic virus 1 (HTLV1), a deltaretrovirus associated with blood cancers. This region contains numerous overlapping genes, several of which likely originated de novo through overprinting.[9]

In 1977, Pierre-Paul Grassé proposed that one of the genes in the pair could have originated de novo by mutations to introduce novel ORFs in alternate reading frames; he described the mechanism as overprinting.[25]: 231  It was later substantiated by Susumu Ohno, who identified a candidate gene that may have arisen by this mechanism.[26] Some de novo genes originating in this way may not remain overlapping, but subfunctionalize following gene duplication,[6] contributing to the prevalence of orphan genes. Which member of an overlapping gene pair is younger can be identified bioinformatically either by a more restricted phylogenetic distribution, or by less optimized codon usage.[9][27][28] Younger members of the pair tend to have higher intrinsic structural disorder than older members, but the older members are also more disordered than other proteins, presumably as a way of alleviating the increased evolutionary constraints posed by overlap.[27] Overlaps are more likely to originate in proteins that already have high disorder.[27]

Taxonomic distribution

[edit]
Overlapping genes in the bacteriophage ΦX174 genome. There are 11 genes in this genome (A, A*, B-H, J, K). Genes B, K, E overlap with genes A, C, D.[29]

Overlapping genes occur in all domains of life, though with varying frequencies. They are especially common in viral genomes.

Viruses

[edit]
The RNA silencing suppressor p19 from tomato bushy stunt virus, a protein encoded by an overprinted gene. The protein specifically binds siRNAs produced as part of the plant's RNA silencing defense against viruses.[30]

The existence of overlapping genes was first identified in the virus ΦX174, whose genome was the first DNA genome ever sequenced by Frederick Sanger in 1977.[29] Previous analysis of ΦX174, a small single-stranded DNA bacteriophage that infected the bacteria Escherichia coli, suggested that the proteins produced during infection required coding sequences longer than the measured length of its genome.[31] Analysis of the fully sequenced 5386 nucleotide genome showed that the virus possessed extensive overlap between coding regions, revealing that some genes (like genes D and E) were translated from the same DNA sequences but in different reading frames.[29][31] An alternative start site within the genome replication gene A of ΦX174 was shown to express a truncated protein with an identical coding sequence to the C-terminus of the original A protein but possessing a different function[32][33] It was concluded that other undiscovered sites of polypeptide synthesis could be hidden through the genome due to overlapping genes. An identified de novo gene of another overlapping gene locus was shown to express a novel protein that induces lysis of E. coli by inhibiting biosynthesis of its cell wall[56], suggesting that de novo protein creation through the process of overprinting can be a significant factor in the evolution of pathogenicity of viruses.[4] Another example is the ORF3d gene in the SARS-CoV 2 virus.[1][34] Overlapping genes are particularly common in viral genomes.[9] Some studies attribute this observation to selective pressure toward small genome sizes mediated by the physical constraints of packaging the genome in a viral capsid, particularly one of icosahedral geometry.[35] However, other studies dispute this conclusion and argue that the distribution of overlaps in viral genomes is more likely to reflect overprinting as the evolutionary origin of overlapping viral genes.[36] Overprinting is a common source of de novo genes in viruses.[28]

The proportion of viruses with overlapping coding sequences within their genomes varies.[2] Double-stranded RNA viruses have fewer than a quarter that contains them while almost three-quarters of retroviridae and viruses with single-stranded DNA genomes contain overlapping coding sequences.[37] Segmented viruses in particular, or viruses with their genome split into separate pieces and packaged either all in the same capsid or in separate capsids, are more likely to contain an overlapping sequence than non-segmented viruses.[37] RNA viruses have fewer overlapping genes than DNA viruses which possess lower mutation rates and less restrictive genome sizes.[37][38] The lower mutation rate of DNA viruses facilitates greater genomic novelty and evolutionary exploration within a structurally constrained genome and may be the primary driver of the evolution of overlapping genes.[39][40]

Studies of overprinted viral genes suggest that their protein products tend to be accessory proteins which are not essential to viral proliferation, but contribute to pathogenicity. Overprinted proteins often have unusual amino acid distributions and high levels of intrinsic disorder.[41] In some cases overprinted proteins do have well-defined, but novel, three-dimensional structures;[42] one example is the RNA silencing suppressor p19 found in Tombusviruses, which has both a novel protein fold and a novel binding mode in recognizing siRNAs.[28][30][43]

Prokaryotes

[edit]

Estimates of gene overlap in bacterial genomes typically find that around one third of bacterial genes are overlapped, though usually only by a few base pairs.[12][44][45] Most studies of overlap in bacterial genomes find evidence that overlap serves a function in gene regulation, permitting the overlapped genes to be transcriptionally and translationally co-regulated.[12][23] In prokaryotic genomes, unidirectional overlaps are most common, possibly due to the tendency of adjacent prokaryotic genes to share orientation.[12][14][11] Among unidirectional overlaps, long overlaps are more commonly read with a one-nucleotide offset in reading frame (i.e., phase 1) and short overlaps are more commonly read in phase 2.[45][46] Long overlaps of greater than 60 base pairs are more common for convergent genes; however, putative long overlaps have very high rates of misannotation.[47] Robustly validated examples of long overlaps in bacterial genomes are rare; in the well-studied model organism Escherichia coli, only four gene pairs are well validated as having long, overprinted overlaps.[48]

Eukaryotes

[edit]

Compared to prokaryotic genomes, eukaryotic genomes are often poorly annotated and thus identifying genuine overlaps is relatively challenging.[28] However, examples of validated gene overlaps have been documented in a variety of eukaryotic organisms, including mammals such as mice and humans.[49][50][51][52] Eukaryotes differ from prokaryotes in distribution of overlap types: while unidirectional (i.e., same-strand) overlaps are most common in prokaryotes, opposite or antiparallel-strand overlaps are more common in eukaryotes. Among the opposite-strand overlaps, convergent orientation is most common.[50] Most studies of eukaryotic gene overlap have found that overlapping genes are extensively subject to genomic reorganization even in closely related species, and thus the presence of an overlap is not always well-conserved.[51][53] Overlap with older or less taxonomically restricted genes is also a common feature of genes likely to have originated de novo in a given eukaryotic lineage.[51][54][55]

Function

[edit]

The precise functions of overlapping genes seems to vary across the domains of life but several experiments have shown that they are important for virus lifecycles through proper protein expression and stoichiometry [56] as well as playing a role in proper protein folding.[57] A version of bacteriophage ΦX174 has also been created where all gene overlaps were removed [58] proving they were not necessary for replication.

The retention and evolution of overlapping genes within viruses may also be due to capsid size limitations.[59] Dramatic viability loss was observed in viruses with genomes engineered to be longer than the wild-type genome.[60] Increasing the single-stranded DNA genome length of ΦX174 by >1% results in almost complete loss of infectivity, believed to be the result of the strict physical constraints imposed by the finite capsid volume.[61] Studies on adeno-associated viruses as gene delivery vectors showed that viral packaging is constrained by genetic cargo size limits, requiring the use of multiple vectors to deliver large human genes such as CFTR81.[62][63] Therefore, it is suggested that overlapping genes evolved as a means to overcome these physical constraints, increasing genetic diversity by utilizing only the existing sequence rather than increasing genome length.

Methods in identifying overlapping genes and ORFs

[edit]

Standardized methods such as genome annotation may be inappropriate for the detection of overlapping genes as they are reliant on already curated genes while overlapping genes are generally overlooked contain atypical sequence composition.[2][64][65][66] Genome annotation standards are also often biased against feature overlaps, such as genes entirely contained within another gene.[67] Furthermore, some bioinformatics pipelines such as the RAST pipeline markedly penalizes overlaps between predicted ORFs.[68] However, rapid advancement of genome-scale protein and RNA measurement tools along with increasingly advanced prediction algorithms have revealed an avalanche of overlapping genes and ORFs within numerous genomes.[2] Proteogenomic methods have been essential in discovering numerous overlapping genes and include a combination of techniques such as bottom-up proteomics, ribosome profiling, DNA sequencing, and perturbation. RNA sequencing is also used to identify genomic regions containing overlapping transcripts. It has been utilized to identify 180,000 alternate ORFs within previously annotated coding regions found in humans.[69] Newly discovered ORFs such as these are verified using a variety of reverse genetics techniques, such as CRISPR-Cas9 and catalytically dead Cas9 (dCas9) disruption.[70][71][72] Attempts at proof-by-synthesis are also performed to show beyond doubt the absence of any undiscovered overlapping genes.[73]

See also

[edit]

References

[edit]
Revisions and contributorsEdit on WikipediaRead on Wikipedia
from Grokipedia
An overlapping gene is a genomic region in DNA or RNA that encodes multiple distinct proteins or functional RNAs by sharing nucleotide sequences, typically through translation in different reading frames, use of alternative initiation codons, or transcription from complementary strands.[1] These genes are defined by the overlap of at least one nucleotide between the coding sequences (CDSs) of two or more genes, enabling efficient use of genetic material in compact genomes.[1] Overlapping genes can be unidirectional (same transcriptional direction), bidirectional (opposite directions), or nested (one gene embedded within another), and they occur across viruses, prokaryotes, and eukaryotes.[1] Overlapping genes are particularly prevalent in viral genomes, where they facilitate extreme compaction; for instance, the bacteriophage phiX174 contains multiple overlapping genes within its small 5,386-base-pair genome, a discovery that highlighted their role in encoding up to 11 proteins from limited sequence.[1] In prokaryotes like bacteria, short overlaps of 1–5 nucleotides are common, especially in phase 2 of bacterial genomes, and they contribute to mutational robustness by allowing coordinated expression of related functions.[2] Eukaryotic genomes also harbor overlapping genes, though they are less frequent and often involve tail-to-tail (66.42%) or head-to-head (30.81%) configurations in humans, with hundreds of such pairs identified in mammalian species.[2] Notable eukaryotic examples include the overlapping tRNA genes in the human mitochondrial genome, such as tRNAIle and tRNAGln sharing a 3-nucleotide sequence (5′-CTA-3′ and 5′-TAG-3′), and nested genes in Drosophila where one gene resides within an intronic region of another.[2] Evolutionarily, overlapping genes promote genome efficiency by maximizing coding density and providing flexibility for de novo gene emergence, often under positive or purifying selection to maintain functional integration.[1] Proteins encoded by overlapping genes exhibit distinct sequence compositions compared to those from non-overlapping genes, with enrichment in high-degeneracy amino acids like arginine, serine, and proline, and depletions in low-degeneracy ones, which may reduce evolutionary constraints.[3] In viruses such as SARS-CoV-2, overlapping genes like ORF3a evolve dynamically to enhance adaptability, underscoring their role in rapid viral evolution.[1] While only six experimentally verified mammalian overlapping protein-coding genes exist, ribosome profiling techniques like Ribo-seq have revealed many more alternative open reading frames (ORFs) in eukaryotic transcriptomes, suggesting underappreciated prevalence.[3]

Definition and Classification

Definition

An overlapping gene is a genomic feature where the coding sequences (CDS) of two or more genes share at least one nucleotide in the DNA or RNA sequence, enabling the production of multiple proteins or functional RNAs from the same stretch of genetic material.[1] This arrangement contrasts with non-overlapping genes, whose CDS are entirely distinct and do not share nucleotides, thereby allowing overlaps to achieve greater informational density without relying on pseudogenes or non-coding regulatory elements.[3] Such sharing is particularly prevalent in genomes with limited space, such as those of viruses and prokaryotes.[4] Overlaps can occur on the same DNA strand, known as sense overlaps, where genes are translated in different reading frames—typically shifted by +1 or +2 nucleotides relative to each other—or on opposite strands, referred to as antisense overlaps, where one gene is encoded on the complementary strand.[5] In sense overlaps, the frame shift ensures that the codons for each protein are distinct despite the nucleotide overlap, while antisense overlaps involve transcription from both strands, often producing complementary RNA molecules.[1] These configurations allow for the simultaneous expression of multiple functional products from a single genomic locus.[6] The concept of overlapping genes was first identified during the sequencing of the bacteriophage φX174 genome in 1977, where multiple protein-coding regions were found to share nucleotides within the compact viral DNA.[7] This discovery, achieved through pioneering DNA sequencing methods, initially identified nine proteins encoded within the approximately 5,375-nucleotide genome through such overlaps; subsequent refinements determined the genome size as 5,386 bp encoding 11 proteins.[8][9]

Types of Overlaps

Overlapping genes can be classified by their topological arrangements, which describe how their coding sequences (CDS) or expressible regions interact spatially on the DNA strand. Fully overlapping genes share their entire CDS, meaning one gene's sequence completely encompasses the other's, often resulting in two proteins encoded from the same nucleotides but in different reading frames. Partially overlapping genes share only a subset of their CDS, typically at the 5' or 3' ends, allowing for independent transcription starts or stops while conserving a common region. Nested overlaps occur when one gene is entirely contained within the boundaries of another, such as an internal open reading frame (ORF) embedded in the introns or exons of a host gene. Convergent overlaps feature genes on opposite strands that share their 3' ends (tail-to-tail arrangement, →←), while divergent overlaps share their 5' ends (head-to-head, ←→), enabling bidirectional transcription from a common promoter region.[10][11][12] Overlaps can further be distinguished by whether they involve protein-coding or RNA-coding elements. Protein-coding overlaps typically utilize alternative open reading frames (altORFs), where a secondary CDS is embedded within the primary gene's sequence, often shifted by one or two nucleotides to avoid the main frame's stop codons and produce a distinct polypeptide. In contrast, RNA-coding overlaps involve non-protein-coding transcripts, such as microRNAs (miRNAs) or long non-coding RNAs (lncRNAs), that are transcribed from regions overlapping protein-coding genes, for instance, miRNAs hosted within the introns of a primary gene without altering its CDS. These distinctions highlight how overlaps can encode both translated products and regulatory RNAs from shared genomic space.[1][13][14] The extent of overlap varies widely, ranging from as little as 1 nucleotide to the full length of the shorter gene, influencing the degree of sequence constraint between the genes. Frame differences are critical in same-strand overlaps, with common offsets of +1 or +2 nucleotides relative to the primary frame; for example, +1 frame overlaps are prevalent in viral genomes, where a downstream gene initiates one nucleotide after the upstream gene's start, maximizing coding density in compact sequences. These frame shifts ensure that start and stop codons align appropriately without premature termination in either ORF.[1][15][16] A notable example of an overlapping gene pair in humans is POLG, which encodes the catalytic subunit of mitochondrial DNA polymerase, and POLGARF, an alternative ORF that overlaps extensively with POLG's CDS in a +1 frame, initiating at a CUG codon within the primary mRNA to produce a distinct protein. This configuration demonstrates a sense-strand overlap where both genes are translated from the same transcript.[17]

Evolutionary Aspects

Origins and Mechanisms

Overlapping genes emerge primarily through overprinting, in which a new open reading frame (ORF) arises within the sequence of an existing gene by exploiting an alternative reading frame, enabling the same nucleotide stretch to encode multiple proteins.[18] This process often begins with point mutations that introduce a viable start codon or eliminate a stop codon in the secondary frame, as observed in viral genomes where compact organization favors such innovations.[15] Another key mechanism is gene duplication followed by a frameshift, where a copy of an existing gene undergoes a reading frame alteration—typically a +1 or +2 shift—leading to partial overlap with the ancestral gene while preserving partial sequence identity.[1] De novo emergence from non-coding regions involves successive mutations that generate a functional ORF overlapping adjacent genes, often activated by nearby regulatory elements. Additionally, retrotransposition contributes by integrating RNA-derived sequences into introns or exons of existing genes, creating nested or antisense overlaps, particularly in eukaryotic genomes.[1] Mutations play a central role in initiating and stabilizing these overlaps. Insertions and deletions (indels) frequently cause frameshifts that disrupt one reading frame while potentially creating a new functional ORF in an alternative frame, as seen in bacterial and viral evolution.[19] Once formed, these overlapping configurations are maintained by purifying selection, which eliminates deleterious variants but retains those conferring adaptive advantages, such as enhanced regulatory control or genome compaction.[20] In viruses, this selection is particularly stringent due to small genome sizes and high mutation rates, preserving overlaps that optimize coding capacity.[15] The historical recognition of overlapping genes began in the 1970s with early viral examples, such as the identification of multiple ORFs in the bacteriophage ΦX174 genome, marking the first confirmed case of dual-coding sequences. Prokaryotic overlaps were documented in the 1980s, with systematic reviews revealing their prevalence in bacterial operons, such as overlaps in termination and initiation sites. In eukaryotes, overlapping alternative ORFs (altORFs) gained prominence in the 2010s through proteomics and ribosome profiling, uncovering thousands of translated overlaps previously overlooked in annotation efforts.

Advantages and Constraints

Overlapping genes confer evolutionary advantages by enhancing evolvability in compact genomes, where a single nucleotide change can simultaneously alter proteins from multiple reading frames, enabling rapid acquisition of new functions. This dual-impact potential facilitates adaptation under selective pressure, as seen in viruses where overlaps allow for the emergence of multifunctional proteins without expanding genome size. For example, in HIV-1, the overlapping env and rev genes exhibit positive selection (dN/dS > 1 in certain motifs), promoting functional segregation that purges unfit genotype combinations and boosts viral population fitness. [21] [1] Another advantage is increased genetic robustness and protection against deleterious mutations, as overlaps distribute functional constraints across frames, making the genome more resilient to errors. This redundancy buffers against loss-of-function mutations, particularly in high-mutation-rate environments like viral genomes, where overlaps maintain essential activities through compensatory changes. Theoretical models demonstrate that such antiredundancy in overlaps heightens overall genomic stability by linking the fates of co-encoded proteins. Despite these benefits, overlapping genes face substantial constraints due to mutational interference, where alterations in shared sequences disrupt both genes, restricting independent divergence and imposing stricter evolutionary pressures. Reduced codon flexibility in overlaps often results in amino acid biases, as synonymous choices must satisfy dual coding requirements, limiting sequence space for optimization. Overlaps typically experience heightened purifying selection to conserve functionality, evidenced by lower nucleotide substitution rates in fourfold degenerate sites compared to non-overlapping regions (P < 0.001). [22] [1] Empirical studies highlight these trade-offs: in viruses like HIV, overlaps show signatures of positive selection (e.g., dN/dS > 1) driving adaptive evolution, while prokaryotic overlaps often display neutral dynamics or stronger purifying selection (dN/dS < 1), with high turnover rates indicating constraints on long-term persistence. In bacterial genomes, the prevalence of specific overlap phases (e.g., +2 frames) reflects directional selection minimizing interference, yet overall, overlaps evolve more slowly than non-overlapping genes due to these limitations. [19] [22]

Taxonomic Distribution

In Viruses

Overlapping genes are particularly prevalent in viral genomes, which are often highly compact to fit within constrained capsid sizes. An analysis of 5,976 reference viral genomes from the NCBI Virus database revealed that 53% contain at least one gene overlap exceeding 50 nucleotides, with single-stranded DNA viruses showing the highest prevalence at 65%, followed by double-stranded DNA viruses at 61% and positive-sense single-stranded RNA viruses at 43%. In extremely compact genomes, such as those of bacteriophage φX174—a single-stranded DNA virus with a 5,386 bp genome encoding 11 genes—overlaps affect eight of the genes, allowing maximal coding efficiency in a minimal space. Similarly, positive-sense single-stranded RNA viruses in the family Picornaviridae exhibit high overlap frequencies, contributing to their dense genomic organization. Notable examples include the human immunodeficiency virus (HIV-1), where the tat and rev genes overlap, enabling coordinated regulation of viral transcription and RNA export through multifunctional proteins. In severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the ORF1a and ORF1b open reading frames overlap via a -1 ribosomal frameshift mechanism, producing polyproteins that are cleaved into 16 nonstructural proteins essential for replication. Recent discoveries in the 2020s have identified additional novel overlapping genes in coronaviruses, such as ORF3d in SARS-CoV-2, which emerges from an alternative frame and may enhance viral fitness. These overlaps confer adaptive advantages in viruses, facilitating rapid evolution by enabling de novo gene creation without genome expansion and promoting immune evasion through multifunctional proteins that suppress host defenses. For instance, in asymmetric overlaps, the newer frame often evolves under positive selection to encode accessory proteins that boost pathogenicity, as seen in various RNA viruses where overlaps stabilize under purifying selection post-origin. This strategy allows viruses to maintain genetic stability while adapting to host pressures, with overlaps arising primarily through overprinting of existing genes.

In Prokaryotes

Overlapping genes are a common feature in prokaryotic genomes, with approximately one-third of all genes participating in overlaps across diverse bacterial and archaeal species.[23] These overlaps frequently occur at operon boundaries, facilitating coordinated regulation and expression. In bacterial genomes, such as those of Escherichia coli, short overlaps of 4–50 base pairs are prevalent, often involving essential operons for metabolic pathways.[1] A notable example is the menaquinone biosynthesis operon in E. coli, where three consecutive genes exhibit short stop-start overlaps that promote translational coupling, ensuring stoichiometric production of enzymes in the electron transport chain.[1] Similarly, in Bacillus subtilis, ribosomal protein genes like rpsO and neighboring loci display overlaps that support ribosome assembly by linking translation initiation of downstream genes to upstream termination.[24] These configurations are particularly abundant in minimal genomes, such as those of Mycoplasma pneumoniae, where overlaps constitute up to 10% of gene pairs, aiding genome compaction while maintaining essential functions.[25] Functionally, these overlaps enable coordinate expression in operons critical for metabolic pathways, such as amino acid synthesis and ribosomal biogenesis, by allowing ribosome re-initiation without dedicated intergenic regions.[1] In compact prokaryotic genomes, this arrangement imposes evolutionary constraints, favoring short overlaps to balance mutation rates and regulatory precision.[23] Advanced proteomic and ribosome profiling have revealed previously undetected long overlaps, such as a 603 bp nested overlap in the E. coli ompA gene with pH-regulated function.[26] These studies highlight overlaps' role in enhancing expression efficiency under nutrient limitation, particularly in fast-growing bacteria.[1]

In Eukaryotes

Overlapping genes are considerably rarer in eukaryotic genomes compared to prokaryotes and viruses, comprising less than 1% of protein-coding genes in humans when considering strictly translated overlapping reading frames, though broader definitions including antisense overlaps can reach up to 10% of genes in human and mouse genomes.[27] These overlaps frequently occur within untranslated regions (UTRs), introns, or through alternative splicing, allowing for dual or multiple protein production from a single locus without extensive coding sequence overlap. A prominent example is the human CDKN2A locus, which encodes the tumor suppressors p16^INK4a and p14^ARF through alternative promoters, splicing, and partially overlapping reading frames, enabling distinct cell cycle regulation functions. This configuration is conserved across mammals and exemplifies how eukaryotic overlaps integrate with splicing machinery to enhance regulatory complexity in larger genomes. Additional examples include nested arrangements, such as the human ribosomal protein gene RPL36A, which harbors an alternative overlapping reading frame (altORF) termed Alt-RPL36 within its coding sequence; this altORF produces a short peptide that modulates the PI3K-AKT-mTOR signaling pathway, influencing cellular responses to stress and nutrient availability.[28] In unicellular eukaryotes like yeast (Saccharomyces cerevisiae), altORFs are also prevalent, with ribosome profiling revealing dozens of functional upstream or overlapping ORFs in genes involved in stress responses, such as those within the HSP82 chaperone transcript, contributing to proteome diversity without disrupting the primary protein.[29] These cases highlight how eukaryotic overlaps often rely on non-AUG initiation codons or internal ribosome entry to enable translation of secondary products. Recent advances in proteogenomics have uncovered previously overlooked eukaryotic overlapping genes, exemplified by the 2020 discovery of POLGARF, a 260-amino-acid protein encoded by a conserved CUG-initiated overlapping ORF within the human POLG mRNA, which encodes the mitochondrial DNA polymerase gamma; this overlap, detected via ribosome profiling and mass spectrometry, suggests roles in extracellular signaling and evolved around 160 million years ago in placental mammals. Similar proteogenomic approaches in 2021 and beyond, including CRISPR-based functional screens, have identified hundreds of altORFs and overlapping genes across eukaryotic species. In plants, emerging studies using large-scale transcriptomic and foundation model analyses have begun to reveal potential overlapping transcripts in species like Arabidopsis thaliana, particularly cis-natural antisense pairs regulated by small RNAs, though comprehensive functional validation remains ongoing.[30] Despite these insights, the functions of most eukaryotic overlapping genes remain understudied, with many altORFs and secondary products lacking clear phenotypic roles beyond preliminary associations with stress adaptation or signaling. Overlaps like those in CDKN2A and RPL36A are increasingly linked to diseases, including various cancers, where disruptions in dual-gene expression contribute to tumorigenesis through loss of tumor suppression or pathway dysregulation. This scarcity of knowledge underscores the need for integrated multi-omics approaches to elucidate their contributions to eukaryotic biology and pathology.[31]

Biological Functions

Genome Compaction and Efficiency

Overlapping genes enable organisms to encode multiple proteins from shared nucleotide sequences, thereby maximizing coding capacity within constrained genomic space. This mechanism is particularly vital for viruses and other genome size-limited entities, where every nucleotide must contribute to essential functions. For instance, bacteriophage φX174 utilizes extensive overlaps across all three reading frames to produce 11 proteins from its compact 5,386-base-pair genome, demonstrating how overlaps expand protein output without proportional genome enlargement.[32][1] In terms of efficiency, overlapping genes significantly boost coding density. Engineered minimal genomes further illustrate this benefit; the synthetic bacterium JCVI-syn3.0, with its 531-kilobase-pair genome encoding 473 genes, relies on streamlined organization to achieve viability while minimizing size, highlighting overlaps' role in balancing compactness and functionality.[10][33] The biogenesis of these genes frequently involves polycistronic transcripts in prokaryotes and viruses, where a single mRNA molecule carries multiple open reading frames, facilitating coordinated translation and further enhancing efficiency. Translational coupling in such transcripts ensures that the expression of downstream genes is linked to upstream ones, reducing regulatory overhead and promoting rapid protein production in resource-limited environments.[10]

Regulatory and Protective Roles

Overlapping genes play crucial roles in regulating gene expression through mechanisms such as translational coupling, where the translation of an upstream open reading frame (ORF) influences the efficiency of downstream gene translation, often repressing it to fine-tune protein levels.[34] In bacteria, natural antisense RNAs derived from overlapping genes act as regulatory elements by base-pairing with target mRNAs, leading to degradation or translational inhibition that modulates stress responses and metabolic pathways.[35] Eukaryotic overlapping genes can employ alternative start sites within shared sequences to generate diverse protein isoforms, enhancing functional versatility without expanding genome size.[1] Antisense-mediated silencing is another key regulatory function, where transcripts from overlapping genes form double-stranded RNA hybrids that recruit chromatin-modifying complexes or RNA interference machinery to suppress sense gene expression.[36] For instance, in head-to-head overlapping configurations, RNA-DNA interactions from one transcript can block promoter access for the partner gene, providing a layer of transcriptional control observed in mammalian genomes.[37] Non-coding antisense transcripts overlapping protein-coding genes also serve as sponges for microRNAs, indirectly stabilizing target mRNAs and creating feedback loops that sustain gene expression during development.[38] In protective roles, overlapping genes buffer against loss-of-function mutations by creating dual-essential configurations, where disruption of one gene impairs the overlapping partner, thereby purging deleterious variants and maintaining genomic integrity in high-mutation environments like RNA viruses.[39] In begomoviruses, such overlaps in essential genes like AC1 reduce mutation accumulation rates, providing evolutionary robustness against error-prone replication.[39] Viral examples include potyviruses, where overlapping essential genes such as P3 and the polyprotein-encoded PIPO ensure that mutations lethal to one frame are constrained by pleiotropic effects on the other, preserving infectivity.[40] Bacterial overlapping genes contribute to anti-phage defense through toxin-antitoxin systems encoded as genes-within-genes, where the antitoxin ORF overlaps the toxin, enabling rapid activation upon infection to abort phage replication via host cell toxicity.[41] This nested arrangement allows precise spatiotemporal control, as the antitoxin neutralizes the toxin under normal conditions but permits toxin dominance during phage invasion, enhancing survival.[41] A prominent example of regulatory overlap is found in HIV-1, where the tat and rev genes overlap in multiple reading frames, coordinating viral gene expression and latency; mutations in the shared region disrupt Rev-mediated export of unspliced RNAs, promoting transcriptional silencing and reservoir persistence.[42] Recent studies in 2025 have leveraged AI to design synthetic overlapping genes for precise regulatory control, demonstrating that deep generative models can create functional overlaps in bacterial genomes that couple translation of multiple outputs, offering tunable synthetic circuits for biotechnology.[43] These AI-engineered overlaps mimic natural regulatory logic while bypassing evolutionary constraints, enabling applications in metabolic engineering.[44]

Identification and Analysis Methods

Computational Methods

Computational methods for identifying and annotating overlapping genes rely on algorithmic approaches that scan genomic sequences for open reading frames (ORFs) in multiple frames, quantify overlaps, and assess functional potential through sequence conservation or bias analysis. These in silico tools process nucleotide sequences to predict candidate overlapping genes (OLGs), often integrating alignment-based searches and statistical models to distinguish true overlaps from artifacts. Key software includes OLGenie, which estimates purifying selection coefficients (dN/dS ratios) in potential OLGs using multiple sequence alignments to predict functional overlaps with low false-positive rates.[45] OpenProt employs ORF prediction algorithms like PRICE to identify alternative ORFs (altORFs), including those overlapping canonical genes, by scanning all reading frames and incorporating proteogenomic evidence for eukaryotic genomes.[46] Additionally, GETORF, part of the EMBOSS suite, facilitates frame-specific ORF extraction across six reading frames, enabling initial detection of potential overlaps by identifying start and stop codon boundaries in non-canonical frames. Algorithms underpinning these tools often utilize hidden Markov models (HMMs) for probabilistic ORF scanning, where states represent coding or non-coding regions, and transitions model frame shifts to tolerate overlaps in compact genomes like viruses.[1] More recent advances incorporate machine learning. These methods extend traditional gene finders like Glimmer, which use HMMs trained on confirmed OLGs to improve sensitivity for overlapping predictions.[1] A typical workflow begins with input of assembled genomic sequences, followed by six-frame translation and alignment of coding sequences (CDS) using tools like BLAST for homology detection or HMMER for profile-based searches to quantify overlap extent and conservation.[1] Overlap regions are then scored for functionality, such as through selection analysis in OLGenie or bias detection in OpenProt, yielding annotated OLG candidates. However, these approaches can produce false positives in noisy or divergent data, such as low-coverage assemblies or highly variable viral sequences; integrating codon usage bias models mitigates this by filtering overlaps lacking organism-specific codon preferences.[1] Experimental validation remains essential to confirm predicted OLGs.[1]

Experimental Methods

Proteogenomics integrates mass spectrometry-based proteomics with genomic and transcriptomic data to detect peptides derived from alternative open reading frames (altORFs), including those overlapping annotated genes, thereby validating their translation in human cells. In the 2010s, large-scale studies applied tandem mass spectrometry (MS/MS) to human proteome samples, identifying novel peptides from unannotated regions that expanded the known proteome by thousands of entries. For instance, a 2018 proteogenomics analysis of human cell lines and tissues used MS/MS to discover coding regions in previously non-coding sequences, including overlapping altORFs, by matching spectra against custom databases derived from RNA-seq data. This approach has been pivotal in confirming the expression of small overlapping proteins in eukaryotes, with one 2024 study screening over 50,000 MS runs to identify more than 170,000 novel peptides, many from altORFs overlapping canonical genes in human tissues.[47][48][49] Ribosome profiling, or Ribo-seq, maps ribosome-protected mRNA fragments to reveal actively translated regions, including overlapping frames that computational methods might overlook, providing direct evidence of altORF translation. This technique has been instrumental in viral genomes, where overlapping genes are common; for example, Ribo-seq analysis of SARS-CoV-2-infected cells in 2020 detected translation of canonical and non-canonical ORFs, including the overlapping ORF3d within the N gene, with ribosome footprints accumulating at specific start sites. In eukaryotes, Ribo-seq variants have identified translation of upstream overlapping ORFs. These studies typically involve treating cells with translation inhibitors like cycloheximide, isolating ribosome-protected fragments, and sequencing them to quantify translation efficiency across frames.[50][51][52] Functional assays confirm the biological roles of overlapping genes by perturbing one frame and observing impacts on the other, often using CRISPR-based knockouts or reporter gene fusions. CRISPR-Cas9 knockouts target specific altORFs to assess phenotypic effects, such as in a 2020 human genome-wide screen that disrupted overlapping genes and revealed dependencies in cancer cells, where loss of an altORF altered canonical protein function and cell viability. Reporter gene fusions, such as fusing overlapping ORF promoters or coding sequences to luciferase or GFP, quantify expression regulation; in bacterial systems, these have demonstrated how stop-start overlaps control translation efficiency, while in viruses like turnip yellow mosaic virus, dual-reporter constructs showed coordinated expression of nested overlapping ORFs. These assays highlight interdependencies, as mutating one overlapping gene often modulates the other's stability or activity.[48][53][54] Recent advances in single-cell Ribo-seq, emerging around 2023, enable detection of altORF translation in individual eukaryotic cells, addressing heterogeneity in overlapping gene expression that bulk methods miss. Techniques like nanoRibo-seq, optimized for low-input samples, have revealed cell-type-specific translation of upstream altORFs in neuronal genes, with ribosome pausing patterns indicating regulatory roles. A 2021 Nature Reviews Genetics article on overlapping genes emphasized these methods' potential, citing examples from human and viral systems where single-cell resolution uncovered dynamic altORF activity during stress or infection, paving the way for tissue-specific functional studies.[55][1]

References

User Avatar
No comments yet.