Gene duplication
Gene duplication (or chromosomal duplication or gene amplification) is a mechanism through which new genetic material is generated during molecular evolution. It can be defined as any duplication of a region of DNA that contains a gene. Gene duplications can arise as products of several types of errors in DNA replication and repair machinery as well as through fortuitous capture by selfish genetic elements. Common sources of gene duplications include ectopic recombination, retrotransposition, aneuploidy, polyploidy, and replication slippage.[1]
Mechanisms of duplication
Ectopic recombination
Duplications arise from an event termed unequal crossing-over that occurs during meiosis between misaligned homologous chromosomes. The likelihood of this event is a function of the degree of sharing of repetitive elements between two chromosomes. The products of this recombination are a duplication at the site of the exchange and a reciprocal deletion. Ectopic recombination is typically mediated by sequence similarity at the duplicate breakpoints, which form direct repeats. Repetitive genetic elements such as transposable elements offer one source of repetitive DNA that can facilitate recombination, and they are often found at duplication breakpoints in plants and mammals.[2]
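The reciprocal products of a misaligned exchange can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: chromosomes are lists of segment labels, "R" stands for a repetitive element, "G" for a gene, and the function name `unequal_crossover` is hypothetical.

```python
def unequal_crossover(chrom_a, chrom_b, break_a, break_b):
    """Exchange the tails of two chromosomes cut at the given indices.

    Equal indices model a normal, aligned crossover; different indices
    model a misaligned (unequal) crossover.
    """
    recombinant_1 = chrom_a[:break_a] + chrom_b[break_b:]
    recombinant_2 = chrom_b[:break_b] + chrom_a[break_a:]
    return recombinant_1, recombinant_2


# Toy homolog: gene "G" flanked by two copies of a repeat "R".
homolog = ["x", "R", "G", "R", "y"]

# Misalignment: the second repeat of one homolog (cut after index 3)
# pairs with the first repeat of the other (cut after index 1).
duplication, deletion = unequal_crossover(homolog, homolog, 4, 2)

print(duplication)  # ['x', 'R', 'G', 'R', 'G', 'R', 'y']
print(deletion)     # ['x', 'R', 'y']
```

One product carries two copies of the gene in tandem, and the other has lost it, matching the duplication and reciprocal deletion described above.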

Replication slippage
Replication slippage is an error in DNA replication that can produce duplications of short genetic sequences. During replication, DNA polymerase begins to copy the DNA. At some point during the replication process, the polymerase dissociates from the DNA and replication stalls. When the polymerase reattaches to the DNA strand, it aligns the replicating strand to an incorrect position and incidentally copies the same section more than once. Replication slippage is also often facilitated by repetitive sequences, but requires only a few bases of similarity.[citation needed]
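A minimal sketch of this process, assuming a highly simplified model in which the new strand is written as a same-sequence copy of the template (complementarity is ignored) and the polymerase realigns a fixed number of bases upstream. The function name and parameters are hypothetical.

```python
def replicate_with_slippage(template, slip_at, slip_back):
    """Copy `template`, but after `slip_at` bases realign `slip_back` bases
    upstream, so that the intervening bases are copied a second time."""
    if not 0 <= slip_back <= slip_at <= len(template):
        raise ValueError("need 0 <= slip_back <= slip_at <= len(template)")
    copied_before_stall = template[:slip_at]
    resume_position = slip_at - slip_back
    return copied_before_stall + template[resume_position:]


# Template with two CAG repeats; the polymerase stalls after the second
# repeat and realigns one repeat unit (3 bases) upstream.
template = "ATGCAGCAGTTC"
product = replicate_with_slippage(template, slip_at=9, slip_back=3)

print(product)  # ATGCAGCAGCAGTTC (three CAG repeats instead of two)
```

Because the realignment only needs the last few copied bases to match an upstream position, a short repeat such as CAG is enough to allow the extra copy in this toy model.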
Retrotransposition
Retrotransposons, mainly L1, can occasionally act on cellular mRNA. Transcripts are reverse transcribed to DNA and inserted at random places in the genome, creating retrogenes. The resulting sequences usually lack introns and often contain poly(A) sequences that are also integrated into the genome. Many retrogenes display changes in gene regulation in comparison to their parental gene sequences, which sometimes results in novel functions. Retrogenes can move between different chromosomes to shape chromosomal evolution.[3]
Aneuploidy
Aneuploidy occurs when nondisjunction at a single chromosome results in an abnormal number of chromosomes. Aneuploidy is often harmful and in mammals regularly leads to spontaneous abortions (miscarriages). Some aneuploid individuals are viable, for example trisomy 21 in humans, which leads to Down syndrome. Aneuploidy often alters gene dosage in ways that are detrimental to the organism; therefore, it is unlikely to spread through populations.
Polyploidy
Polyploidy, or whole genome duplication, is a product of nondisjunction during meiosis which results in additional copies of the entire genome. Polyploidy is common in plants, but it has also occurred in animals, with two rounds of whole genome duplication (2R event) in the vertebrate lineage leading to humans.[4] It has also occurred in the hemiascomycete yeasts ~100 mya.[5][6]
After a whole genome duplication, there is a relatively short period of genome instability, extensive gene loss, elevated levels of nucleotide substitution and regulatory network rewiring.[7][8] In addition, gene dosage effects play a significant role.[9] Thus, most duplicates are lost within a short period; however, a considerable fraction of duplicates survive.[10] Interestingly, genes involved in regulation are preferentially retained.[11][12] Furthermore, retention of regulatory genes, most notably the Hox genes, has led to adaptive innovation.
Rapid evolution and functional divergence have been observed at the level of the transcription of duplicated genes, usually by point mutations in short transcription factor binding motifs.[13][14] Furthermore, rapid evolution of protein phosphorylation motifs, usually embedded within rapidly evolving intrinsically disordered regions is another contributing factor for survival and rapid adaptation/neofunctionalization of duplicate genes.[15] Thus, a link seems to exist between gene regulation (at least at the post-translational level) and genome evolution.[15]
Polyploidy is also a well known source of speciation, as offspring, which have different numbers of chromosomes compared to parent species, are often unable to interbreed with non-polyploid organisms. Whole genome duplications are thought to be less detrimental than aneuploidy as the relative dosage of individual genes should be the same.
As an evolutionary event
Rate of gene duplication
Comparisons of genomes demonstrate that gene duplications are common in most species investigated. This is indicated by variable copy numbers (copy number variation) in the genomes of humans[16][17] and fruit flies.[18] However, it has been difficult to measure the rate at which such duplications occur. Recent studies yielded a first direct estimate of the genome-wide rate of gene duplication in Caenorhabditis elegans, the first multicellular eukaryote for which such an estimate became available. The gene duplication rate in C. elegans is on the order of 10⁻⁷ duplications/gene/generation; that is, for any given gene, about one worm in a population of 10 million is expected to acquire a duplication of that gene per generation. This rate is two orders of magnitude greater than the spontaneous rate of point mutation per nucleotide site in this species.[19] Older (indirect) studies reported locus-specific duplication rates in bacteria, Drosophila, and humans ranging from 10⁻³ to 10⁻⁷/gene/generation.[20][21][22]
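The arithmetic behind these figures can be checked with a few lines of code. This is a back-of-the-envelope sketch that uses only the order-of-magnitude values quoted above; the variable and function names are illustrative, and the point mutation rate is merely the value implied by the "two orders of magnitude" statement, not an independent measurement.

```python
DUPLICATION_RATE = 1e-7      # duplications per gene per generation (order of magnitude)
POPULATION_SIZE = 10_000_000  # worms


def expected_duplications(rate_per_gene, population_size):
    """Expected number of new duplications of one given gene per generation."""
    return rate_per_gene * population_size


per_gene_expected = expected_duplications(DUPLICATION_RATE, POPULATION_SIZE)

# "Two orders of magnitude greater" implies a point mutation rate that is
# roughly 100 times lower per nucleotide site.
implied_point_rate = DUPLICATION_RATE / 100

print(f"Expected duplications of a given gene per generation: {per_gene_expected:.1f}")
print(f"Implied point mutation rate per site per generation: {implied_point_rate:.0e}")
```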
Genome duplication in cancer
Genome duplication does not occur as a single event but as a continuous process during tumor progression, generating cells with different degrees of ploidy. More than 60% of the tumors analyzed showed multiple whole-genome duplication (WGD) events, suggesting an active evolutionary model within the tumor.[23]
Neofunctionalization
Gene duplications are an essential source of genetic novelty that can lead to evolutionary innovation. Duplication creates genetic redundancy, where the second copy of the gene is often free from selective pressure; that is, mutations of it have no deleterious effects on its host organism. If one copy of a gene experiences a mutation that affects its original function, the second copy can serve as a 'spare part' and continue to function correctly. Thus, duplicate genes accumulate mutations faster than a functional single-copy gene, over generations of organisms, and it is possible for one of the two copies to develop a new and different function. Examples of such neofunctionalization include the apparent mutation of a duplicated digestive gene in a family of ice fish into an antifreeze gene, duplication leading to a novel snake venom gene,[24] and the synthesis of 1 beta-hydroxytestosterone in pigs.[25]
Gene duplication is believed to play a major role in evolution; this stance has been held by members of the scientific community for over 100 years.[26] Susumu Ohno was one of the most famous developers of this theory in his classic book Evolution by gene duplication (1970).[27] Ohno argued that gene duplication is the most important evolutionary force since the emergence of the universal common ancestor.[28] Major genome duplication events can be quite common. It is believed that the entire yeast genome underwent duplication about 100 million years ago.[29] Plants are the most prolific genome duplicators. For example, wheat is hexaploid (a kind of polyploid), meaning that it has six copies of its genome.
Subfunctionalization
Another possible fate for duplicate genes is that both copies are equally free to accumulate degenerative mutations, so long as any defects are complemented by the other copy. This leads to a neutral "subfunctionalization" (a process of constructive neutral evolution) or DDC (duplication-degeneration-complementation) model,[30][31] in which the functionality of the original gene is distributed among the two copies. Neither gene can be lost, as both now perform important non-redundant functions, but ultimately neither is able to achieve novel functionality.
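The logic of the DDC model can be sketched as a small simulation. This is a toy illustration rather than a published implementation: it assumes a gene with discrete, independently mutable subfunctions, equal probability for every tolerated loss, and hypothetical function names. A loss is tolerated only while the other copy still provides that subfunction.

```python
import random


def ddc_outcome(subfunctions, rng):
    """Let degenerative mutations hit two duplicate copies until no further
    loss can be complemented by the other copy."""
    copies = [set(subfunctions), set(subfunctions)]
    while True:
        # Losses that the other copy can still complement.
        tolerated = [
            (i, f)
            for i in (0, 1)
            for f in sorted(copies[i])
            if f in copies[1 - i]
        ]
        if not tolerated:
            return copies
        copy_index, lost = rng.choice(tolerated)
        copies[copy_index].discard(lost)


def classify(copies):
    """One empty copy means it became a pseudogene; otherwise the ancestral
    subfunctions are split between the two copies."""
    if not copies[0] or not copies[1]:
        return "nonfunctionalization"
    return "subfunctionalization"


rng = random.Random(0)
final_copies = ddc_outcome({"expression_in_tissue_A", "expression_in_tissue_B"}, rng)
print(final_copies, classify(final_copies))
```

In every run the two copies end up with non-overlapping subfunctions that together cover the ancestral set, which is why neither copy can then be lost in this model.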
Subfunctionalization can occur through neutral processes in which mutations accumulate with no detrimental or beneficial effects. However, in some cases subfunctionalization can occur with clear adaptive benefits. If an ancestral gene is pleiotropic and performs two functions, often neither one of these two functions can be changed without affecting the other function. In this way, partitioning the ancestral functions into two separate genes can allow for adaptive specialization of subfunctions, thereby providing an adaptive benefit.[32]
Loss
Often the resulting genomic variation leads to gene dosage dependent neurological disorders such as Rett-like syndrome and Pelizaeus–Merzbacher disease.[33] Such detrimental mutations are likely to be lost from the population and will not be preserved or develop novel functions. However, many duplications are, in fact, not detrimental or beneficial, and these neutral sequences may be lost or may spread through the population through random fluctuations via genetic drift.
Identifying duplications in sequenced genomes
Criteria and single genome scans
The two genes that exist after a gene duplication event are called paralogs and usually code for proteins with a similar function and/or structure. By contrast, orthologous genes are genes present in different species that are each derived from the same ancestral sequence. (See Homology of sequences in genetics).
It is important (but often difficult) to differentiate between paralogs and orthologs in biological research. Experiments on human gene function can often be carried out on other species if a homolog to a human gene can be found in the genome of that species, but only if the homolog is orthologous. If they are paralogs and resulted from a gene duplication event, their functions are likely to be too different. One or more copies of duplicated genes that constitute a gene family may be affected by insertion of transposable elements, which causes significant variation in their sequences and may eventually lead to divergent evolution. This may also reduce the chance and rate of gene conversion between the duplicates, because they then share little or no sequence similarity.
Paralogs can be identified in single genomes through a sequence comparison of all annotated gene models to one another. Such a comparison can be performed on translated amino acid sequences (e.g. BLASTp, tBLASTx) to identify ancient duplications or on DNA nucleotide sequences (e.g. BLASTn, megablast) to identify more recent duplications. Most studies to identify gene duplications require reciprocal-best-hits or fuzzy reciprocal-best-hits, where each paralog must be the other's single best match in a sequence comparison.[34]
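The reciprocal-best-hit criterion can be sketched as follows. This is a minimal illustration, assuming the all-against-all comparison has already been run and reduced to `(query, subject, score)` tuples; the gene names and scores are made up, and ties are resolved in favor of the first hit seen, which real pipelines may handle differently.

```python
def best_hits(hits):
    """Map each query gene to its single highest-scoring non-self subject."""
    best = {}
    for query, subject, score in hits:
        if query == subject:
            continue  # ignore self-matches
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits):
    """Return gene pairs in which each gene is the other's best hit."""
    best = best_hits(hits)
    pairs = set()
    for gene, partner in best.items():
        if best.get(partner) == gene:
            pairs.add(tuple(sorted((gene, partner))))
    return sorted(pairs)


# Hypothetical scores from an all-against-all comparison within one genome.
example_hits = [
    ("geneA", "geneA", 500.0),
    ("geneA", "geneB", 250.0),
    ("geneA", "geneC", 90.0),
    ("geneB", "geneA", 245.0),
    ("geneC", "geneA", 88.0),
    ("geneC", "geneB", 60.0),
]

print(reciprocal_best_hits(example_hits))  # [('geneA', 'geneB')]
```

Here geneC's best hit is geneA, but geneA's best hit is geneB, so only the geneA/geneB pair is reported as a candidate paralog pair.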
Most gene duplications exist as low copy repeats (LCRs), rather than highly repetitive sequences like transposable elements. They are mostly found in pericentromeric, subtelomeric and interstitial regions of a chromosome. Many LCRs, due to their size (>1 kb), similarity, and orientation, are highly susceptible to duplications and deletions.
Genomic microarrays detect duplications
Technologies such as genomic microarrays, also called array comparative genomic hybridization (array CGH), are used to detect chromosomal abnormalities, such as microduplications, in a high throughput fashion from genomic DNA samples. In particular, DNA microarray technology can simultaneously monitor the expression levels of thousands of genes across many treatments or experimental conditions, greatly facilitating the evolutionary studies of gene regulation after gene duplication or speciation.[35][36]
Next generation sequencing
Gene duplications can also be identified through the use of next-generation sequencing platforms. The simplest means to identify duplications in genomic resequencing data is through the use of paired-end sequencing reads. Tandem duplications are indicated by sequencing read pairs which map in abnormal orientations. Through a combination of increased sequence coverage and abnormal mapping orientation, it is possible to identify duplications in genomic sequencing data.
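The combination of the two signals can be sketched in a few lines. This is a simplified illustration, not a production caller: it assumes an inward-facing paired-end library (left read on the forward strand, right read on the reverse strand), treats outward-facing ("everted") pairs as the abnormal orientation expected at a tandem duplication junction, and uses arbitrary thresholds. The `ReadPair` type and function names are hypothetical.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    left_pos: int      # leftmost mapped coordinate of the pair
    left_strand: str   # "+" or "-"
    right_pos: int     # mapped coordinate of the rightmost read
    right_strand: str  # "+" or "-"


def is_everted(pair):
    """True if the pair faces outward (- then +) instead of inward (+ then -)."""
    return pair.left_strand == "-" and pair.right_strand == "+"


def tandem_duplication_candidate(pairs, region_depth, genome_depth,
                                 min_everted=2, min_depth_ratio=1.3):
    """Flag a region when both abnormal orientation and increased coverage
    are present. Thresholds are illustrative assumptions."""
    everted = sum(is_everted(pair) for pair in pairs)
    depth_ratio = region_depth / genome_depth
    return everted >= min_everted and depth_ratio >= min_depth_ratio


example_pairs = [
    ReadPair(1000, "-", 5200, "+"),
    ReadPair(1010, "-", 5150, "+"),
    ReadPair(1030, "-", 5210, "+"),
    ReadPair(2000, "+", 2300, "-"),  # normally oriented pair
]

print(tandem_duplication_candidate(example_pairs, region_depth=45.0, genome_depth=30.0))
```

Requiring both signals is a design choice that reduces false positives from either mapping artifacts alone or coverage fluctuations alone.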
Nomenclature
The International System for Human Cytogenomic Nomenclature (ISCN) is an international standard for human chromosome nomenclature, which includes band names, symbols and abbreviated terms used in the description of human chromosomes and chromosome abnormalities. Abbreviations include dup for duplications of parts of a chromosome.[37] For example, dup(17p12) causes Charcot–Marie–Tooth disease type 1A.[38]
As amplification
Gene duplication does not necessarily constitute a lasting change in a species' genome. In fact, such changes often don't last past the initial host organism. From the perspective of molecular genetics, gene amplification is one of many ways in which a gene can be overexpressed. Genetic amplification can occur artificially, as with the use of the polymerase chain reaction technique to amplify short strands of DNA in vitro using enzymes, or it can occur naturally, as described above. If it's a natural duplication, it can still take place in a somatic cell, rather than a germline cell (which would be necessary for a lasting evolutionary change).
Role in cancer
Duplications of oncogenes are a common cause of many types of cancer. In such cases the genetic duplication occurs in a somatic cell and affects only the genome of the cancer cells themselves, not the entire organism, much less any subsequent offspring. Recent comprehensive patient-level classification and quantification of driver events in TCGA cohorts revealed that there are on average 12 driver events per tumor, of which 1.5 are amplifications of oncogenes.[39]
| Cancer type | Associated gene amplifications | Prevalence of amplification in cancer type (percent) |
|---|---|---|
| Breast cancer | MYC | 20%[40] |
| | ERBB2 (HER2) | 20%[40] |
| | CCND1 (Cyclin D1) | 15–20%[40] |
| | FGFR1 | 12%[40] |
| | FGFR2 | 12%[40] |
| Cervical cancer | MYC | 25–50%[40] |
| | ERBB2 | 20%[40] |
| Colorectal cancer | HRAS | 30%[40] |
| | KRAS | 20%[40] |
| | MYB | 15–20%[40] |
| Esophageal cancer | MYC | 40%[40] |
| | CCND1 | 25%[40] |
| | MDM2 | 13%[40] |
| Gastric cancer | CCNE (Cyclin E) | 15%[40] |
| | KRAS | 10%[40] |
| | MET | 10%[40] |
| Glioblastoma | ERBB1 (EGFR) | 33–50%[40] |
| | CDK4 | 15%[40] |
| Head and neck cancer | CCND1 | 50%[40] |
| | ERBB1 | 10%[40] |
| | MYC | 7–10%[40] |
| Hepatocellular cancer | CCND1 | 13%[40] |
| Neuroblastoma | MYCN | 20–25%[40] |
| Ovarian cancer | MYC | 20–30%[40] |
| | ERBB2 | 15–30%[40] |
| | AKT2 | 12%[40] |
| Sarcoma | MDM2 | 10–30%[40] |
| | CDK4 | 10%[40] |
| Small cell lung cancer | MYC | 15–20%[40] |
Whole-genome duplications are also frequent in cancers, detected in 30% to 36% of tumors from the most common cancer types.[41][42] Their exact role in carcinogenesis is unclear, but in some cases they cause a loss of chromatin segregation, which changes chromatin conformation and in turn leads to oncogenic epigenetic and transcriptional modifications.[43]
References
- ^ Zhang J (2003). "Evolution by gene duplication: an update" (PDF). Trends in Ecology & Evolution. 18 (6): 292–8. doi:10.1016/S0169-5347(03)00033-8.
- ^ "Definition of Gene duplication". medterms medical dictionary. MedicineNet. 2012-03-19. Archived from the original on 2014-03-06. Retrieved 2008-12-01.
- ^ Miller, Duncan; Chen, Jianhai; Liang, Jiangtao; Betrán, Esther; Long, Manyuan; Sharakhov, Igor V. (2022-05-28). "Retrogene Duplication and Expression Patterns Shaped by the Evolution of Sex Chromosomes in Malaria Mosquitoes". Genes. 13 (6): 968. doi:10.3390/genes13060968. ISSN 2073-4425. PMC 9222922. PMID 35741730.
- ^ Dehal P, Boore JL (October 2005). "Two rounds of whole genome duplication in the ancestral vertebrate". PLOS Biology. 3 (10) e314. doi:10.1371/journal.pbio.0030314. PMC 1197285. PMID 16128622.
- ^ Wolfe, K. H.; Shields, D. C. (1997-06-12). "Molecular evidence for an ancient duplication of the entire yeast genome". Nature. 387 (6634): 708–713. Bibcode:1997Natur.387..708W. doi:10.1038/42711. ISSN 0028-0836. PMID 9192896. S2CID 4307263.
- ^ Kellis, Manolis; Birren, Bruce W.; Lander, Eric S. (2004-04-08). "Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae". Nature. 428 (6983): 617–624. Bibcode:2004Natur.428..617K. doi:10.1038/nature02424. ISSN 1476-4687. PMID 15004568. S2CID 4422074.
- ^ Otto, Sarah P. (2007-11-02). "The evolutionary consequences of polyploidy". Cell. 131 (3): 452–462. doi:10.1016/j.cell.2007.10.022. ISSN 0092-8674. PMID 17981114. S2CID 10054182.
- ^ Conant, Gavin C.; Wolfe, Kenneth H. (April 2006). "Functional partitioning of yeast co-expression networks after genome duplication". PLOS Biology. 4 (4) e109. doi:10.1371/journal.pbio.0040109. ISSN 1545-7885. PMC 1420641. PMID 16555924.
- ^ Papp, Balázs; Pál, Csaba; Hurst, Laurence D. (2003-07-10). "Dosage sensitivity and the evolution of gene families in yeast". Nature. 424 (6945): 194–197. Bibcode:2003Natur.424..194P. doi:10.1038/nature01771. ISSN 1476-4687. PMID 12853957. S2CID 4382441.
- ^ Lynch, M.; Conery, J. S. (2000-11-10). "The evolutionary fate and consequences of duplicate genes". Science. 290 (5494): 1151–1155. Bibcode:2000Sci...290.1151L. doi:10.1126/science.290.5494.1151. ISSN 0036-8075. PMID 11073452.
- ^ Freeling, Michael; Thomas, Brian C. (July 2006). "Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity". Genome Research. 16 (7): 805–814. doi:10.1101/gr.3681406. ISSN 1088-9051. PMID 16818725.
- ^ Davis, Jerel C.; Petrov, Dmitri A. (October 2005). "Do disparate mechanisms of duplication add similar genes to the genome?". Trends in Genetics. 21 (10): 548–551. doi:10.1016/j.tig.2005.07.008. ISSN 0168-9525. PMID 16098632.
- ^ Casneuf, Tineke; De Bodt, Stefanie; Raes, Jeroen; Maere, Steven; Van de Peer, Yves (2006). "Nonrandom divergence of gene expression following gene and genome duplications in the flowering plant Arabidopsis thaliana". Genome Biology. 7 (2): R13. doi:10.1186/gb-2006-7-2-r13. ISSN 1474-760X. PMC 1431724. PMID 16507168.
- ^ Li, Wen-Hsiung; Yang, Jing; Gu, Xun (November 2005). "Expression divergence between duplicate genes". Trends in Genetics. 21 (11): 602–607. doi:10.1016/j.tig.2005.08.006. ISSN 0168-9525. PMID 16140417.
- ^ Amoutzias, Grigoris D.; He, Ying; Gordon, Jonathan; Mossialos, Dimitris; Oliver, Stephen G.; Van de Peer, Yves (2010-02-16). "Posttranslational regulation impacts the fate of duplicated genes". Proceedings of the National Academy of Sciences of the United States of America. 107 (7): 2967–2971. Bibcode:2010PNAS..107.2967A. doi:10.1073/pnas.0911603107. ISSN 1091-6490. PMC 2840353. PMID 20080574.
- ^ Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, et al. (July 2004). "Large-scale copy number polymorphism in the human genome". Science. 305 (5683): 525–8. Bibcode:2004Sci...305..525S. doi:10.1126/science.1098918. PMID 15273396. S2CID 20357402.
- ^ Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, et al. (September 2004). "Detection of large-scale variation in the human genome". Nature Genetics. 36 (9): 949–51. doi:10.1038/ng1416. PMID 15286789.
- ^ Emerson JJ, Cardoso-Moreira M, Borevitz JO, Long M (June 2008). "Natural selection shapes genome-wide patterns of copy-number polymorphism in Drosophila melanogaster". Science. 320 (5883): 1629–31. Bibcode:2008Sci...320.1629E. doi:10.1126/science.1158078. PMID 18535209. S2CID 206512885.
- ^ Lipinski KJ, Farslow JC, Fitzpatrick KA, Lynch M, Katju V, Bergthorsson U (February 2011). "High spontaneous rate of gene duplication in Caenorhabditis elegans". Current Biology. 21 (4): 306–10. Bibcode:2011CBio...21..306L. doi:10.1016/j.cub.2011.01.026. PMC 3056611. PMID 21295484.
- ^ Anderson P, Roth J (May 1981). "Spontaneous tandem genetic duplications in Salmonella typhimurium arise by unequal recombination between rRNA (rrn) cistrons". Proceedings of the National Academy of Sciences of the United States of America. 78 (5): 3113–7. Bibcode:1981PNAS...78.3113A. doi:10.1073/pnas.78.5.3113. PMC 319510. PMID 6789329.
- ^ Watanabe Y, Takahashi A, Itoh M, Takano-Shimizu T (March 2009). "Molecular spectrum of spontaneous de novo mutations in male and female germline cells of Drosophila melanogaster". Genetics. 181 (3): 1035–43. doi:10.1534/genetics.108.093385. PMC 2651040. PMID 19114461.
- ^ Turner DJ, Miretti M, Rajan D, Fiegler H, Carter NP, Blayney ML, et al. (January 2008). "Germline rates of de novo meiotic deletions and duplications causing several genomic disorders". Nature Genetics. 40 (1): 90–5. doi:10.1038/ng.2007.40. PMC 2669897. PMID 18059269.
- ^ McPherson, Andrew (2025). "Ongoing genome doubling shapes evolvability and immunity in ovarian cancer". Nature. 644: 1078–1086. doi:10.1038/s41586-025-09240-3.
- ^ Lynch VJ (January 2007). "Inventing an arsenal: adaptive evolution and neofunctionalization of snake venom phospholipase A2 genes". BMC Evolutionary Biology. 7: 2. doi:10.1186/1471-2148-7-2. PMC 1783844. PMID 17233905.
- ^ Conant GC, Wolfe KH (December 2008). "Turning a hobby into a job: how duplicated genes find new functions". Nature Reviews. Genetics. 9 (12): 938–50. doi:10.1038/nrg2482. PMID 19015656. S2CID 1240225.
- ^ Taylor JS, Raes J (2004). "Duplication and divergence: the evolution of new genes and old ideas". Annual Review of Genetics. 38: 615–43. doi:10.1146/annurev.genet.38.072902.092831. PMID 15568988.
- ^ Ohno, S. (1970). Evolution by gene duplication. Springer-Verlag. ISBN 978-0-04-575015-3.
- ^ Ohno, S. (1967). Sex Chromosomes and Sex-linked Genes. Springer-Verlag. ISBN 978-91-554-5776-1.
- ^ Kellis M, Birren BW, Lander ES (April 2004). "Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae". Nature. 428 (6983): 617–24. Bibcode:2004Natur.428..617K. doi:10.1038/nature02424. PMID 15004568. S2CID 4422074.
- ^ Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J (April 1999). "Preservation of duplicate genes by complementary, degenerative mutations". Genetics. 151 (4): 1531–45. doi:10.1093/genetics/151.4.1531. PMC 1460548. PMID 10101175.
- ^ Stoltzfus A (August 1999). "On the possibility of constructive neutral evolution". Journal of Molecular Evolution. 49 (2): 169–81. Bibcode:1999JMolE..49..169S. CiteSeerX 10.1.1.466.5042. doi:10.1007/PL00006540. PMID 10441669. S2CID 1743092.
- ^ Des Marais DL, Rausher MD (August 2008). "Escape from adaptive conflict after duplication in an anthocyanin pathway gene". Nature. 454 (7205): 762–5. Bibcode:2008Natur.454..762D. doi:10.1038/nature07092. PMID 18594508. S2CID 418964.
- ^ Lee JA, Lupski JR (October 2006). "Genomic rearrangements and gene copy-number alterations as a cause of nervous system disorders". Neuron. 52 (1): 103–21. doi:10.1016/j.neuron.2006.09.027. PMID 17015230. S2CID 22412305.
- ^ Hahn MW, Han MV, Han SG (November 2007). "Gene family evolution across 12 Drosophila genomes". PLOS Genetics. 3 (11) e197. doi:10.1371/journal.pgen.0030197. PMC 2065885. PMID 17997610.
- ^ Mao R, Pevsner J (2005). "The use of genomic microarrays to study chromosomal abnormalities in mental retardation". Mental Retardation and Developmental Disabilities Research Reviews. 11 (4): 279–85. doi:10.1002/mrdd.20082. PMID 16240409.
- ^ Gu X, Zhang Z, Huang W (January 2005). "Rapid evolution of expression and regulatory divergences after yeast gene duplication". Proceedings of the National Academy of Sciences of the United States of America. 102 (3): 707–12. Bibcode:2005PNAS..102..707G. doi:10.1073/pnas.0409186102. PMC 545572. PMID 15647348.
- ^ "ISCN Symbols and Abbreviated Terms". Coriell Institute for Medical Research. Retrieved 2022-10-27.
- ^ Cassandra L. Kniffin. "CHARCOT-MARIE-TOOTH DISEASE, DEMYELINATING, TYPE 1A; CMT1A". OMIM. Updated: 4/23/2014
- ^ Vyatkin, Alexey D.; Otnyukov, Danila V.; Leonov, Sergey V.; Belikov, Aleksey V. (14 January 2022). "Comprehensive patient-level classification and quantification of driver events in TCGA PanCanAtlas cohorts". PLOS Genetics. 18 (1) e1009996. doi:10.1371/journal.pgen.1009996. PMC 8759692. PMID 35030162.
- ^ Kinzler KW, Vogelstein B (2002). The genetic basis of human cancer. McGraw-Hill. p. 116. ISBN 978-0-07-137050-9.
- ^ Bielski, Craig M.; Zehir, Ahmet; Penson, Alexander V.; Donoghue, Mark T. A.; Chatila, Walid; Armenia, Joshua; Chang, Matthew T.; Schram, Alison M.; Jonsson, Philip; Bandlamudi, Chaitanya; Razavi, Pedram; Iyer, Gopa; Robson, Mark E.; Stadler, Zsofia K.; Schultz, Nikolaus (2018). "Genome doubling shapes the evolution and prognosis of advanced cancers". Nature Genetics. 50 (8): 1189–1195. doi:10.1038/s41588-018-0165-1. ISSN 1546-1718. PMC 6072608. PMID 30013179.
- ^ Quinton, Ryan J.; DiDomizio, Amanda; Vittoria, Marc A.; Kotýnková, Kristýna; Ticas, Carlos J.; Patel, Sheena; Koga, Yusuke; Vakhshoorzadeh, Jasmine; Hermance, Nicole; Kuroda, Taruho S.; Parulekar, Neha; Taylor, Alison M.; Manning, Amity L.; Campbell, Joshua D.; Ganem, Neil J. (2021). "Whole-genome doubling confers unique genetic vulnerabilities on tumour cells". Nature. 590 (7846): 492–497. Bibcode:2021Natur.590..492Q. doi:10.1038/s41586-020-03133-3. ISSN 1476-4687. PMC 7889737. PMID 33505027.
- ^ Lambuta, Ruxandra A.; Nanni, Luca; Liu, Yuanlong; Diaz-Miyar, Juan; Iyer, Arvind; Tavernari, Daniele; Katanayeva, Natalya; Ciriello, Giovanni; Oricchio, Elisa (2023-03-15). "Whole-genome doubling drives oncogenic loss of chromatin segregation". Nature. 615 (7954): 925–933. Bibcode:2023Natur.615..925L. doi:10.1038/s41586-023-05794-2. ISSN 1476-4687. PMC 10060163. PMID 36922594.
Gene duplication
Fundamentals
Definition and Types
Gene duplication is a fundamental evolutionary process in which a segment of DNA containing a functional gene is copied within the genome, resulting in two or more identical or nearly identical copies of the original gene. This duplication creates genetic redundancy, allowing one copy to maintain the original function while the other may accumulate mutations without immediate deleterious effects. The process is widespread across eukaryotes and prokaryotes, contributing to genome expansion and functional innovation, as first systematically explored in Susumu Ohno's seminal work.[6][7]
Gene duplications are classified into several types based on their genomic scale and mechanism of origin. Tandem duplications occur when copies are generated adjacent to each other on the same chromosome, often through errors in recombination, resulting in gene clusters. Dispersed duplications produce non-adjacent copies scattered across the genome, typically via transposition events like retrotransposition or DNA-mediated movement. Segmental duplications involve larger blocks of DNA, encompassing multiple genes, duplicated within or between chromosomes. Whole-genome duplications (WGD), also known as polyploidy events, replicate the entire genome, leading to multiple copies of all genes simultaneously; these are particularly common in plants but have occurred in vertebrate lineages as well.[2][7]
At the molecular level, gene duplication immediately introduces redundancy, where the duplicate copies share overlapping functions and are initially under relaxed purifying selection, as mutations in one copy are buffered by the other. This reduces selective pressure on the duplicates, permitting neutral or slightly deleterious changes to accumulate without disrupting essential functions, though most duplicates are eventually lost or pseudogenized. Functional divergence, if it occurs, arises later through processes like neofunctionalization or subfunctionalization, but the initial phase is characterized by preserved sequence similarity and co-regulation.[7][6]
A classic example of whole-genome duplication's impact is seen in the Hox gene clusters of vertebrates, where two rounds of WGD in early vertebrate evolution produced four clusters (HoxA-D) from an ancestral single cluster, enabling spatial patterning innovations in body plans such as paired appendages.[8]
Historical Context
The concept of gene duplication emerged in the early 20th century through cytogenetic studies in plants, where polyploidy—whole-genome duplication—was recognized as a common mechanism contributing to speciation and variation. Dutch botanist Hugo de Vries first described polyploid mutants in Oenothera in 1907, and by the 1910s, researchers like Albert F. Blakeslee and Øjvind Winge had identified polyploidy in various angiosperms, attributing it to chromosome doubling that amplified gene copies and facilitated evolutionary novelty.[9] These observations laid foundational evidence for duplication events at the genomic scale, particularly in plants, where polyploidy was estimated to occur in up to 70% of species by mid-century.[10] In animals, early molecular insights came from Drosophila research in the 1930s. Calvin B. Bridges demonstrated in 1936 that the Bar eye phenotype resulted from a tandem duplication of a chromosomal segment, providing the first direct evidence of segmental gene duplication and its phenotypic effects through unequal crossing over. This work hinted at duplication as a source of genetic redundancy and mutation, though it was viewed primarily as a cytological anomaly rather than an evolutionary driver. By the 1960s, the discovery of multigene families further illuminated the prevalence of duplications; for instance, ribosomal DNA (rDNA) was identified as a tandemly repeated multigene family in Drosophila by Ritossa and Spiegelman in 1965, revealing hundreds of identical copies essential for ribosome biogenesis. 
Similar findings in other organisms, such as histone and immunoglobulin genes, underscored that duplications generated families of related sequences, challenging the notion of genes as unique loci.[11] The modern synthesis of gene duplication as a major evolutionary mechanism crystallized in 1970 with Susumu Ohno's seminal book Evolution by Gene Duplication, which argued that duplications provide raw material for innovation by freeing redundant copies from selective constraints, allowing divergence into new functions.[6] This perspective integrated with Motoo Kimura's neutral theory of molecular evolution, proposed in 1968 and expanded in the 1970s, positing that many duplications and subsequent mutations are selectively neutral, fixed by genetic drift rather than adaptive pressure, thus explaining the abundance of pseudogenes and paralogs in genomes. Confirmation accelerated in the 1980s and 1990s with DNA sequencing technologies; for example, sequencing of the human beta-globin cluster in 1980 revealed ancient duplications underlying hemoglobin evolution, while the 1996 yeast genome sequence identified widespread paralogs from a whole-genome duplication event approximately 100 million years ago. These molecular data validated Ohno's hypothesis at scale, showing duplications accounted for 15-20% of eukaryotic genes.[12] Early reception of these ideas was marked by debates over whether duplications primarily drive adaptive innovation or accumulate neutrally. Ohno's adaptive emphasis faced skepticism from neutralists like Kimura, who argued most fixed duplicates contribute little to fitness and are lost or silenced, as evidenced by high pseudogene rates in vertebrate genomes.[13] Proponents of adaptation, however, highlighted cases like vertebrate Hox gene clusters, sequenced in the 1990s, where duplications correlated with morphological complexity. 
This tension persisted into the late 20th century, shaping models that balanced neutral drift with occasional positive selection in duplicate retention.[14]

Mechanisms
Unequal Crossing Over
Unequal crossing over is a key mechanism of gene duplication that occurs during homologous recombination, particularly in meiosis, when misaligned homologous chromosomes or sister chromatids exchange genetic material unevenly. This misalignment leads to one recombinant chromatid receiving an extra copy of a gene or segment, while the reciprocal product experiences a deletion. The process is homology-dependent, relying on sequence similarity to initiate pairing, but errors in alignment result in non-allelic homologous recombination (NAHR), producing tandem duplications.[15][16] At the molecular level, repetitive sequences play a critical role in facilitating misalignment. Low-copy repeats (LCRs), which are paralogous segments greater than 1 kb with over 90% sequence identity, mediate NAHR by promoting ectopic pairing between non-allelic sites. Similarly, Alu elements, abundant short interspersed nuclear elements, can drive unequal exchanges due to their high copy number and sequence homology, often resulting in local duplications or larger copy-number variants. These events typically yield tandem arrays, where duplicated genes are arranged in direct orientation adjacent to the original copy, enhancing the potential for further evolutionary changes. Segmental duplications, involving large (often >10 kb) non-tandem copies of chromosomal regions, can also arise via NAHR between dispersed LCRs, contributing to genomic architecture and disease susceptibility.[17][18][19] The frequency of unequal crossing over is elevated in genomic regions enriched with LCRs or Alu elements, as these repeats increase the likelihood of misalignment during synapsis. Such hotspots are common in gene clusters prone to instability, where even low-level homology (e.g., 25-39 bp identity) can suffice for recombination. 
In human sperm, for instance, de novo duplications occur at rates around 10^{-5} per meiosis, predominantly through intermolecular exchanges between homologous chromosomes.[20][18]

A prominent example is the duplication within the human alpha-globin gene cluster on chromosome 16, where unequal crossing over between the alpha2 (HBA2) and alpha1 (HBA1) genes generates anti-3.7 kb duplications, resulting in three alpha-globin genes (ααα configuration). This event, driven by Z-box repetitive homology blocks flanking the genes, is reciprocal to common alpha-thalassemia deletions and underscores how such mechanisms contribute to both normal variation and disease predisposition.[20]

Replication-Based Errors
Replication-based errors during DNA synthesis represent a primary mechanism for generating small-scale gene duplications, particularly those involving short tandem repeats (STRs). In this process, known as replication slippage or slipped-strand mispairing, the DNA polymerase temporarily dissociates from the template strand within repetitive sequences, leading to misalignment upon re-annealing. This slippage can cause the polymerase to skip forward (resulting in deletions) or repeat a segment (producing duplications) of the template, typically affecting sequences under 1 kb in length. Such errors are exacerbated in regions rich in STRs, where the repetitive nature facilitates strand dissociation during the S-phase of the cell cycle.[21][22] At the molecular level, replication fork stalling plays a central role, often triggered by non-B DNA structures such as hairpins or triplexes formed in repetitive or AT-rich sequences during strand unwinding. The fork stalling and template switching (FoSTeS) model describes how a stalled fork disengages, with the nascent strand invading a secondary template via microhomology (typically 2–15 bp), resuming synthesis and incorporating duplicated material. This mechanism accounts for both simple tandem duplications and more complex rearrangements with junctional microhomologies or insertions. Non-B structures, like stable hairpins in CAG/CTG repeats, impede polymerase progression, increasing the likelihood of template switching and duplication events. Error-prone DNA polymerases, such as those with lower fidelity (e.g., inversely correlated with proofreading efficiency), further promote slippage by stabilizing misaligned intermediates during synthesis.[23][24][25] These errors are more frequent for microduplications under 1 kb, occurring at elevated rates in regions of replication stress, such as fragile sites or late-replicating heterochromatin domains. 
Replication timing influences susceptibility, with late-replicating regions exhibiting higher mutation rates due to prolonged exposure to endogenous stresses and reduced proofreading efficiency. Experimental induction of replication stress (e.g., via aphidicolin) generates non-recurrent copy number variants (CNVs), including duplications, at frequencies mimicking spontaneous events, with breakpoints often showing microhomologies consistent with FoSTeS. Small tandem duplications of 15–300 bp are observed in up to 25% of certain disease alleles, underscoring their prevalence in genomic instability.[26][27][28]

A representative example is the expansion of CAG trinucleotide repeats in the HTT gene, associated with Huntington's disease. Slippage during replication of these repeats leads to duplication of the triplet units, with hairpin formation on the nascent strand promoting further iterations and expansions beyond 36 repeats, resulting in toxic protein aggregates. This process highlights how replication errors in STRs can drive pathological duplications while contributing to evolutionary variation in repeat copy number.[21][24]

Transposition Events
Transposition events contribute to gene duplication through retrotransposition, a process in which mature mRNA transcripts are reverse-transcribed into complementary DNA (cDNA) and randomly inserted into new genomic locations, generating retrogene copies of the original gene.[29] This RNA-mediated mechanism differs from direct DNA duplication by relying on an intermediary transcript, often utilizing the enzymatic machinery of endogenous retroelements to facilitate the insertion.[29]

At the molecular level, long interspersed nuclear element-1 (LINE-1 or L1) retrotransposons play a central role by providing the reverse transcriptase enzyme, which converts the mRNA into cDNA via a target-primed reverse transcription process.[29] The resulting retrogenes typically lack introns, because the source mRNA has already been spliced, and they carry a poly(A) tail at the 3' end. They often insert without their original promoters or regulatory elements, so they may be transcriptionally silent at first unless new regulatory sequences are acquired nearby.[29] These characteristics distinguish retrogenes from intron-containing duplicates formed by other mechanisms.[30]

Retrotransposition is particularly prevalent in mammalian genomes, where LINE-1 activity has driven a significant portion of processed pseudogene formation, accounting for about 70% of non-functional gene duplicates in humans.[30] In the human genome, estimates indicate approximately 8,000 to 17,000 retrocopies exist, many of which originated from primate lineage expansions around 40-50 million years ago.[31] This abundance underscores retrotransposition's role in genomic plasticity, though most retrogenes become pseudogenes, with a subset evolving new functions post-fixation.[29]

A notable example of retrotransposition's impact on gene family expansion involves the PGAM family, where functional retrocopies like PGAM5 have arisen and acquired new roles in cellular processes.[32]

Chromosomal Alterations
Chromosomal alterations represent a major mechanism for generating gene duplications on a large scale, primarily through aneuploidy and polyploidy, which result in the gain or multiplication of entire chromosomes or genomes, thereby creating multiple copies of numerous genes simultaneously.[33] Aneuploidy involves the abnormal gain or loss of one or more chromosomes, leading to an imbalance in gene dosage where affected cells possess extra or fewer copies of genes on those chromosomes.[34] This process often arises from nondisjunction, the failure of homologous chromosomes or sister chromatids to separate properly during mitosis or meiosis, which disrupts normal chromosome segregation and produces gametes or daughter cells with altered chromosome numbers.[35] In contrast, polyploidy entails the duplication of the entire genome, instantly doubling or multiplying gene copies across all chromosomes, and can occur through mechanisms such as hybridization between species (leading to allopolyploidy) or endoreduplication, where cells undergo repeated DNA replication without mitosis or cytokinesis.[36] These alterations extend beyond single-gene events, affecting vast genomic regions and providing raw material for evolutionary innovation.[37] Aneuploidy is typically transient in most organisms due to its disruptive effects on cellular function, but it can become fixed in certain lineages, contributing to gene copy variation.[38] Polyploidy, however, is far more stable and prevalent, particularly in plants, where it serves as a key driver of speciation and adaptation. Recent estimates suggest that polyploidy accompanies approximately 15% of speciation events in angiosperms, though older studies proposed higher figures of 30–80%.[39][40] In animals, polyploidy and related aneuploid events are rarer owing to challenges in meiosis and development, yet they have played pivotal roles in major evolutionary transitions, such as in vertebrates. 
For instance, two rounds of whole-genome duplication (2R) occurred in the ancestral vertebrate lineage approximately 500–600 million years ago, followed by a third round (3R) in teleost fish, which expanded gene families essential for complex traits like the nervous and immune systems.[41][42] These events underscore how chromosomal alterations can facilitate rapid genomic reconfiguration without relying on incremental small-scale duplications.

Evolutionary Implications
Duplication Rates
Gene duplication rates are typically estimated through phylogenetic analyses that reconstruct the divergence times of paralogous gene pairs using molecular clocks calibrated against known evolutionary timelines. These methods account for synonymous substitution rates (Ks) between duplicates to infer when duplications occurred, providing a framework to quantify both ongoing small-scale events and episodic bursts from whole-genome duplications (WGDs).[5]

Across the eukaryotes surveyed, the average duplication rate is approximately 0.01 events per gene per million years, based on genomic surveys of species such as humans, nematodes, fruit flies, and yeast. This rate reflects primarily tandem and segmental duplications, with estimates varying slightly by taxon; for instance, rates in vertebrates range from 0.0005 to 0.004 duplications per gene per million years when focusing on recent events. In the human genome, duplicated genes constitute about 8–20% of the total gene content, underscoring the cumulative impact of these events over evolutionary time.[5]

Plants exhibit generally higher effective duplication rates, often exceeding 0.01 per gene per million years when including polyploidy-driven WGDs, which are far more prevalent in plants than in animals and can double the gene complement instantaneously.[43] For example, many plant lineages, such as Arabidopsis thaliana, show elevated retention of duplicates with half-lives of 17–25 million years, compared to 3–7 million years in animals, due to these polyploid events.[43]

Several factors influence these rates across taxa. Larger genome sizes correlate with higher duplication frequencies, as expanded non-coding regions facilitate segmental duplications and transposon-mediated events. 
Recombination hotspots, where unequal crossing over is more likely, also elevate local duplication rates by promoting non-allelic homologous recombination.[44] Selection pressures play a key role in modulating net rates by favoring retention of duplicates under dosage constraints or novel functions, while purging redundant copies; purifying selection is stronger in essential genes, leading to faster loss rates.

Variation is evident across taxa. Teleost fishes, for instance, display accelerated duplication dynamics following their ancient WGD event approximately 300–450 million years ago, resulting in higher proportions of paralogs (up to 20–30% in some species like zebrafish) and elevated tandem duplication rates compared to other vertebrates.[45] This burst contributed to the diversification of teleosts, which comprise over half of all vertebrate species.[46]

Neofunctionalization
Neofunctionalization refers to the evolutionary process whereby, after gene duplication, one paralog acquires a novel function—such as a new enzymatic activity or a distinct expression pattern—while the other copy preserves the original ancestral role. This divergence enables the innovation of new traits without disrupting established functions, contributing to adaptive evolution across species. The concept builds on the initial redundancy created by duplication, which provides a genetic buffer for mutational experimentation.[47] At the molecular level, neofunctionalization arises from relaxed purifying selection on the duplicate gene, allowing neutral or slightly deleterious mutations to accumulate until beneficial ones confer selective advantages. These adaptive changes often involve alterations in regulatory regions, leading to novel spatiotemporal expression, or structural modifications like protein domain shuffling that enable new interactions or catalytic properties. For instance, mutations in promoter sequences can shift expression to new tissues, while exon shuffling might repurpose binding sites for different substrates. Such mechanisms have been observed in enzyme evolution, where duplicated copies develop enhanced specificity or entirely new reactions.[48][49] Evidence for neofunctionalization emerges from comparative genomics, revealing paralogous genes with specialized roles that diverged post-duplication. A prominent example is the globin gene family in vertebrates, where ancient duplications led to paralogs like alpha and beta hemoglobins adapting distinct functions in oxygen transport and storage across developmental stages and tissues, such as fetal versus adult forms. Similarly, in insects, the Drosophila bithorax complex demonstrates neofunctionalization through homeobox gene duplicates that acquired unique regulatory roles in body patterning. 
These cases highlight how paralogs evolve non-overlapping functions, supported by sequence divergence and functional assays.[50][51]

Theoretical models underpin neofunctionalization, with Susumu Ohno's foundational framework proposing that gene duplication supplies the raw material for evolutionary novelty by freeing one copy from selective constraints. Ohno emphasized that this redundancy fosters innovation, as seen in vertebrate genome expansions. Quantitative models extend this by estimating the probability of fixation for advantageous mutations in duplicates under positive selection. For a new beneficial mutation this probability is approximately 2s (where s is the selection coefficient), compared with 1/(2N) for a neutral mutation drifting in a diploid population of size N, and the difference influences the likelihood of permanent divergence. These probabilistic approaches, informed by population genetics, predict higher neofunctionalization rates in large populations with strong selective pressures.[52][53]

Subfunctionalization and Dosage Effects
Subfunctionalization occurs when duplicated genes partition the ancestral gene's functions between the copies, thereby reducing redundancy and promoting the retention of both paralogs. This process typically involves complementary degenerative mutations that eliminate subsets of the original regulatory elements or protein domains in each duplicate, leading to a division of labor such as tissue-specific expression or specialized biochemical roles. For instance, one copy may retain expression in certain tissues while the other takes over in different ones, ensuring that the combined functions match the pre-duplication state. This mechanism was formalized in the duplication-degeneration-complementation (DDC) model, which posits that neutral mutations in cis-regulatory sequences, like promoters, can stochastically partition ancestral expression patterns, making both copies essential for viability.[54] At the molecular level, subfunctionalization often arises through mutations affecting promoters, enhancers, or splicing sites, which alter expression timing, location, or isoform production without creating novel functions. Changes in alternative splicing can further drive this by fixing different splice variants in each paralog, preserving the ancestral proteome while distributing subroles. In the cytochrome P450 (CYP) gene family, involved in liver detoxification, duplicates have subfunctionalized to specialize in metabolizing distinct substrates, such as one paralog targeting specific xenobiotics while another handles endogenous compounds, enhancing adaptive responses to environmental toxins. 
This partitioning contrasts with neofunctionalization, where duplicates acquire entirely new functions, but both can contribute to long-term gene retention.[55][56]

Dosage effects refer to the selective pressures maintaining balanced copy numbers in duplicated genes, particularly those encoding stoichiometric components of protein complexes, where imbalances disrupt macromolecular assembly or cellular homeostasis. Histone genes exemplify this: following duplication, yeast histone paralogs are retained to preserve precise nucleosome stoichiometry, with strong purifying selection against dosage imbalances via mechanisms like gene conversion to minimize divergence. Such balance is critical because excess or deficient gene products can impair complex formation; for instance, overexpressed histones in yeast trigger genome instability and segregation errors. In metazoans, dosage imbalances from segmental duplications or aneuploidy often lead to developmental disorders or cancer predisposition, as seen in conditions like Down syndrome where extra copies of dosage-sensitive genes perturb stoichiometric networks.[57][58]

Gene Loss and Redundancy
Following gene duplication, one common evolutionary outcome is the loss of one or both copies, often through the accumulation of deleterious mutations that render the gene non-functional, transforming it into a pseudogene. This process typically begins shortly after duplication, as redundant copies experience relaxed purifying selection, allowing slightly deleterious mutations—such as frameshifts, premature stop codons, or promoter disruptions—to accumulate and fix via genetic drift.[5] In many cases, the redundant copy decays neutrally until it is completely silenced or deleted from the genome, contributing to the observation that the vast majority of duplicate genes are lost within a few million years.[59] Estimates suggest that 50-80% of duplicates may be lost or pseudogenized within this timeframe, depending on the organism and duplication mechanism, as seen in post-whole-genome duplication events in plants like rice where 30-65% of duplicates were eliminated over tens of millions of years.[60] Redundancy resolution after duplication is heavily influenced by dosage sensitivity, where genes involved in balanced complexes or stoichiometric interactions are less likely to lose a copy due to the disruptive effects of altered gene dosage. The gene balance hypothesis posits that such dosage-sensitive genes, including many transcription factors and signaling components, experience stronger selection against imbalance, leading to higher retention rates of duplicates compared to dosage-insensitive genes.[61] For instance, essential genes—those whose knockout is lethal—are disproportionately retained as duplicates, as their loss would compromise critical functions without the buffering effect of redundancy.[62] This selective pressure helps maintain genomic stability by preserving copies that mitigate dosage perturbations, while non-essential, dosage-tolerant genes are more prone to rapid elimination. 
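The loss estimates above (50-80% of duplicates lost within a few million years) can be related to a half-life with a short calculation. The sketch below assumes simple exponential decay of duplicate pairs, which is an illustrative assumption rather than a model taken from the cited sources; the function names and the 5-million-year window are likewise hypothetical choices.

```python
import math

def surviving_fraction(t_myr, half_life_myr):
    """Fraction of duplicates still intact after t_myr million years,
    assuming exponential decay (an illustrative assumption)."""
    return 0.5 ** (t_myr / half_life_myr)

def half_life_from_loss(lost_fraction, t_myr):
    """Half-life (in Myr) implied by losing lost_fraction of duplicates
    within t_myr million years under the same decay assumption."""
    return t_myr * math.log(2) / -math.log(1.0 - lost_fraction)

# Hypothetical window of 5 Myr standing in for "a few million years".
for lost in (0.5, 0.8):
    t_half = half_life_from_loss(lost, t_myr=5.0)
    print(f"{lost:.0%} lost in 5 Myr -> half-life of about {t_half:.1f} Myr")
```

Under this assumption, higher loss fractions over the same window translate directly into shorter half-lives, which is one way to compare loss estimates reported in different units.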
Evolutionary patterns of gene loss vary with population size and ecological context, with faster pseudogenization observed in smaller populations where genetic drift accelerates the fixation of disabling mutations. In neutral models of decay, the rate of pseudogene formation approximates the genomic deleterious mutation rate (typically 10^{-5} to 10^{-6} per site per generation), but in small effective population sizes (e.g., Ne < 10^6), drift dominates, shortening the half-life of duplicates to as little as 1-5 million years on average across eukaryotes.[5] A notable example is the mammalian-specific pseudogenization of olfactory receptor genes, where rapid expansions via duplication were followed by extensive losses (up to 50% pseudogenes in humans), likely due to relaxed selection in species with diminished reliance on olfaction, such as primates.[63] These patterns underscore how gene loss streamlines genomes by removing redundant or non-adaptive sequences, reducing metabolic costs and mutational targets while adapting to niche-specific pressures.[59]

Detection Methods
Computational Identification
Computational identification of gene duplications relies on analyzing single-genome sequence data to detect paralogous genes—copies arising within the same lineage—through in silico algorithms that assess sequence homology, genomic context, and evolutionary relationships.[64] Key criteria include high sequence similarity, typically requiring greater than 30-50% amino acid identity over substantial portions of the protein length (e.g., >70-90% coverage), to infer homology; synteny breaks, where conserved gene order is disrupted indicating duplication events; and paralog clustering, grouping genes into families based on shared ancestry.[64] Tools like BLAST (Basic Local Alignment Search Tool) are foundational for initial local alignments, scanning genomes for similar sequences with e-value thresholds to filter spurious matches.[64] Methods for detection encompass whole-genome alignments to pinpoint segmental duplicates, where tools such as MCScanX identify collinear blocks of homologous genes (requiring at least five pairs with minimal gaps) to reveal duplicated segments often spanning tens to hundreds of kilobases.[64] For ancient duplications, phylogenetic tree reconciliation integrates gene trees—built from multiple sequence alignments using models like WAG or HKY—with species trees to infer duplication nodes by detecting inconsistencies like excess terminal branches. These approaches enable timing of events relative to speciation, distinguishing within-species paralogs from inter-species orthologs. Challenges in these methods include accurately distinguishing paralogs (duplication-derived) from orthologs (speciation-derived), which often requires multi-species comparisons to resolve ambiguous topologies, and handling assembly errors in repetitive regions that can artifactually inflate duplication counts or misalign segments. 
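The similarity criteria described above (percent identity, alignment coverage, and an e-value cutoff) can be sketched as a simple filter over pairwise alignment hits. This is a minimal illustration, not the pipeline of any particular tool: the `Hit` record, the `candidate_paralogs` function, and the default thresholds are hypothetical, with cutoffs chosen from the lower ends of the ranges quoted in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    query: str
    subject: str
    pct_identity: float  # percent amino acid identity over the aligned region
    aln_len: int         # aligned length in residues
    query_len: int       # full length of the query protein
    evalue: float

def candidate_paralogs(hits, min_identity=30.0, min_coverage=0.7, max_evalue=1e-5):
    """Return sorted, de-duplicated gene pairs passing all three cutoffs."""
    pairs = set()
    for h in hits:
        if h.query == h.subject:
            continue  # ignore self-hits
        coverage = h.aln_len / h.query_len
        if (h.pct_identity >= min_identity
                and coverage >= min_coverage
                and h.evalue <= max_evalue):
            # Store each pair in a fixed order so A-B and B-A collapse.
            pairs.add(tuple(sorted((h.query, h.subject))))
    return sorted(pairs)

# Made-up hits for illustration only.
hits = [
    Hit("geneA", "geneA", 100.0, 300, 300, 0.0),
    Hit("geneA", "geneB", 62.5, 280, 300, 1e-80),
    Hit("geneB", "geneA", 62.5, 280, 310, 1e-80),
    Hit("geneA", "geneC", 35.0, 90, 300, 1e-6),    # fails coverage
    Hit("geneC", "geneD", 28.0, 250, 260, 1e-4),   # fails identity and e-value
]
print(candidate_paralogs(hits))  # [('geneA', 'geneB')]
```

A filter like this only nominates candidates; distinguishing paralogs from orthologs still requires the phylogenetic or synteny evidence discussed in this section.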
False positives from fragmented assemblies, particularly in low-coverage genomes, necessitate filtering steps like reciprocal best hits or synteny validation.

A prominent example is Ensembl's paralogy predictions, which employ a pipeline inspired by TreeFam methodology: genes are clustered via BLAST-based similarity (e.g., e-value < 1e-5), followed by multiple alignments and phylogenetic tree construction with TreeBeST for reconciliation, identifying duplications across vertebrate genomes with high precision for families like Hox genes.

Array-Based Techniques
Array-based techniques, particularly comparative genomic hybridization (CGH) microarrays, enable the detection of gene duplications by identifying copy number variations (CNVs) across the genome. In array CGH, genomic DNA from a test sample is labeled with one fluorophore (e.g., Cy3), while reference DNA is labeled with another (e.g., Cy5), and both are hybridized to an array of immobilized DNA probes, such as bacterial artificial chromosome (BAC) clones or oligonucleotides. The ratio of fluorescence intensities for each probe reflects relative copy number differences; specifically, the log2-transformed ratio (log2(test/reference)) greater than 0 indicates copy number gains, including duplications, with values around 0.58 corresponding to a single copy gain in diploid genomes.[65][66] This method was pioneered in the late 1990s to achieve higher resolution than traditional metaphase CGH for analyzing DNA copy number alterations.[67] Resolution has evolved significantly with array designs. Early BAC-based arrays offered megabase (Mb)-scale resolution due to larger probe sizes (100-200 kb), suitable for detecting large segmental duplications but limited for smaller events. Subsequent oligonucleotide and single nucleotide polymorphism (SNP) arrays improved this to kilobase (kb) scale, with probe densities enabling detection of CNVs as small as 1-10 kb, particularly effective for recent duplications not obscured by sequence divergence. These advancements allow array CGH to identify both germline and somatic duplications, though it primarily detects unbalanced changes and may miss low-level mosaicism below 20-30% cellular prevalence.[68][69] In applications, array CGH has been instrumental in population genetics to map CNV landscapes, revealing widespread gene duplications contributing to human genetic diversity, as seen in studies profiling hundreds of individuals. 
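The log2 ratio values described earlier in this section follow directly from the copy numbers involved, as the short calculation below illustrates. It is a sketch under idealized assumptions (a diploid reference, no noise or probe bias), and the function names are hypothetical; the mosaic case simply averages copy number over a mixed cell population.

```python
import math

def expected_log2_ratio(test_copies, reference_copies=2):
    """Idealized log2(test/reference) for a region with the given copy numbers."""
    return math.log2(test_copies / reference_copies)

def mosaic_log2_ratio(fraction_with_gain, reference_copies=2):
    """Idealized ratio when only a fraction of cells carry one extra copy."""
    mean_copies = reference_copies + fraction_with_gain
    return math.log2(mean_copies / reference_copies)

print(round(expected_log2_ratio(3), 2))  # 0.58, single-copy gain
print(round(expected_log2_ratio(1), 2))  # -1.0, single-copy loss
print(round(mosaic_log2_ratio(0.2), 2))  # 0.14, gain in 20% of cells
```

The small expected shift for a gain present in only a fifth of cells gives some intuition for why low-level mosaicism is hard to separate from noise on an array.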
In disease diagnostics, array CGH aids in identifying pathogenic duplications associated with developmental disorders, congenital anomalies, and cancers, often as a first-line test replacing karyotyping due to its genome-wide coverage. However, a key limitation is its inability to readily distinguish tandem duplications (adjacent copies) from dispersed ones (non-adjacent), as it reports net copy number without structural context, necessitating orthogonal methods like fluorescence in situ hybridization for clarification.[70][71]

A notable example from the 2000s involved array CGH in the Human Genome Project era, where BAC-based platforms identified thousands of segmental duplications and associated CNVs, contributing to assemblies like hg17 and hg18 by highlighting duplication hotspots prone to genomic instability. For instance, high-density aCGH experiments targeted these regions, uncovering over 1,400 copy-number variable regions (CNVRs) in diverse human populations and linking duplications to evolutionary expansions in gene families like those involved in immunity.[72]

Sequencing Approaches
Next-generation sequencing (NGS) technologies have revolutionized the detection of gene duplications by enabling high-throughput analysis of copy number variations (CNVs) and structural variants (SVs) at base-pair resolution.[73] Read-depth analysis, a primary method in NGS, quantifies duplication events by measuring the normalized coverage of sequencing reads across genomic regions, where increased read depth indicates copy number gains.[74] Paired-end mapping complements this by identifying SVs, including duplications, through discrepancies in the expected distance or orientation between read pairs, which signal insertions or rearrangements.[74] These approaches build on earlier array-based techniques as precursors for CNV detection but offer superior resolution for mapping duplication breakpoints.[73] Long-read sequencing technologies, such as PacBio's single-molecule real-time (SMRT) sequencing and Oxford Nanopore Technologies (ONT), address limitations of short-read NGS by producing reads spanning tens to hundreds of kilobases, effectively resolving complex gene duplications within repetitive genomic contexts.[75] These methods excel at assembling segmental duplications—low-copy repeats with high sequence identity—by spanning homologous regions that short reads often collapse or misalign.[76] For instance, polyploid phasing algorithms applied to long-read data have enabled the de novo assembly of duplicated loci, distinguishing alleles in heterozygous duplications.[75] In the 2020s, advances in long-read sequencing have significantly improved the resolution of segmental duplications exhibiting greater than 95% sequence identity, with complete telomere-to-telomere assemblies revealing previously hidden duplication structures in the human genome.[77] These improvements stem from enhanced base-calling accuracy and hybrid assembly pipelines integrating short- and long-read data, achieving near-perfect reconstruction of duplicated regions that were intractable in 
earlier drafts.[78] Integration of sequencing with CRISPR-Cas9 enrichment has further advanced validation, where targeted capture of duplicated loci followed by long-read sequencing confirms structural variants and resolves causal alleles in complex regions.[79]

Despite this progress, challenges persist, particularly with short-read sequencing in repetitive regions, where high sequence similarity leads to mapping ambiguities and false positives in duplication calls.[80] Quantification errors in read-depth analysis are also common due to biases from GC content or mappability, potentially under- or overestimating copy numbers in duplicated segments.[81] Long-read technologies mitigate some issues but face higher per-base error rates, necessitating computational polishing for accurate duplication annotation.[82]

Hi-C sequencing provides a complementary 3D contextual view for duplication detection by capturing chromatin interactions, revealing spatial proximity between duplicated loci that indicates functional or evolutionary relationships.[83]

Recent pangenome studies from 2023 to 2025 have leveraged these sequencing approaches to uncover hidden duplications across diverse human populations, with graph-based pangenomes identifying novel SVs in non-reference alleles that short-read methods missed.[84] For example, the Human Pangenome Reference Consortium's 2023 assembly highlighted population-specific gene duplications through long-read integration, enhancing our understanding of structural variation diversity.[84] The 2025 Data Release 2 further expanded the pangenome with additional phased diploid assemblies from diverse ancestries, improving the identification of population-specific gene duplications and structural variants.[85]

Nomenclature and Annotation
Naming Conventions
Gene duplication results in paralogous genes that require standardized nomenclature to facilitate consistent scientific communication and database integration. The Human Genome Organisation (HUGO) Gene Nomenclature Committee (HGNC) establishes these conventions for human genes, ensuring unique symbols that reflect evolutionary relationships without implying unverified functions.[86] For paralogs arising from duplication, HGNC assigns a shared root symbol followed by distinguishing suffixes, typically Arabic numerals (e.g., 1, 2) or letters (e.g., A, B), based on sequence similarity, chromosomal location, or inferred function. Gene families, often expanded by duplications, share a root symbol such as CYP for the cytochrome P450 superfamily; in a symbol such as CYP2D6, the characters after the root identify the family (2), subfamily (D), and individual member (6). Pseudogenes, which are non-functional duplicates, receive a "P" suffix, as in CYP2D7P, to denote their inactivated status. These rules prioritize stability, with updates only for newly resolved duplications or to correct ambiguities, overseen by HGNC in collaboration with international experts.[86]
Naming principles emphasize brevity and specificity: chromosomal location informs placeholder symbols for genes of unknown function (e.g., symbols of the form C#orf#), while sequence homology or functional clues guide family assignments. However, challenges arise with ancient duplications, where extensive sequence divergence creates ambiguities in paralog identification and orthology assignment, complicating consistent labeling across species. The HGNC mitigates this through rigorous review, but entrenched provisional names (e.g., FAM for "family with sequence similarity") can persist until better evidence emerges.[86][87]
A prominent example is the HOX gene clusters, products of ancient whole-genome duplications, in which paralogs are named by cluster (e.g., HOXA, HOXB) and positional numeral (e.g., HOXA1, HOXB1), reflecting their collinear arrangement and shared homeobox domain. This system highlights duplication events while avoiding functional speculation.[86]
Database Resources
Several key databases serve as essential repositories for gene duplication data, enabling researchers to access annotated genomic regions, evolutionary histories, and comparative analyses across species. These resources integrate high-throughput sequencing data to facilitate the study of duplication events, their ages, and functional implications, while providing tools for visualization and programmatic access.
Ensembl's Compara database offers comprehensive paralog trees derived from gene orthology and paralogy predictions, where paralogues are identified as genes sharing a most recent common ancestor via duplication events. These trees annotate duplication ages through reconciliation with species trees, distinguishing recent from ancient duplications, and include synteny viewers for visualizing conserved genomic blocks affected by duplications. The platform supports API access for querying homology data and has incorporated 2020s sequencing advancements, such as long-read assemblies, in its latest releases, including Ensembl 115 (September 2025) with expanded vertebrate and invertebrate genome coverage.[88][89][90]
The UCSC Genome Browser provides dedicated tracks for segmental duplications, displaying putative duplicated regions with color-coded levels of support based on sequence similarity and alignment evidence (data from 2013, last updated 2014 for GRCh38/hg38). This resource aids in identifying low-copy repeats and tandem duplicates within human and other mammalian genomes. While the browser integrates recent assemblies like GRCh38.p14 (2023), the specific segmental duplication track has not been updated; for refined boundaries from newer data, such as the Telomere-to-Telomere (T2T) Consortium's CHM13 assembly (2022), users may employ custom tracks or external resources.[91][92]
For plant-specific analyses, Phytozome hosts comparative genomics data across hundreds of Archaeplastida species, using tools like InParanoid-DIAMOND to cluster paralogous gene families and detect duplication-driven expansions. It features synteny browsers via JBrowse and BioMart for cross-species queries, with post-2020 updates including over 149 new genomes (up to October 2025, e.g., Nicotiana benthamiana v1.0) and improved homology alignments from long-read sequencing. As of Phytozome v14 (2025), it incorporates pangenome datasets such as BrachyPan (54 Brachypodium distachyon lines) and CowpeaPan (8 Vigna unguiculata genomes) to enhance duplication detection in diverse accessions.[93]
DupMasker is a specialized annotation tool for segmental duplications, particularly in primates, employing a library of consensus duplicon sequences (based on 2008 data) to mask and annotate duplicated regions with metrics like percent divergence and alignment scores. Integrated with RepeatMasker, it outputs GFF-formatted results for downstream analysis and supports modern search engines like RMBlast. For analyses with recent primate assemblies, supplementation with updated repeat libraries is recommended.[94][95]
OrthoDB complements these by cataloging orthologs and paralogs across eukaryotes and prokaryotes, using hierarchical orthology inference to distinguish duplication-derived paralogs from speciation-derived orthologs. This enables cross-species comparisons of gene family evolution, with tools for phyloprofiling duplication patterns in diverse taxa. The latest version, OrthoDB v12.2 (updated 2024), covers 5,952 eukaryotic species with expanded gene loci coordinates and CDS data.[96][97]

| Database | Key Features for Gene Duplication | Primary Organisms | Access Methods |
|---|---|---|---|
| Ensembl Compara | Paralog trees, duplication age annotation, synteny viewers | Vertebrates, invertebrates | Web interface, API, BioMart |
| UCSC Genome Browser | Segmental duplication tracks with similarity levels (2013 data, updated 2014) | Mammals (e.g., human) | Interactive browser, custom tracks |
| Phytozome | Paralogy clustering, synteny via JBrowse, pangenome datasets | Plants (Archaeplastida) | BioMart, genome browsers |
| DupMasker | Duplicon annotation, divergence metrics (2008 library) | Primates | Command-line tool, GFF output |
| OrthoDB | Ortholog-paralog distinction, phyloprofiles (v12.2, 2024) | Eukaryotes, prokaryotes | Web search, downloads |
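
As an illustration of the programmatic access mentioned for Ensembl Compara, the following minimal Python sketch builds a paralog query for the public Ensembl REST service and summarizes the response. It assumes the `homology/symbol/:species/:symbol` endpoint with its `type` and `format` parameters, and it assumes that each condensed homology record carries `id` and `taxonomy_level` fields; these details should be checked against the current Ensembl REST documentation. The function names are hypothetical and chosen for this example only.

```python
import json
import urllib.parse
import urllib.request

# Public Ensembl REST server (assumed base URL for this sketch).
ENSEMBL_REST = "https://rest.ensembl.org"


def paralog_url(symbol, species="human"):
    """Build a homology query URL restricted to paralogues of one gene symbol."""
    query = urllib.parse.urlencode({"type": "paralogues", "format": "condensed"})
    path = "/homology/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    return ENSEMBL_REST + path + "?" + query


def fetch_homology(symbol, species="human", timeout=30):
    """Download and decode the JSON response (needs network access)."""
    request = urllib.request.Request(
        paralog_url(symbol, species), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize_paralogs(payload):
    """Return (paralog_id, taxonomy_level) pairs from a homology response.

    Assumed layout: payload["data"][i]["homologies"][j] holds "id" and
    "taxonomy_level"; missing fields are returned as None.
    """
    pairs = []
    for entry in payload.get("data", []):
        for homology in entry.get("homologies", []):
            pairs.append((homology.get("id"), homology.get("taxonomy_level")))
    return pairs
```

A call such as `summarize_paralogs(fetch_homology("HOXA1"))` would then list each reported paralog together with the taxonomic level at which the duplication node was placed, which gives a rough indication of duplication age.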
