GenBank
| Content | |
|---|---|
| Description | Nucleotide sequences for more than 300,000 organisms with supporting bibliographic and biological annotation. |
| Data types captured | |
| Organisms | All |
| Contact | |
| Research center | NCBI |
| Primary citation | PMID 21071399 |
| Release date | 1982 |
| Access | |
| Data format | |
| Website | NCBI |
| Download URL | ncbi ftp |
| Web service URL | |
| Tools | |
| Web | BLAST |
| Standalone | BLAST |
| Miscellaneous | |
| License | Unclear[1] |
The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI; a part of the National Institutes of Health in the United States) as part of the International Nucleotide Sequence Database Collaboration (INSDC).
In October 2024, GenBank contained 34 trillion base pairs from over 4.7 billion nucleotide sequences and more than 580,000 formally described species.[2][3]
The database was started in 1982 by Walter Goad and Los Alamos National Laboratory. GenBank has become an important database for research in biological fields and has grown exponentially, doubling roughly every 18 months.[4][5][3]
GenBank is built from direct submissions by individual laboratories, as well as bulk submissions from large-scale sequencing centers.
Submissions
Only original sequences can be submitted to GenBank. Direct submissions are made using BankIt, a Web-based form, or the stand-alone submission program table2asn. Upon receipt of a sequence submission, the GenBank staff examines the originality of the data, assigns an accession number to the sequence, and performs quality assurance checks. The submissions are then released to the public database, where the entries are retrievable by Entrez or downloadable by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence-tagged site (STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data are most often made by large-scale sequencing centers. The GenBank direct submissions group also processes complete microbial genome sequences.[6][7]
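As an illustration of retrieval through Entrez, the following minimal sketch builds a request URL for NCBI's E-utilities `efetch` endpoint. The helper name `build_efetch_url` and the accession `U12345` are placeholders chosen for this example, and the sketch only constructs the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "efetch" endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, rettype: str = "gb") -> str:
    """Return a URL requesting one nucleotide record as plain text.

    `build_efetch_url` is a hypothetical helper for this example.
    """
    params = {
        "db": "nuccore",     # Entrez nucleotide database
        "id": accession,     # accession number of the record
        "rettype": rettype,  # "gb" for a flat-file record, "fasta" for FASTA
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# "U12345" is a placeholder accession used only to show the URL shape.
url = build_efetch_url("U12345")
print(url)
```

Passing the resulting URL to any HTTP client should return the record as text, subject to NCBI's usage policies and rate limits.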
History
Walter Goad of the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory (LANL) and others established the Los Alamos Sequence Database in 1979, which culminated in 1982 with the creation of the public GenBank.[8] Funding was provided by the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Department of Defense. LANL collaborated on GenBank with the firm Bolt, Beranek, and Newman, and by the end of 1983 more than 2,000 sequences were stored in it.
In the mid-1980s, the Intelligenetics bioinformatics company at Stanford University managed the GenBank project in collaboration with LANL.[9] As one of the earliest bioinformatics community projects on the Internet, the GenBank project started the BIOSCI/Bionet newsgroups to promote open-access communication among bioscientists. Between 1989 and 1992, the GenBank project transitioned to the newly created National Center for Biotechnology Information (NCBI).[10]


Growth
The GenBank release notes for release 250.0 (June 2022) state that "from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months".[11][12] As of 15 June 2022, GenBank release 250.0 contained over 239 million loci and 1.39 trillion nucleotide bases, from 239 million reported sequences.[11]
The GenBank database includes additional data sets that are constructed mechanically from the main sequence data collection, and therefore are excluded from this count.
| Organism | Base pairs |
|---|---|
| Triticum aestivum | 2.15443744183×10^11 |
| SARS-CoV-2 | 1.65771825746×10^11 |
| Hordeum vulgare subsp. vulgare | 1.01344340096×10^11 |
| Mus musculus | 3.0614386913×10^10 |
| Homo sapiens | 2.7834633853×10^10 |
| Avena sativa | 2.1127939362×10^10 |
| Escherichia coli | 1.5517830491×10^10 |
| Klebsiella pneumoniae | 1.1144687122×10^10 |
| Danio rerio | 1.0890148966×10^10 |
| Bos taurus | 1.0650671156×10^10 |
| Triticum turgidum subsp. durum | 9.981529154×10^9 |
| Zea mays | 7.412263902×10^9 |
| Avena insularis | 6.924307246×10^9 |
| Secale cereale | 6.749247504×10^9 |
| Rattus norvegicus | 6.548854408×10^9 |
| Aegilops longissima | 5.920483689×10^9 |
| Canis lupus familiaris | 5.776499164×10^9 |
| Aegilops sharonensis | 5.272476906×10^9 |
| Sus scrofa | 5.179074907×10^9 |
| Rhinatrema bivittatum | 5.178626132×10^9 |
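The 18-month doubling time quoted from the release notes can be turned into more familiar growth figures. The sketch below assumes idealized exponential growth with a constant doubling time, which the real database only approximates; the function names are illustrative.

```python
import math


def annual_growth_factor(doubling_months: float) -> float:
    """Factor by which a quantity grows in 12 months, given its doubling time."""
    return 2 ** (12 / doubling_months)


def months_to_grow(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor` at a constant doubling time."""
    return doubling_months * math.log2(factor)


yearly = annual_growth_factor(18)   # 2 ** (2/3)
tenfold = months_to_grow(10, 18)    # 18 * log2(10)
print(f"{yearly:.2f}x per year; tenfold growth in {tenfold:.1f} months")
# prints "1.59x per year; tenfold growth in 59.8 months"
```

Under this assumption, an 18-month doubling time corresponds to roughly 59% growth per year and a tenfold increase about every five years.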
Limitations
An analysis of GenBank and other services for the molecular identification of clinical blood culture isolates using 16S rRNA sequences[13] showed that such analyses were more discriminative when GenBank was combined with other services such as EzTaxon-e[14] and the BIBI[15] databases.
GenBank may contain sequences wrongly assigned to a particular species because the initial identification of the organism was wrong. A 2021 study showed that 75% of mitochondrial Cytochrome c oxidase subunit I sequences were wrongly assigned to the fish Nemipterus mesoprion, a result of the continued use of sequences from initially misidentified individuals.[16] The authors provide recommendations on how to avoid further distribution of publicly available sequences with incorrect scientific names.
Numerous published manuscripts have identified erroneous sequences in GenBank.[17][18][19] These include not only incorrect species assignments (which can have different causes) but also chimeras and accession records with sequencing errors. A 2022 study on the quality of all Cytochrome b records of birds further showed that 45% of the identified erroneous records lack a voucher specimen, which prevents a reassessment of the species identification.[20]
Another problem is that sequence records are often submitted as anonymous sequences without species names (e.g. as "Pelomedusa sp. A CK-2014") because the species are either unknown or withheld for publication purposes. Even after the species have been identified or published, these sequence records are often not updated and thus may cause ongoing confusion.[21]
See also
- Ensembl
- Human Protein Reference Database (HPRD)
- Sequence analysis
- UniProt
- List of sequenced eukaryotic genomes
- List of sequenced archaeal genomes
- RefSeq — the Reference Sequence Database
- Geneious — includes a GenBank Submission Tool
- Open science data
- Open Standard
References
- ^ The download page at UCSC says "NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims, and therefore cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in GenBank."
- ^ Eric W Sayers; Mark Cavanaugh; Karen Clark; Kim D Pruitt; Conrad L Schoch; Stephen T Sherry; Ilene Karsch-Mizrachi (7 January 2022). "GenBank". Nucleic Acids Research. 50 (D1): D161 – D164. doi:10.1093/nar/gkab1135. PMC 8690257. PMID 34850943.
- ^ a b Sayers, Eric W; Cavanaugh, Mark; Frisse, Linda; Pruitt, Kim D; Schneider, Valerie A; Underwood, Beverly A; Yankie, Linda; Karsch-Mizrachi, Ilene (2025-01-06). "GenBank 2025 update". Nucleic Acids Research. 53 (D1): D56 – D61. doi:10.1093/nar/gkae1114. ISSN 0305-1048. PMC 11701615. PMID 39558184.
- ^ Benson D; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Wheeler, D. L.; et al. (2008). "GenBank". Nucleic Acids Research. 36 (Database): D25 – D30. doi:10.1093/nar/gkm929. PMC 2238942. PMID 18073190.
- ^ Benson D; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Sayers, E. W.; et al. (2009). "GenBank". Nucleic Acids Research. 37 (Database): D26 – D31. doi:10.1093/nar/gkn723. PMC 2686462. PMID 18940867.
- ^ "How to submit data to GenBank". NCBI. Retrieved 20 July 2022.
- ^ "GenBank Submission Types". NCBI. Retrieved 20 July 2022.
- ^ Hanson, Todd (2000-11-21). "Walter Goad, GenBank founder, dies". Newsbulletin: obituary. Los Alamos National Laboratory. Archived from the original on 2008-11-07.
- ^ LANL GenBank History
- ^ Benton D (1990). "Recent changes in the GenBank On-line Service". Nucleic Acids Research. 18 (6): 1517–1520. doi:10.1093/nar/18.6.1517. PMC 330520. PMID 2326192.
- ^ a b c "GenBank release notes (Release 250)". NCBI. 15 June 2022. Retrieved 20 July 2022.
- ^ Benson, D. A.; Cavanaugh, M.; Clark, K.; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Sayers, E. W. (2012). "GenBank". Nucleic Acids Research. 41 (Database issue): D36 – D42. doi:10.1093/nar/gks1195. PMC 3531190. PMID 23193287.
- ^ Kyung Sun Park; Chang-Seok Ki; Cheol-In Kang; Yae-Jean Kim; Doo Ryeon Chung; Kyong Ran Peck; Jae-Hoon Song; Nam Yong Lee (May 2012). "Evaluation of the GenBank, EzTaxon, and BIBI Services for Molecular Identification of Clinical Blood Culture Isolates That Were Unidentifiable or Misidentified by Conventional Methods". J. Clin. Microbiol. 50 (5): 1792–1795. doi:10.1128/JCM.00081-12. PMC 3347139. PMID 22403421.
- ^ EzTaxon-e Database eztaxon-e.ezbiocloud.net (archive accessed 25 March 2021)
- ^ leBIBI V5 pbil.univ-lyon1.fr (archive accessed 25 March 2021)
- ^ Ogwang, Joel; Bariche, Michel; Bos, Arthur R. (2021). "Genetic diversity and phylogenetic relationships of threadfin breams (Nemipterus spp.) from the Red Sea and eastern Mediterranean Sea". Genome. 64 (3): 207–216. doi:10.1139/gen-2019-0163. PMID 32678985.
- ^ van den Burg, Matthijs P.; Herrando-Pérez, Salvador; Vieites, David R. (13 August 2020). "ACDC, a global database of amphibian cytochrome-b sequences using reproducible curation for GenBank records". Scientific Data. 7 (1): 268. Bibcode:2020NatSD...7..268V. doi:10.1038/s41597-020-00598-9. eISSN 2052-4463. PMC 7426930. PMID 32792559.
- ^ Li, Xiaobing; Shen, Xuejuan; Chen, Xiao; Xiang, Dan; Murphy, Robert W.; Shen, Yongyi (6 February 2018). "Detection of Potential Problematic Cytb Gene Sequences of Fishes in GenBank". Frontiers in Genetics. 9: 30. doi:10.3389/fgene.2018.00030. eISSN 1664-8021. PMC 5808227. PMID 29467794.
- ^ Heller, Philip; Casaletto, James; Ruiz, Gregory; Geller, Jonathan (7 August 2018). "A database of metazoan cytochrome c oxidase subunit I gene sequences derived from GenBank with CO-ARBitrator". Scientific Data. 5 (1). Bibcode:2018NatSD...580156H. doi:10.1038/sdata.2018.156. eISSN 2052-4463. PMC 6080493. PMID 30084847.
- ^ Van Den Burg, Matthijs P.; Vieites, David R. (22 September 2022). "Bird genetic databases need improved curation and error reporting to NCBI". Ibis. doi:10.1111/ibi.13143. eISSN 1474-919X. hdl:10261/282622. ISSN 0019-1019.
- ^ Garg, Akhil; Leipe, Detlef; Uetz, Peter (2019-12-10). "The disconnect between DNA and species names: lessons from reptile species in the NCBI taxonomy database". Zootaxa. 4706 (3): 401–407. doi:10.11646/zootaxa.4706.3.1. ISSN 1175-5334.
This article incorporates public domain material from NCBI Handbook. National Center for Biotechnology Information.
External links
- GenBank
- Example sequence record, for hemoglobin beta
- BankIt
- Sequin — a stand-alone software tool developed by the NCBI for submitting and updating entries to the GenBank sequence database.
- EMBOSS — free, open source software for molecular biology
- GenBank, RefSeq, TPA and UniProt: What's in a Name?
GenBank
Introduction and Overview
Definition and Purpose
GenBank is an open-access, annotated collection of all publicly available nucleotide sequences and their associated biological information, maintained by the National Center for Biotechnology Information (NCBI) at the U.S. National Institutes of Health (NIH).[1] As the NIH's primary genetic sequence database, it serves as a comprehensive repository designed to provide unrestricted access to DNA and RNA sequence data for the global scientific community.[5] Established in 1982 under NIH funding at Los Alamos National Laboratory, GenBank was created to centralize the rapidly expanding volume of DNA sequence data produced by early sequencing technologies.[7] Its core objectives include facilitating scientific discovery through free and open access to genetic information, thereby supporting advances in genomics, evolutionary biology, and medicine.[5] Specifically, it enables analyses such as sequence comparison, gene function prediction, and phylogenetic studies, which underpin research in molecular biology and related fields.[1] GenBank records integrate nucleotide sequences with derived protein translations, allowing users to explore coding regions and their translated products without needing separate databases.[8] As a member of the International Nucleotide Sequence Database Collaboration (INSDC), it synchronizes data daily with partner repositories ENA and DDBJ to ensure a unified global resource.[1]
Scope and Content
GenBank encompasses a vast array of nucleotide sequence data, primarily consisting of DNA and RNA sequences submitted by researchers worldwide. These include genomic DNA from chromosomes and organelles, messenger RNA (mRNA) transcripts, ribosomal RNA (rRNA), transfer RNA (tRNA), and non-coding regions such as regulatory elements and introns. Each sequence entry is accompanied by annotations that describe biological features, including gene locations, protein products, exons, introns, coding sequences (CDS), and functional elements like promoters and polyadenylation sites. Additionally, entries link to bibliographic references, such as peer-reviewed publications, to provide context for the sequence's discovery and characterization.[9]
The database's coverage is exceptionally broad, with sequences from over 581,000 formally named species as well as unnamed organisms in metagenomic studies, spanning all domains of life: viruses, bacteria, archaea, and eukaryotes ranging from unicellular protists to complex multicellular organisms like plants, animals, and fungi. This includes both complete genome assemblies and partial sequences derived from targeted sequencing efforts, such as expressed sequence tags (ESTs) or amplicons from specific loci. Metagenomic samples from environmental sources, like soil microbiomes or ocean water, further extend the scope to uncultured microbial communities, enabling research into biodiversity and ecosystem dynamics. By late 2024, GenBank held more than 4.7 billion sequence records totaling approximately 34 trillion base pairs, a figure that continued to grow rapidly into 2025.[10][6]
Content in GenBank is systematically organized into divisions to facilitate targeted access and management. Standard divisions categorize sequences by organism type or source, such as PRI for primate sequences (including human), ROD for rodents, PLN for plants and fungi, BCT for bacteria, VRL for viruses, and ENV for environmental samples. Specialized divisions handle high-throughput data, including WGS for whole genome shotgun assemblies, TSA for transcriptome shotgun assemblies, and GSS for genome survey sequences. This structure supports efficient storage and retrieval, with each division subdivided into numbered files (e.g., gbpri1.seq for the first part of primate sequences) to manage the enormous volume of data. As of Release 268.0 in August 2025, the database exceeded 47 trillion base pairs across traditional and set-based records.[9]
A distinctive feature of GenBank is its emphasis on annotation depth and standardization, which enhances the interpretability of sequences for scientific use. Annotations employ controlled vocabularies defined by the International Nucleotide Sequence Database Collaboration (INSDC), ensuring consistent terminology for features, such as "/gene" for gene names, "/product" for protein descriptions, and "/inference" for evidence supporting predictions like similarity to known sequences or experimental validation. This richness distinguishes GenBank from raw sequence repositories, providing users with curated insights into sequence function, evolution, and variation without requiring extensive post-processing. Bibliographic links further integrate sequences with the primary literature, fostering reproducibility and advancing genomic research across disciplines.[9]
History and Development
Origins and Early Years
GenBank was initiated in 1982 by Walter Goad at the Los Alamos National Laboratory (LANL), with funding from the U.S. Department of Energy (DOE) as well as contributions from the National Institutes of Health (NIH) and other agencies, to address the increasing influx of DNA sequences produced through manual sequencing methods that were becoming more prevalent in molecular biology research.[11][12] Goad, a biophysicist in LANL's Theoretical Biology and Biophysics Group, envisioned a centralized repository to collect, annotate, and distribute nucleic acid sequence data, filling a critical need as the volume of published sequences grew beyond what individual researchers could manage.[13]
Early operations centered on quarterly releases of the database, distributed primarily via magnetic tapes to academic and research institutions worldwide, allowing researchers to access the data on their local systems. The inaugural public release, known as Release 3, occurred in December 1982 and included 606 sequences comprising 680,338 base pairs, reflecting the modest scale of sequence data available at the time.[2][14] Key members of the LANL team, including Christian Burks, played pivotal roles in curating entries, developing submission protocols, and ensuring data quality amid the nascent field's demands.[12]
The team encountered substantial challenges from the exponential growth of sequence submissions, which rapidly outstripped the computing resources and storage capabilities of 1980s hardware, prompting ongoing optimizations in data compression and retrieval efficiency. To facilitate broad accessibility and portability across diverse computing environments, GenBank adopted a text-based flat-file format from the outset, featuring structured records with sequence data, annotations, and references, supplemented by basic indexing for keyword-based searches.[13][1] This design emphasized simplicity and interoperability, enabling easy transfer via tapes without reliance on proprietary software.[14]
Key Milestones and Transitions
In 1988, the U.S. Congress established the National Center for Biotechnology Information (NCBI) within the National Library of Medicine at the National Institutes of Health (NIH) to advance computational biosciences, including the management of genetic sequence data.[3] This marked the beginning of GenBank's transition from its initial custodians at Los Alamos National Laboratory to federal oversight under NIH. The handover process spanned from 1989 to 1992, culminating in October 1992 when NCBI assumed full responsibility for GenBank's operations, data distribution, and development.[15] Concurrently, NCBI introduced the Entrez retrieval system in 1991, enabling integrated online access to GenBank sequences alongside related protein, taxonomy, and literature data, which revolutionized user interaction with the database.[3]
The 1990s brought pivotal technological integrations that expanded GenBank's utility and reach. In 1990, NCBI developed the Basic Local Alignment Search Tool (BLAST), a high-speed algorithm for identifying sequence similarities against GenBank entries, facilitating rapid genomic comparisons essential for emerging molecular biology research.[3] Throughout the decade, GenBank adopted internet-based distribution methods, including anonymous FTP access and web interfaces, shifting from primary reliance on CD-ROMs to network delivery, which accelerated data sharing as submissions grew exponentially.[16] GenBank's release numbering system, initiated with Release 3 in December 1982, continued bimonthly, providing structured versioning of the flat-file database to track updates systematically.[2]
The 2000s and 2010s saw GenBank adapt to the explosion of high-throughput sequencing data, driven by advances in genomic technologies. By December 2000 (Release 121), GenBank had amassed over 10 million sequences, encompassing 11 billion bases, reflecting the impact of large-scale projects like the Human Genome Project.[2] To accommodate unfinished high-throughput genomic sequences, NCBI created the High-Throughput Genomic Sequences (HTGS) division in 1999, allowing rapid deposition of draft data without full assembly.[17] By 2010, GenBank began incorporating next-generation sequencing (NGS) outputs through the Whole Genome Shotgun (WGS) division and coordination with the Sequence Read Archive (SRA), handling the surge in short-read data from platforms like Illumina, which multiplied sequence volumes by orders of magnitude.[2]
From 2020 to 2025, GenBank underwent transitions to manage escalating data volumes and specialized applications, including enhanced cloud-based infrastructure for associated raw data. The COVID-19 pandemic drove a surge in viral sequence submissions, with SARS-CoV-2 genomes increasing significantly and contributing to overall database growth.[6] NCBI made SRA data, which includes raw reads linked to GenBank entries, available via cloud platforms like AWS and Google Cloud, enabling scalable access to petabyte-scale datasets without local downloads.[18][19] For metagenomics, submission guidelines were refined as of March 2025 to streamline handling of environmental and microbiome sequences, encouraging raw read submissions and detailed metadata to support assembly and annotation of uncultured microbial communities through targeted wizards and validation tools.[20][21]
Organization and Collaboration
International Nucleotide Sequence Database Collaboration (INSDC)
The International Nucleotide Sequence Database Collaboration (INSDC) was established in 1987 as a formal agreement among GenBank, the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database (now the European Nucleotide Archive or ENA at EMBL-EBI), and the DNA Data Bank of Japan (DDBJ) to coordinate the collection, annotation, and dissemination of nucleotide sequence data worldwide.[22] This collaboration arose from earlier efforts in 1986 between GenBank and EMBL to standardize data formats, with DDBJ joining to create a unified framework that prevents data redundancy and ensures comprehensive global coverage of publicly available nucleotide sequences.[23] The primary purpose is to facilitate synchronized exchange of core nucleotide data, enabling researchers to submit sequences to any partner database while guaranteeing identical access across all three archives.[24]
Submitters may choose any partner database, though it is recommended to use the one closest geographically or most convenient for support: GenBank, managed by the National Center for Biotechnology Information (NCBI) in the United States; ENA at EMBL-EBI in Europe; and DDBJ, operated by the National Institute of Genetics in Japan.[24] To maintain consistency, the partners engage in daily data mirroring, exchanging new and updated records in standardized formats such as the Feature Table, which ensures that the core datasets, comprising annotated nucleotide sequences, are identical across all databases without duplication.[25] This synchronization process supports redundancy for data preservation and allows seamless querying from any INSDC portal.[22]
While the core data are mirrored identically, each partner adds unique value through region-specific enhancements. For instance, GenBank incorporates U.S.-focused biological annotations linked to resources like PubMed and includes dedicated records for patent sequences derived from intellectual property filings, which are not duplicated in ENA or DDBJ but remain accessible globally via the shared framework.[26] The total holdings of the INSDC, synchronized across partners, comprise over 5.7 billion sequences as of mid-2025, underscoring the collaboration's role in scaling genomic data infrastructure.[21]
In the 2020s, the INSDC has evolved to address emerging data types and accessibility needs, including joint development of standards for metagenomic and environmental sequencing data in partnership with the Genomics Standards Consortium to improve metadata consistency for microbiome and biodiversity studies.[22] Additionally, the collaboration has reinforced open data policies aligned with FAIR (Findable, Accessible, Interoperable, Reusable) principles, mandating unrestricted public access to all deposited sequences via unique accession numbers and prohibiting proprietary restrictions on core nucleotide data.[27] In 2023, the founding members signed a Founders Arrangement to formalize their collaboration, and the INSDC has since developed a Membership Arrangement to attract additional qualified nucleotide sequence archives as new members, enhancing global representation.[28][29] These updates ensure the INSDC remains adaptable to high-throughput sequencing advancements while upholding its foundational commitment to equitable global data sharing.[24]
Data Management and Standards
GenBank employs a multi-tiered curation process to maintain the integrity and utility of its sequence data, involving both professional annotation by NCBI staff for high-profile or complex entries, such as those from influenza surveillance or reference genomes, and community-driven updates through author revisions.[6] NCBI staff conduct manual reviews and annotations for select sequences, ensuring accuracy in biological interpretation, while submitters can request updates or corrections post-release, which are verified and incorporated by NCBI curators.[30] All annotations in GenBank records use the Feature Table format, a structured system for describing sequence features like genes, exons, and regulatory elements, which facilitates consistent representation across entries.[8]
Adherence to established standards is central to GenBank's data management, with the database following the International Nucleotide Sequence Database Collaboration (INSDC) Feature Table Definition (FTD) document to define feature keys, locations, and qualifiers for annotations.[31] This ensures interoperability and precision in describing biological entities, supplemented by controlled vocabularies such as those from the Sequence Ontology for terms related to genomic features.[32] Validation checks are applied during processing, encompassing automated and manual assessments of sequence integrity, such as verifying base composition and length, alongside nomenclature compliance to prevent errors in organism naming or feature labeling.[30]
Internal management tools at NCBI support ongoing data quality through pipelines designed for error detection and mitigation, including contamination screening via the Foreign Contaminant Screen (FCS) tool to identify non-target sequences in submissions.[6] GenBank data are released bimonthly in versioned flat files, allowing users to track changes and access complete datasets via FTP, with daily incremental updates for timely synchronization across INSDC partners.[1] These releases incorporate version control to preserve historical records while enabling corrections.[33]
NCBI places no restrictions on the use or distribution of GenBank data and charges no licensing fees, though submitters may claim patent, copyright, or other intellectual property rights in the data they have submitted.[1] For pre-publication sequences, NCBI handles confidential submissions by withholding them from public access until the specified release date or publication, at which point they enter the open archive.[30]
Submission and Annotation
Submission Processes
Researchers contribute new nucleotide sequences to GenBank through several established pathways designed to accommodate varying submission sizes and complexities. For small-scale submissions, such as individual sequences or sets up to 500 entries or 50 kb total, the web-based BankIt tool allows users to enter data interactively via a browser interface, guiding the preparation of sequence and feature information.[25] Larger or bulk submissions, including annotated genomes, use the standalone table2asn software (the successor to tbl2asn), which converts tabular data and FASTA files into the required ASN.1 format (.sqn) for submission. Sequencing centers and high-volume submitters often employ direct FTP uploads to NCBI servers or email submissions to [email protected], facilitating efficient transfer of extensive datasets.[34][25]
All submissions require specific formats and mandatory metadata to ensure compatibility and traceability. Sequence data must be provided in FASTA format, with annotations in ASN.1 (.sqn) for structured features. Essential metadata includes the source organism (with taxonomy details), submitter and author information, publication references (if applicable), and collection details such as isolate, strain, or geographic location. These elements are verified during submission to align with International Nucleotide Sequence Database Collaboration (INSDC) standards.[35][36]
The submission workflow begins with pre-submission validation using built-in tools like the validator in table2asn or the Submission Portal's automated checks, which detect issues such as format errors, contamination, or chimeric sequences. Once data are submitted, NCBI staff perform biological review, assigning provisional accession numbers typically within two working days; examples include standard nucleotide accessions like U12345 (one letter followed by five digits) or Whole Genome Shotgun (WGS) accessions such as AABM01000000.
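The two accession shapes mentioned above can be recognized with simple regular expressions. This is a minimal sketch whose patterns are derived only from the two examples in the text (one letter plus five digits, and four letters plus eight digits); real GenBank accessions come in additional formats, and the function name is illustrative.

```python
import re

# Patterns inferred from the two examples in the text only; they do not
# cover every accession format that GenBank issues.
STANDARD = re.compile(r"[A-Z]\d{5}")        # e.g. U12345
WGS_MASTER = re.compile(r"[A-Z]{4}\d{8}")   # e.g. AABM01000000


def classify_accession(accession: str) -> str:
    """Label an accession as 'standard', 'wgs', or 'unknown'."""
    if STANDARD.fullmatch(accession):
        return "standard"
    if WGS_MASTER.fullmatch(accession):
        return "wgs"
    return "unknown"


for acc in ("U12345", "AABM01000000", "12345"):
    print(acc, "->", classify_accession(acc))
```

Using `fullmatch` rather than `match` keeps a longer string that merely starts with a valid prefix from being accepted.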
Full processing, including integration into public releases, takes days to weeks depending on complexity, after which data undergo post-submission quality control.[25][8]
GenBank handles substantial submission volumes, with over 7 million new sequence records added in 2023 alone. To manage this scale, specialized tracks exist for high-priority data types, such as complete genomes submitted via the Genome Submission Portal and metagenomic assemblies through the Whole Genome Shotgun (WGS) pathway, ensuring streamlined processing for large-scale genomic projects.[2][37]
Annotation Guidelines and Quality Control
GenBank annotations are structured using a feature table format that employs qualifier-value pairs to describe biological elements within nucleotide sequences. These pairs follow the syntax /qualifier="value", where qualifiers provide specific attributes such as gene names or product descriptions. For instance, the qualifier /gene="ABC1" identifies a gene symbol, while /product="protein X" specifies the encoded protein. This system allows for precise, machine-readable descriptions of features like coding sequences (CDS), genes, and sources.[31]
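A small parser makes the qualifier-value syntax concrete. The sketch below handles only single-line qualifiers of the form /name="value" or /name=value; it is an illustration, not the INSDC grammar, and it ignores multi-line values and escaped quotes.

```python
import re

# Matches /name="value" (quoted) or /name=value (unquoted, no spaces).
QUALIFIER = re.compile(
    r'^/(?P<name>[A-Za-z_]+)=(?:"(?P<quoted>.*)"|(?P<bare>\S+))$'
)


def parse_qualifier(line: str) -> tuple[str, str]:
    """Split one qualifier line into a (name, value) pair."""
    match = QUALIFIER.match(line.strip())
    if match is None:
        raise ValueError(f"not a qualifier line: {line!r}")
    quoted = match.group("quoted")
    value = quoted if quoted is not None else match.group("bare")
    return match.group("name"), value


print(parse_qualifier('/gene="ABC1"'))          # ('gene', 'ABC1')
print(parse_qualifier('/product="protein X"'))  # ('product', 'protein X')
```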
Mandatory fields ensure basic metadata integrity, with the source organism qualifier /organism required on every source feature to denote the biological origin, accompanied by /mol_type (e.g., /mol_type="genomic DNA") to classify the sequence type. Optional qualifiers enhance detail, such as /locus_tag for unique gene identifiers within a record or /note for additional context. Submitters are responsible for providing accurate annotations, with NCBI offering templates and validation tools like table2asn to facilitate compliance during submission.[38][31]
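The mandatory-field rule can be expressed as a simple check. This sketch assumes a source feature has already been parsed into a dictionary of qualifier names and values; that dictionary shape and the function name are assumptions made for illustration, not part of any NCBI tool.

```python
# Qualifiers the text describes as required on every source feature.
REQUIRED_SOURCE_QUALIFIERS = ("organism", "mol_type")


def missing_source_qualifiers(qualifiers: dict[str, str]) -> list[str]:
    """Return the required qualifiers that are absent or empty."""
    return [
        name for name in REQUIRED_SOURCE_QUALIFIERS
        if not qualifiers.get(name)
    ]


complete = {"organism": "Homo sapiens", "mol_type": "genomic DNA"}
partial = {"organism": "Homo sapiens", "note": "example record"}

print(missing_source_qualifiers(complete))  # []
print(missing_source_qualifiers(partial))   # ['mol_type']
```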
Evidence tags distinguish between experimental and computational support for annotations. The /experiment qualifier documents direct evidence, such as /experiment="northern blot", while /inference captures computational predictions, formatted as /inference="ab initio prediction:Prodigal:2.6". These tags promote transparency and reproducibility, adhering to controlled vocabularies to maintain consistency across submissions.[39]
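The /inference example above has three colon-separated parts. The following sketch splits such a value under the simplifying assumption that it always has exactly the shape method:program:version; the full INSDC inference grammar allows more variants than this, and the field names are chosen here for illustration.

```python
from typing import NamedTuple


class Inference(NamedTuple):
    method: str    # e.g. "ab initio prediction"
    program: str   # e.g. "Prodigal"
    version: str   # e.g. "2.6"


def parse_inference(value: str) -> Inference:
    """Split a 'method:program:version' inference value.

    Assumes exactly three colon-separated parts, which is a
    simplification of the real qualifier grammar.
    """
    parts = [part.strip() for part in value.split(":")]
    if len(parts) != 3:
        raise ValueError(f"expected three parts, got {len(parts)}: {value!r}")
    return Inference(*parts)


result = parse_inference("ab initio prediction:Prodigal:2.6")
print(result.method, "|", result.program, "|", result.version)
```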
Quality control begins with automated validation during submission processing, using tools to check sequence validity (e.g., detecting internal stop codons or invalid characters), nomenclature consistency (e.g., standardized organism names from the NCBI Taxonomy database), and potential contamination (e.g., mismatched primer sequences or unexpected organism assignments). Common errors, such as missing source descriptors or improper geographic location codes, generate discrepancy reports for correction. Incomplete or erroneous submissions may be rejected or require revisions before acceptance.[40][41]
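Two of the sequence-validity checks mentioned above, invalid characters and internal stop codons, can be illustrated with a small sketch. It assumes the standard genetic code and a coding sequence read in frame from its first base; production validators use the translation table and reading frame recorded in the submission, and the function names here are hypothetical.

```python
# IUPAC nucleotide codes: the bases plus the standard ambiguity symbols.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def invalid_characters(sequence):
    """Return the sorted list of characters that are not IUPAC nucleotide codes."""
    return sorted(set(sequence.upper()) - IUPAC_NUCLEOTIDES)


def internal_stop_positions(cds):
    """Return 0-based offsets of in-frame stop codons before the final codon.

    A stop as the last complete codon is expected in a complete CDS and
    is therefore not reported.
    """
    cds = cds.upper()
    last_codon_start = len(cds) - len(cds) % 3 - 3
    return [i for i in range(0, last_codon_start, 3) if cds[i:i + 3] in STOP_CODONS]


print(invalid_characters("ACGTXZ-"))
print(internal_stop_positions("ATGTAAGGGTGA"))
```

Reporting offsets rather than raising an error mirrors the discrepancy-report approach described above, where problems are listed for the submitter to correct.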
For complex annotations, NCBI staff conduct manual reviews to verify intricate features, ensuring alignment with INSDC standards. This hybrid approach minimizes errors while handling the volume of submissions, with tools like the GenBank Submission Portal providing real-time feedback. Submitters retain ownership of annotations but must address validation issues to proceed.[30]
In the 2020s, enhancements have streamlined annotation for high-throughput data, including support for GFF3 format uploads to accommodate next-generation sequencing (NGS) assemblies and structured evidence reporting. Standards for synthetic sequences specify the SYN division and qualifiers like /organism="synthetic construct" or /note to flag engineered elements, with validation ensuring clear distinction from natural sequences. As of 2025, the Submission Portal supports uploading feature tables for eukaryotic nuclear mRNA sequences, including coding sequences (CDS) and protein annotations. The Popset database was retired in January 2025, with submitters directed to use BioProject records, and support for experimental and inferential Third Party Annotation (TPA) sequences ended in the same month. AGP files for genome assemblies are no longer accepted; submitters are instructed to represent gaps with runs of 'N's in FASTA sequences. These updates, along with accelerated processing for specific datasets like influenza, reflect ongoing efforts to adapt to evolving genomic technologies.[21][8]
Access and Retrieval
User Interfaces and Tools
GenBank data is primarily accessed through the National Center for Biotechnology Information (NCBI) platforms, offering a suite of integrated tools for searching, viewing, and analyzing nucleotide sequences. The core interface for text-based retrieval is the Entrez Nucleotide database, which allows users to query GenBank records using accession numbers, keywords, author names, or organism filters. For example, entering an accession like "U49845" retrieves the full annotated sequence record, while a keyword search such as "human BRCA1 gene" yields relevant entries with links to related genomic and literature data.[42][1]
Graphical browsing is facilitated by the Genome Data Viewer (GDV), a web-based tool that displays GenBank sequences in a visual format, enabling users to navigate assemblies, zoom into regions, and overlay annotations like genes and variants. GDV supports exploration of eukaryotic genomes from organisms such as humans, mice, and plants, with features for comparative viewing across species. This interface is particularly useful for contextual analysis of sequence data without needing to download files.[1]
For sequence similarity searches, the Basic Local Alignment Search Tool (BLAST) integrates directly with GenBank, allowing users to input a query sequence and compare it against the nucleotide database to identify homologous regions. Options like blastn for nucleotide-to-nucleotide alignments compute statistical significance, aiding in functional inference and phylogenetic studies. BLAST results link back to original GenBank records for detailed annotation review.[43][21]
Supporting tools enhance data handling and organization. The Sequence View provides an annotated display of individual records, highlighting features such as coding regions, promoters, and references in a graphical panel embedded within Entrez results.
The Taxonomy Browser enables filtering and navigation of GenBank sequences by organismal hierarchy, from broad domains like Bacteria to specific strains, streamlining organism-specific queries. For bulk operations, Batch Entrez permits uploading lists of identifiers (up to thousands) to retrieve multiple records simultaneously, ideal for exporting subsets like all sequences from a particular study for local analysis.[44][45][46]
Programmatic access is available via the Entrez Programming Utilities (E-utilities) API, which supports scripted searches and retrievals in languages like Python or R, including functions for fetching nucleotide data by ID or term. NCBI Datasets offers an additional API and command-line interface for genome-centric queries, with redesigned taxonomy views for easier navigation. While no dedicated mobile apps exist for GenBank, the web interfaces are responsive, allowing basic searches and views on mobile devices through browsers.[47][48][21]
All these interfaces are freely accessible without login requirements for basic use, promoting open scientific collaboration, and integrate seamlessly with PubMed for linking sequences to associated publications. This no-cost model ensures broad availability to researchers worldwide.[1][49]
Data Formats and Downloads
GenBank primarily distributes its data in the flat-file format, known as the GenBank Flat File (GBFF), which structures each record with a header section, a features table for annotations, and the nucleotide or protein sequence itself. The header includes fields such as LOCUS (specifying the name, length, type, and division), DEFINITION (a brief description), ACCESSION (a unique identifier), VERSION (the accession with a version suffix, historically accompanied by a GI number), SOURCE (organism details), and REFERENCE (citation information). The features table delineates annotated elements like coding sequences (CDS), genes, and regulatory regions using a standardized vocabulary, with locations and qualifiers providing precise details such as product names or translations. This format, exemplified in sample records like accession U49845 for the Saccharomyces cerevisiae TCP1-beta gene, ensures human-readable and parseable representation of complex biological data.[8]
Alternative formats cater to specific use cases. FASTA provides a simplified, sequence-only output with a definition line starting with ">" followed by the accession and description; it carries no annotations and is well suited to alignment tools. ASN.1 (Abstract Syntax Notation One) offers a structured, binary-compatible representation for programmatic access and exchange, supporting hierarchical data like sequences and metadata in a machine-optimized way. These formats, alongside GBFF, are available for download to accommodate diverse computational needs.[1][50]
Data downloads occur via the NCBI FTP site at ftp://ftp.ncbi.nih.gov/genbank/, where full releases, issued every two months, are provided in GBFF, ASN.1, and FASTA; Release 268.0 from August 2025, for example, encompasses over 47 trillion bases and 5.9 billion records. Incremental updates, reflecting daily additions from submissions, are also accessible to minimize bandwidth usage for users tracking recent changes.
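The flat-file header layout described in this section can be read with a few lines of code. The following is a minimal sketch under stated assumptions: the embedded record is an abbreviated, illustrative excerpt modeled on the U49845 sample record rather than a copy of the live entry, the function names are hypothetical, and indented sub-keywords (such as ORGANISM) and repeated fields (such as REFERENCE) are not treated specially.

```python
# Abbreviated, illustrative excerpt in the flat-file layout, modeled on the
# U49845 sample record; a real record carries many more lines.
SAMPLE_RECORD = """\
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1
FEATURES             Location/Qualifiers
"""


def parse_header(record_text):
    """Collect the header fields that precede the FEATURES table.

    Keywords begin in the first column; indented continuation lines are
    appended to the preceding field, separated by a single space.
    """
    fields = {}
    current = None
    for line in record_text.splitlines():
        if line.startswith("FEATURES"):
            break
        if not line.strip():
            continue
        if line[0] != " ":
            current, _, rest = line.partition(" ")
            fields[current] = rest.strip()
        elif current is not None:
            fields[current] += " " + line.strip()
    return fields


def fasta_defline(fields):
    """Build a FASTA definition line from the versioned accession and description."""
    return f">{fields['VERSION'].split()[0]} {fields['DEFINITION']}"


header = parse_header(SAMPLE_RECORD)
print(header["ACCESSION"])
print(fasta_defline(header))
```

For real pipelines, an established parser such as the one in Biopython is usually a better choice than hand-written code; the sketch is meant only to make the header structure concrete.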
For targeted subsets, NCBI Datasets enables cloud-based access and downloads of genomic data across domains, supporting formats like FASTA for sequences, GFF3 for annotations, and JSON for metadata, integrated with GenBank records.[9][21][51]
As part of the International Nucleotide Sequence Database Collaboration (INSDC) with EMBL-EBI (ENA) and DDBJ, GenBank synchronizes data using the shared Feature Table format, which employs EMBL-like flat-file structures for consistent annotation exchange, including feature keys (e.g., CDS), locations, and qualifiers (e.g., /product). XML variants of this table provide machine-readable annotations, facilitating automated parsing and interoperability across the databases.[52][31]
Best practices for handling GenBank data emphasize managing file sizes (full releases often exceed 5 TB uncompressed) through the gzip compression available on the FTP site, and tracking updates via stable accession.version identifiers or legacy GI numbers rather than re-downloading entire datasets. Users are advised to verify formats against official documentation to ensure compatibility with analysis pipelines.[53][54]
Growth and Impact
Historical Growth Trends
GenBank's data volume has grown remarkably since its inception, doubling approximately every 18 months from 1982 onward, a pattern sustained by advances in sequencing technologies and increased research output.[2] This exponential trajectory reflects the broader evolution of genomics, where falling sequencing costs have democratized data generation. In the 1980s, Sanger sequencing costs were around $5–10 per base pair, limiting submissions to targeted experiments and resulting in modest accumulation. By the 2020s, costs had plummeted to less than $0.01 per base pair, enabling high-throughput projects and fueling sustained expansion.[55][56]
Early growth from the 1980s to 1990s was modest in absolute terms, rising from hundreds of thousands of bases to tens of millions as manual and early automated sequencing methods prevailed. Release 1 in 1983 contained just 0.68 million bases from 680 sequences, primarily from small-scale studies of genes and viruses.[2] By 1990, the database had reached 51 million bases across over 41,000 sequences, driven by accumulating data from molecular biology labs worldwide.
The 2000s brought a further surge with the advent of next-generation sequencing (NGS) technologies around 2005, which drastically increased throughput and reduced per-base costs.[57] The completion of the Human Genome Project in 2003, sequencing approximately 3 billion base pairs, exemplified this surge and encouraged global submissions, propelling GenBank past 100 billion bases by 2010.
The following table summarizes key milestones in GenBank releases, highlighting the scale of growth:

| Release Year | Release Number | Total Bases (approximate) | Key Driver |
|---|---|---|---|
| 1983 | 1 | 0.68 million | Initial manual sequencing efforts |
| 1990 | ~50 | 51 million | Early automation and targeted genomics |
| 2000 | 114 | 11 billion | Pre-NGS high-volume projects |
| 2010 | 178 | 108 billion | NGS adoption |
| 2022 | 250 | 1.39 trillion | Metagenomics and large-scale surveys |

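The doubling time implied by any two rows of the milestone table can be worked out from the exponential-growth relation T = Δt · ln 2 / ln(N₁/N₀). The sketch below applies that formula to the approximate figures in the table; the variable and function names are illustrative, and the results are only as precise as the rounded table values.

```python
import math

# (year, total bases) pairs taken from the milestone table above.
MILESTONES = [
    (1983, 0.68e6),
    (1990, 51e6),
    (2000, 11e9),
    (2010, 108e9),
    (2022, 1.39e12),
]


def doubling_time_years(start, end):
    """Doubling time implied by two (year, bases) points under exponential growth."""
    (year0, bases0), (year1, bases1) = start, end
    return (year1 - year0) * math.log(2) / math.log(bases1 / bases0)


# Interval-by-interval estimates, then the estimate over the full span.
for earlier, later in zip(MILESTONES, MILESTONES[1:]):
    months = 12 * doubling_time_years(earlier, later)
    print(f"{earlier[0]}-{later[0]}: doubling every {months:.1f} months")

overall = 12 * doubling_time_years(MILESTONES[0], MILESTONES[-1])
print(f"{MILESTONES[0][0]}-{MILESTONES[-1][0]}: doubling every {overall:.1f} months")
```

Comparing the per-interval estimates with the overall figure shows how sensitive a single "doubling time" is to the choice of endpoints and to which record classes a given release total includes.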