National Center for Biotechnology Information
from Wikipedia

The National Center for Biotechnology Information (NCBI)[1][2] is part of the National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). It is approved and funded by the government of the United States. The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by US Congressman Claude Pepper.

The NCBI houses a series of databases relevant to biotechnology and biomedicine and is an important resource for bioinformatics tools and services. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for biomedical literature. Other databases include the NCBI Epigenomics database. All these databases are available online through the Entrez search engine. NCBI was directed by David Lipman,[2] one of the original authors of the BLAST sequence alignment program[3] and a widely respected figure in bioinformatics.

GenBank

NCBI has been responsible for making the GenBank DNA sequence database available since 1992.[4] GenBank coordinates with individual laboratories and other sequence databases, such as those of the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ).[4]

Since 1992, NCBI has grown to provide other databases in addition to GenBank. NCBI provides the Gene database, Online Mendelian Inheritance in Man, the Molecular Modeling Database (3D protein structures), dbSNP (a database of single-nucleotide polymorphisms), the Reference Sequence Collection, a map of the human genome, and a taxonomy browser, and coordinates with the National Cancer Institute to provide the Cancer Genome Anatomy Project. The NCBI assigns a unique identifier (taxonomy ID number) to each species of organism.[5]

The NCBI has software tools that are available through web browsers or by FTP. For example, BLAST is a sequence similarity searching program. BLAST can do sequence comparisons against the GenBank DNA database in less than 15 seconds.

NCBI Bookshelf

The NCBI Bookshelf[6] is a collection of freely accessible, downloadable, online versions of selected biomedical books. The Bookshelf covers a wide range of topics including molecular biology, biochemistry, cell biology, genetics, microbiology, disease states from a molecular and cellular point of view, research methods, and virology. Some of the books are online versions of previously published books, while others, such as Coffee Break, are written and edited by NCBI staff. The Bookshelf is a complement to the Entrez PubMed repository of peer-reviewed publication abstracts in that Bookshelf contents provide established perspectives on evolving areas of study and a context in which many disparate individual pieces of reported research can be organized.[citation needed]

Basic Local Alignment Search Tool (BLAST)

BLAST is an algorithm for calculating sequence similarity between biological sequences, such as nucleotide sequences of DNA and amino acid sequences of proteins.[7] It is a powerful tool for finding sequences similar to a query sequence, within the same organism or in different organisms. BLAST searches for the query sequence on NCBI databases and servers and returns the results to the user's browser in the chosen format. Input sequences are mostly in FASTA or GenBank format, while output can be delivered in a variety of formats such as HTML, XML, and plain text. HTML is the default output format on NCBI's web page. NCBI BLAST results are presented graphically, showing all hits found, together with a table of sequence identifiers and scoring data for the hits and the alignments between the sequence of interest and each hit, with the corresponding BLAST scores.[8]
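Since FASTA is named above as a common input format, a minimal sketch of reading FASTA-formatted text may help make the format concrete. The function name `parse_fasta` and the sample records are hypothetical, and the sketch assumes the usual convention that a header line starts with `>` and the identifier is its first whitespace-delimited token. It is not part of BLAST or any NCBI toolkit.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping of record identifier to sequence (hypothetical helper)."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: take the first token after '>' as the identifier.
            parts = line[1:].split()
            current_id = parts[0] if parts else f"unnamed_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the current record, normalised to upper case.
            records[current_id] += line.upper()
    return records


# Made-up sample: one nucleotide record split over two lines, one protein record.
example = """>seq1 hypothetical example record
ATGGCC
ttgaa
>seq2
MKV
"""
records = parse_fasta(example.splitlines())
```

Here `records` maps `seq1` to the concatenated nucleotide string and `seq2` to the short protein string; description text after the identifier is discarded in this sketch.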

Entrez

The Entrez Global Query Cross-Database Search System is used at NCBI for all the major databases, such as Nucleotide and Protein Sequences, Protein Structures, PubMed, Taxonomy, Complete Genomes, OMIM, and several others.[9] Entrez is both an indexing and a retrieval system, holding data from various sources for biomedical research. NCBI distributed the first version of Entrez in 1991, composed of nucleotide sequences from PDB and GenBank; protein sequences from SWISS-PROT, translated GenBank, PIR, PRF, and PDB; and associated abstracts and citations from PubMed. Entrez is specially designed to integrate data from several different sources, databases, and formats into a uniform information model and retrieval system that can efficiently retrieve the relevant references, sequences, and structures.[10]

Gene

Gene has been implemented at NCBI to characterize and organize information about genes. It serves as a major node in the nexus of genomic map, expression, sequence, protein function, structure, and homology data. A unique GeneID is assigned to each gene record and can be followed through revision cycles. Gene records for known or predicted genes are established here and are demarcated by map positions or nucleotide sequences. Gene has several advantages over its predecessor, LocusLink, including better integration with other NCBI databases, broader taxonomic scope, and enhanced options for query and retrieval provided by the Entrez system.[11]

Protein

The Protein database maintains text records for individual protein sequences, derived from many different resources such as the NCBI Reference Sequence (RefSeq) project, GenBank, PDB, and UniProtKB/SWISS-Prot. Protein records are available in different formats, including FASTA and XML, and are linked to other NCBI resources. Protein provides users with relevant data such as genes, DNA/RNA sequences, biological pathways, expression and variation data, and literature. It also provides precomputed sets of similar and identical proteins for each sequence, as computed by BLAST. The Structure database of NCBI contains 3D coordinate sets for experimentally determined structures in PDB that are imported by NCBI. The Conserved Domain Database (CDD) contains sequence profiles that characterize highly conserved domains within protein sequences. It also has records from external resources such as SMART and Pfam. Another protein database, the Protein Clusters database, contains sets of protein sequences that are clustered according to the maximum alignments between the individual sequences as calculated by BLAST.[12]

PubChem database

The PubChem database of NCBI is a public resource for molecules and their activities against biological assays. PubChem is searchable and accessible through the Entrez information retrieval system.[13]

from Grokipedia
The National Center for Biotechnology Information (NCBI) is a division of the National Library of Medicine (NLM) within the National Institutes of Health (NIH), established in November 1988 by Public Law 100-607 to advance biomedical research through computational tools and databases. Its primary mission is to develop innovative information technologies that support the understanding of fundamental molecular and genetic processes underlying health and disease, while providing free access to biomedical and genomic data for scientists, health professionals, and the public worldwide. NCBI achieves this by maintaining over 30 interconnected repositories and knowledgebases containing more than 4.6 billion records as of 2024, including critical resources for genomic sequences, protein structures, and literature citations.

Founded amid growing needs for managing the explosion of biotechnology data in the late 1980s, NCBI was created through the convergence of legislative efforts, scientific advocacy, and NIH initiatives to centralize information services. Under the leadership of figures like Donald A.B. Lindberg, who served as NLM director, the center rapidly expanded to address challenges in data storage, retrieval, and analysis for emerging fields. Today, NCBI operates as an international hub for bioinformatics, hosting flagship databases such as GenBank for nucleotide sequences, PubMed for over 39 million biomedical literature citations, Gene for gene-centered information, and Protein for sequence and structure data, all integrated via the Entrez search system to facilitate cross-database queries.

Organizationally, NCBI is structured into specialized branches to support its multifaceted operations, including the Computational Biology Branch (CBB) for algorithm development and genomic analysis tools, the Information Engineering Branch (IEB) for software infrastructure and user interfaces, and the Information Resources Branch (IRB) for data curation and public outreach.
These units collaborate with intramural researchers and external partners to ensure the accuracy, accessibility, and timeliness of resources, which underpin advancements in areas like infectious disease tracking. Beyond databases, NCBI provides software tools such as BLAST for sequence alignment and educational programs to train users in bioinformatics, reinforcing its role as a cornerstone of global scientific infrastructure.

History and Establishment

Founding and Early Development

The National Center for Biotechnology Information (NCBI) was established in November 1988 as a division of the National Library of Medicine (NLM) within the National Institutes of Health (NIH), pursuant to the Health Omnibus Programs Extension of 1988 (Public Law 100-607), signed into law by President Ronald Reagan on November 4, 1988. This creation was driven by the need to centralize and advance the handling of rapidly expanding biomedical data amid growing recognition of biotechnology's role in health research. The legislation specifically tasked NCBI with designing, developing, and implementing automated systems for the collection, storage, retrieval, and dissemination of biotechnology information.

From its inception, NCBI's primary focus was managing the exponential growth of DNA sequence data, which had surged in the 1980s due to advancements in sequencing technologies and early initiatives that preceded the Human Genome Project, such as genome mapping efforts and international collaborations on nucleotide databases. By the late 1980s, the volume of genetic sequence information was doubling approximately every 18 months, necessitating dedicated computational infrastructure to organize and analyze this influx for researchers worldwide. NCBI began operations with a modest budget of $8 million and a staff of 12, emphasizing the development of tools for sequence analysis and database integration to support emerging fields like genomics.

A pivotal early event was the transfer of responsibility for GenBank, the premier public database of nucleotide sequences, to NCBI in October 1992. GenBank had been launched in 1982 under the management of the Los Alamos National Laboratory, in collaboration with the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ), to address the burgeoning need for a centralized repository amid rising sequencing output.
This transition enabled NCBI to provide unified oversight, enhancing data standardization and accessibility through coordinated international efforts under the International Nucleotide Sequence Database Collaboration (INSDC).

Dr. David J. Lipman served as NCBI's founding director from 1989 to 2017, playing a crucial role in shaping its core operations. Recruited from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), where he had developed early tools like the FASTA algorithm, Lipman oversaw the center's initial buildup and fostered innovations in bioinformatics software, laying the groundwork for NCBI's evolution into a key resource for biomedical data management.

Key Milestones and Expansion

In the 1990s, NCBI achieved several foundational technological advancements that solidified its role in bioinformatics. The Basic Local Alignment Search Tool (BLAST) was introduced in 1990, providing a rapid algorithm for comparing biological sequences and enabling efficient similarity searches across databases. In 1991, the Entrez system launched as an integrated search and retrieval platform, linking disparate NCBI databases and allowing users to navigate related information seamlessly. Throughout the decade, NCBI deepened its international collaboration through the International Nucleotide Sequence Database Collaboration (INSDC), partnering with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL) to synchronize sequence data exchange and standards, ensuring global consistency in genomic archiving.

The 2000s marked NCBI's expansion into open-access resources and support for large-scale initiatives. In 2000, PubMed Central (PMC) debuted as a free digital archive for full-text biomedical literature, promoting widespread dissemination of scientific publications and aligning with emerging open-access policies. Following the completion of the Human Genome Project in 2003, NCBI enhanced its genomic tools, including the release of the first reference sequence assembly, which facilitated advanced analysis and annotation of human genetic data.

From the 2010s to the 2020s, NCBI focused on clinical and pandemic-related expansions amid exponential data growth. ClinVar was publicly launched in April 2013 as an archive aggregating genomic variants and their relationships to human health, supporting clinical interpretation and research into genetic diseases. In response to the COVID-19 pandemic, NCBI introduced specialized portals in 2020, such as LitCovid for curating related literature and dedicated resources for viral genome sequences, accelerating global research efforts.
GenBank's holdings surpassed 19.6 trillion base pairs by 2023 and reached 34 trillion base pairs as of 2025, reflecting sustained annual growth driven by high-throughput sequencing technologies.

NCBI has increasingly adopted policy frameworks emphasizing accessibility and reusability. Through PMC and other initiatives, it has championed open-access mandates, requiring public funding recipients to deposit research outputs freely. More recently, NCBI has integrated FAIR (Findable, Accessible, Interoperable, Reusable) data principles into its operations, enhancing metadata standards to support reproducible biomedical research. Under the leadership of figures like David Lipman, who served as director from 1989 to 2017, these developments have driven NCBI's evolution into a cornerstone of global scientific infrastructure.
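The two GenBank size figures quoted above (19.6 trillion base pairs by 2023 and 34 trillion as of 2025) can be turned into an implied doubling time with a short calculation. The sketch below assumes steady exponential growth and an elapsed interval of exactly two years; both are simplifying assumptions rather than figures from NCBI, and the function name is hypothetical.

```python
import math


def doubling_time_years(start_size: float, end_size: float, elapsed_years: float) -> float:
    """Doubling time implied by exponential growth between two measurements."""
    if start_size <= 0 or end_size <= start_size or elapsed_years <= 0:
        raise ValueError("expected positive sizes, growth, and a positive interval")
    # For size(t) = size(0) * 2**(t / T), solve for T.
    return elapsed_years * math.log(2) / math.log(end_size / start_size)


# Figures quoted in the text; the two-year interval is an assumption.
implied = doubling_time_years(19.6e12, 34e12, 2.0)
print(f"implied doubling time: {implied:.2f} years ({implied * 12:.0f} months)")
```

Under these assumptions the implied doubling time comes out at roughly two and a half years. The result is sensitive to the exact measurement dates, so it should be read as an order-of-magnitude check only.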

Organizational Structure

Position within NIH and NLM

The National Center for Biotechnology Information (NCBI) operates as a division within the National Library of Medicine (NLM), which is one of the 27 institutes and centers that constitute the National Institutes of Health (NIH). The NIH itself falls under the U.S. Department of Health and Human Services, forming a key component of the federal government's biomedical research infrastructure. This hierarchical placement positions NCBI to leverage NIH's broad research ecosystem while focusing on information services through NLM's framework.

NLM's integration into NIH occurred during a 1968 reorganization of the U.S. Public Health Service, which elevated NLM from an independent bureau to a formal NIH component to better align resources with expanding biomedical research needs. NCBI was specifically authorized by Public Law 100-607 in 1988, establishing it as a dedicated center under NLM to develop automated systems for managing and disseminating data. This legal foundation ensures NCBI's role in coordinating information across federal health agencies.

Funding for NCBI derives primarily from annual federal appropriations to NIH, channeled through NLM's budget, which totaled approximately $472 million in 2023. Additional support comes via targeted NIH grants for collaborative projects. NCBI maintains close ties with NLM's core library functions, such as cataloging and access to biomedical literature, enhancing integrated data retrieval. It also partners with other NIH entities, including the National Human Genome Research Institute (NHGRI), on genomics efforts, including platforms for secure data sharing and analysis.

Divisions and Leadership

The National Center for Biotechnology Information (NCBI) is organized into three primary branches that handle distinct aspects of its operations: the Computational Biology Branch (CBB), the Information Engineering Branch (IEB), and the Information Resources Branch (IRB). The CBB focuses on developing algorithms and tools for sequence analysis, protein structure and function prediction, chemical informatics, and genome assembly, enabling advanced computational approaches to biological data interpretation. The IEB is responsible for engineering software and infrastructure to support database management, search functionalities, and user interfaces, including services for macromolecular structures and protein domains. The IRB oversees data curation, annotation, and maintenance of core resources like GenBank, ensuring high-quality, standardized biological information for public access.

Leadership at NCBI is currently provided by Acting Director Kim D. Pruitt, Ph.D., who guides the center's strategic direction and integration of production services. Deputy Director Heidi J. Sofia, Ph.D., supports operational oversight, with an administrative team including Officer Timothy Valin. NCBI employs approximately 500 staff members, organized into interdisciplinary teams that address complex challenges in biomedical data handling.

Program oversight is provided by the Board of Scientific Counselors, an advisory body that reviews research activities, evaluates scientific progress, and recommends priorities to enhance NCBI's contributions to the health sciences. This structure fosters collaboration among biologists, informaticians, and librarians to develop and maintain data standards, ensuring interoperability across NCBI's extensive resources within the broader NIH framework.

Mission and Objectives

Core Goals

The National Center for Biotechnology Information (NCBI) was established with a stated mission to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. This mission emphasizes the creation of innovative tools and systems that enable researchers worldwide to explore genomic, proteomic, and biomedical data efficiently. By focusing on technology-driven solutions, NCBI aims to bridge gaps in scientific discovery, ensuring that complex biological information is accessible and analyzable to advance knowledge of disease mechanisms and therapeutic interventions.

A primary goal of NCBI is to provide free public access to a vast array of biomedical and genomic data, democratizing scientific resources and eliminating financial barriers for researchers, clinicians, and the public. This commitment supports global collaboration by maintaining open repositories such as GenBank and PubMed, where users can retrieve and utilize data without subscription fees. Additionally, NCBI promotes data sharing by encouraging submissions to its databases, fostering a culture of openness that accelerates collective progress in genetics.

To support research, NCBI develops and maintains standardized data formats and protocols, ensuring consistency and compatibility across diverse datasets and tools. Examples include the ASN.1-based structures for genomic annotations, which facilitate seamless integration and analysis in computational workflows. Strategic priorities include enhancing programmatic access through application programming interfaces (APIs), such as the E-utilities, which reach multiple databases and enable automated querying for large-scale studies. NCBI also addresses challenges posed by the exponential growth in biological data volumes, now exceeding petabytes, by investing in scalable infrastructure and efficient retrieval systems to maintain performance and reliability.
These objectives align closely with the broader National Institutes of Health (NIH) goals, particularly in supporting translational research that moves discoveries from laboratory benches to clinical applications at the bedside. By providing reliable data resources and tools, NCBI contributes to NIH's mission of uncovering knowledge to improve human health, emphasizing the integration of basic science with practical health outcomes.

Impact on Biomedical Research

The National Center for Biotechnology Information (NCBI) has profoundly influenced biomedical research by providing essential genomic and bibliographic resources that underpin major scientific breakthroughs. For instance, NCBI's GenBank database facilitated the discovery and characterization of the CRISPR-Cas system, a revolutionary genome-editing technology, by offering researchers access to vast bacterial sequence data that revealed the repeated DNA motifs and associated genes critical to its function. Similarly, during the COVID-19 pandemic, the rapid upload of the first SARS-CoV-2 genome to GenBank, publicly released on January 12, 2020, enabled global scientists to sequence variants, develop diagnostics, and design vaccines, accelerating the response to the outbreak. These contributions highlight NCBI's role in transforming raw biological data into actionable insights that drive innovation in infectious disease research and beyond.

NCBI's resources demonstrate immense scale and utility, with millions of daily visitors to its website and over 115 terabytes of data downloaded each day as of 2023, reflecting widespread adoption by the global research community. These databases serve as foundational references in high-impact publications and support work across disciplines. Economically, NCBI's open-access model amplifies the value of U.S. taxpayer investments in biomedical research; as part of the National Institutes of Health (NIH), it contributes to an estimated $92.89 billion in annual economic activity generated by NIH-funded efforts as of 2023, including job creation and advancements in related industries.

By integrating disparate data sources through systems like Entrez, NCBI has addressed key challenges such as data silos in biomedical research, enabling seamless cross-referencing of genomic, chemical, and literature data to foster interdisciplinary research.
This integration promotes equity by providing free, unrestricted access to high-quality resources for researchers worldwide, including those in low- and middle-income countries who might otherwise lack such tools, thereby democratizing scientific progress and reducing global disparities in research capacity. NCBI has also shaped open science policies, notably through PubMed Central's implementation of the Bethesda Statement on Open Access Publishing (2003), which called for immediate free distribution of publicly funded biomedical research and influenced subsequent NIH mandates for public access to peer-reviewed articles. This advocacy has advanced equitable knowledge dissemination, reinforcing principles of transparency and collaboration in global health research.

Major Databases

GenBank

GenBank serves as the National Center for Biotechnology Information's (NCBI) primary repository for nucleotide sequences, functioning as a comprehensive, annotated archive of publicly available DNA and RNA sequences from diverse organisms. Established in 1982 under the auspices of the National Institutes of Health (NIH), it was initially developed to centralize genetic data for scientific access and has since evolved into a cornerstone of molecular biology research.

The database hosts a vast array of content, including raw and annotated sequences accompanied by rich metadata such as taxonomic information, sequence length, collection dates, and biological functions like gene products or regulatory elements. These records are distributed in the GenBank flat file format, which includes the sequence itself alongside structured annotations for easy parsing and analysis by researchers and computational tools. As of August 2025, GenBank encompasses over 258 million records, reflecting its growth, with the number of bases doubling approximately every 18 months since inception. Recent updates include accelerated processing of sequences to support timely responses.

Sequence submissions to GenBank are processed through dedicated tools designed for accessibility and validation, including BankIt, a web-based wizard for straightforward entries, and a versatile standalone program for handling complex datasets with advanced annotation capabilities. As a key member of the International Nucleotide Sequence Database Collaboration (INSDC), GenBank coordinates with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) to synchronize data exchanges, ensuring non-redundant global coverage and adherence to unified submission standards. GenBank undergoes bi-monthly full releases, typically on the 15th of February, April, June, August, October, and December, incorporating newly submitted sequences and revisions.
Annotations within GenBank records adhere to the Feature Table format, a standardized specification jointly maintained by INSDC partners to encode biological features, such as coding regions, promoters, and repeats, in a machine-readable, tabular structure that supports interoperability across databases. GenBank data is accessible via NCBI's Entrez system for integrated searching across related resources.
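To illustrate what a machine-readable, tabular feature annotation looks like in practice, here is a minimal sketch that reads the FEATURES block of a flat-file record. It assumes the conventional flat-file layout (feature key starting in column 6, location and `/qualifier` lines starting in column 22). The sample record is invented, the function name is hypothetical, and multi-line locations or qualifier values are deliberately not handled.

```python
from typing import Dict, Iterable, List


def parse_features(lines: Iterable[str]) -> List[Dict]:
    """Parse feature keys, locations, and single-line qualifiers (simplified)."""
    features: List[Dict] = []
    for line in lines:
        if not line.strip() or line.startswith("FEATURES"):
            continue  # skip blanks and the section header
        key = line[5:20].strip()    # columns 6-20 hold the feature key
        body = line[21:].strip()    # column 22 onward holds location or qualifier
        if key:
            features.append({"key": key, "location": body, "qualifiers": {}})
        elif body.startswith("/") and features:
            # Qualifier line of the form /name="value" (or a bare /name flag).
            name, _, value = body[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
        # Any other continuation line is ignored in this sketch.
    return features


# Invented sample, built with explicit padding so the columns are exact.
sample = [
    "FEATURES             Location/Qualifiers",
    "     " + "source".ljust(16) + "1..1200",
    " " * 21 + '/organism="Escherichia coli"',
    "     " + "CDS".ljust(16) + "join(10..200,300..450)",
    " " * 21 + '/gene="abc"',
    " " * 21 + '/product="hypothetical protein"',
]
features = parse_features(sample)
```

Each parsed feature keeps its key, its location string unchanged (for example a `join(...)` expression), and a dictionary of qualifiers. For real records a maintained parser is the safer choice, since this sketch skips wrapped lines.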

PubChem

PubChem is a comprehensive database developed and maintained by the National Center for Biotechnology Information (NCBI), serving as a central repository for information on chemical molecules and their biological activities. Launched on September 16, 2004, as part of the National Institutes of Health's Molecular Libraries Roadmap Initiative, PubChem has grown into the world's largest freely accessible collection of chemical data. As of November 2025, it contains over 122 million unique compounds, 339 million substances, and 298 million bioactivity records, reflecting its expansive role in aggregating and standardizing chemical information for global scientific use. Recent enhancements include the addition of literature co-occurrence data to support literature-based exploration.

The database is structured into three primary components: PubChem Compound, PubChem Substance, and PubChem BioAssay. PubChem Compound focuses on unique chemical structures, providing standardized representations of molecules with identifiers such as SMILES, InChI, and molecular formulas, along with computed properties like molecular weight and logP values. PubChem Substance captures depositor-provided data, including raw experimental records and mixtures from various submitters, preserving the original context without normalization. PubChem BioAssay stores results from screening experiments, detailing biological activities such as binding affinities, inhibitory concentrations (IC50), and efficacy outcomes against specific targets, enabling researchers to explore structure-activity relationships.

Data in PubChem originate from diverse sources, including direct user submissions, patents, and curated databases, with contributions from over 1,000 providers worldwide. Notable integrations include records from ChEBI for ontology-based chemical entity classifications, enhancing cross-referencing and interoperability with other resources.
These sources ensure a broad coverage of small molecules, natural products, and bioactive compounds, with ongoing curation to maintain data quality and resolve redundancies.

Key features of PubChem include interactive tools for visualizing 3D conformers generated via computational modeling, which aid in understanding molecular interactions, as well as predictive models for endpoints like acute oral toxicity and mutagenicity based on algorithms trained on experimental data. The database supports similarity searching, substructure matching, and bioactivity filtering to facilitate cheminformatics analyses. PubChem exhibits steady annual growth of 10–20% in records, driven by new depositions and source expansions, such as the addition of over 130 contributors in recent years, ensuring its relevance in evolving biomedical research. Users can access PubChem through the Entrez search system for integrated querying across NCBI resources.
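As a small illustration of a computed property such as molecular weight, the sketch below derives an approximate weight from a simple molecular formula. It is an illustrative calculation only, not PubChem's method. The atomic-mass table is a deliberately short set of rounded standard values, the function name is hypothetical, and formulas with parentheses, charges, or isotopes are not supported.

```python
import re
from typing import Dict

# Rounded standard atomic masses for a few common elements (illustrative subset).
ATOMIC_MASS: Dict[str, float] = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "P": 30.974, "S": 32.06, "Cl": 35.45,
}


def molecular_weight(formula: str) -> float:
    """Sum atomic masses for a flat formula such as 'C9H8O4'."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula syntax: {formula!r}")
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in ATOMIC_MASS:
            raise ValueError(f"element not in the illustrative table: {symbol}")
        # A missing count means a single atom of that element.
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


aspirin = molecular_weight("C9H8O4")  # aspirin's molecular formula
water = molecular_weight("H2O")
print(f"aspirin: {aspirin:.3f} g/mol, water: {water:.3f} g/mol")
```

With these rounded masses aspirin comes out near 180.16 g/mol and water near 18.02 g/mol. Values published by a curated database may differ in the last digits because they use more precise atomic masses.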

Gene and Protein Databases

The Gene database serves as a central repository for curated gene information, encompassing over 2 million entries across thousands of species, with detailed annotations including official gene symbols, synonyms, aliases, and summaries of expression patterns derived from various experimental sources. Launched in 2000 as an evolution of earlier resources like LocusLink, it emphasizes well-studied organisms and integrates data from model organism databases to provide functional insights into gene roles, interactions, and evolutionary relationships.

Complementing the Gene database, the Protein database aggregates more than 200 million protein sequences, primarily derived from translations of annotated nucleotide coding regions, with comprehensive annotations for three-dimensional structures, functional domains, and motifs. Entries include cross-references to external resources for additional protein family and pathway information, enabling researchers to explore protein function, localization, and post-translational modifications.

These databases are tightly integrated to support seamless navigation between genetic and proteomic data; for instance, Gene records link directly to corresponding Protein entries via stable identifiers, allowing users to trace gene-to-protein relationships and examine how sequence variations influence protein products. Visualization tools like the Genome Data Viewer further enhance this integration by displaying gene loci, protein alignments, and associated annotations within assembled genomes. The databases undergo weekly updates to incorporate new submissions and annotations, with enhanced emphasis on priority model organisms, including hyperlinks to the Online Mendelian Inheritance in Man (OMIM) database for associating genes with hereditary diseases and phenotypes.

Search and Retrieval Systems

Entrez

Entrez is the National Center for Biotechnology Information's (NCBI) primary text-based search and retrieval system, designed to integrate and provide unified access to a diverse array of over 30 genomic and literature databases, including resources for DNA and protein sequences, 3D structures, and biomedical publications. Launched in 1991 as a CD-ROM-based tool for cross-database querying, it has evolved into a web-accessible platform that facilitates discovery by linking related data across NCBI's resources, such as literature from PubMed and genomic sequences from GenBank. This integration allows users to perform comprehensive searches that span multiple domains of biomedical information without switching between individual database interfaces.

Key features of Entrez include the E-utilities (API), which enables programmatic access for automated querying, retrieval, and linking across databases through a set of server-side utilities like ESearch for querying and EFetch for record retrieval. Additionally, the LinkOut service supports direct connections to external resources, such as full-text articles or institutional holdings, enhancing the system's utility by bridging NCBI data with third-party content. Users can use search history to track and combine previous queries within a session, and collections via My NCBI to save and organize results for future reference, streamlining workflows for repeated or complex investigations.

Entrez employs text-based indexing algorithms that process database records using controlled vocabularies, including Medical Subject Headings (MeSH) for literature, to improve search precision and relevance. It supports advanced querying with Boolean operators (AND, OR, NOT) and field-specific tags, allowing users to construct sophisticated expressions, such as limiting results to titles or authors in specific databases. These capabilities ensure efficient navigation through vast datasets, with results often displayed in a unified interface showing hits across linked databases.
The system handles a substantial volume of global queries, supporting millions of daily searches by researchers, clinicians, and educators, and is optimized for accessibility, including integration with mobile-responsive designs for on-the-go use. This widespread adoption underscores Entrez's role as a foundational tool for biomedical data exploration, with ongoing updates to accommodate expanding NCBI resources.
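
As a rough illustration of the programmatic access that the E-utilities provide, the following Python sketch builds request URLs for ESearch, EFetch, and the related ELink utility. It only constructs URLs and sends nothing over the network. The helper names (`esearch_url`, `efetch_url`, `elink_url`, `query_params`) are made up for this example, and the search term and record IDs are illustrative inputs rather than values taken from this article.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL of the public E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """URL for ESearch: find record IDs in `db` that match `term`."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, ids, rettype, retmode="text"):
    """URL for EFetch: retrieve records for a list of IDs."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def elink_url(dbfrom, db, ids):
    """URL for ELink: find records in `db` linked to IDs from `dbfrom`."""
    params = {"dbfrom": dbfrom, "db": db, "id": ",".join(ids)}
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


def query_params(url):
    """Decode a URL's query string back into a plain dict (for inspection)."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


if __name__ == "__main__":
    # Boolean operator plus field tags, as described in the text above.
    print(esearch_url("gene", "BRCA1[Gene Name] AND Homo sapiens[Organism]", retmax=5))
    # Follow Gene-to-Protein links for one (illustrative) Gene ID.
    print(elink_url("gene", "protein", ["672"]))
```

A typical workflow would pass the IDs returned by an ESearch call into `efetch_url` or `elink_url`; actually issuing the requests is left out here on purpose.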

PubMed

PubMed serves as the National Center for Biotechnology Information's (NCBI) primary database for biomedical literature, offering free access to over 39 million citations drawn primarily from MEDLINE, alongside content from life science journals and online books. Established as an interface in 1996, it includes abstracts for the majority of entries and provides links to full-text versions through publisher sites or PubMed Central (PMC). The MEDLINE core supports retrieval of scholarly articles dating back to 1946, with earlier historical coverage from predecessor indexes.

The database evolved from the printed Index Medicus, a monthly bibliography initiated by the National Library of Medicine in 1879 to catalog medical publications, which moved to electronic form in 1966 and was later integrated into PubMed for digital accessibility. This shift replaced manual indexing with computerized searching to support global biomedical research.

PubMed offers advanced filters for refining results by criteria such as publication date, author, article type (e.g., review), and language. The My NCBI personalization tool allows users to save searches, create citation collections, set email alerts for new results, and customize display preferences across NCBI resources. Integration with PMC provides access to over 10 million open-access full-text articles. Updates occur daily, incorporating thousands of new citations. PubMed's Clinical Queries interface applies specialized filters to prioritize high-quality clinical studies, systematic reviews, and topic-specific findings, aiding clinicians and researchers in decision-making.

The platform's scalability was evident during the COVID-19 pandemic, when it indexed over 100,000 related publications in 2020 alone, underscoring its critical role in accelerating global responses to health crises. As part of the Entrez system, PubMed enables cross-database querying for integrated literature and data exploration.
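
The filters described above can also be written directly into a query string using PubMed's field tags. The sketch below assembles such a string from small helper functions. The helper names are invented for this example. The tags used (`[ti]` for title, `[pt]` for publication type, `[la]` for language, `[dp]` for date of publication) are standard PubMed search tags, while the specific search terms are only illustrative.

```python
def tagged(term, field):
    """Attach a PubMed field tag to a term, e.g. tagged("review", "pt")."""
    return f"{term}[{field}]"


def any_of(*clauses):
    """Combine clauses with OR, parenthesized so they group correctly."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Combine clauses with AND."""
    return " AND ".join(clauses)


def date_range(start, end):
    """Restrict by publication date; dates are written as YYYY/MM/DD."""
    return f"{start}:{end}[dp]"


query = all_of(
    any_of(tagged("COVID-19", "ti"), tagged("SARS-CoV-2", "ti")),
    tagged("review", "pt"),
    tagged("english", "la"),
    date_range("2020/01/01", "2020/12/31"),
)
print(query)
```

The resulting string can be pasted into the PubMed search box or passed as the search term of a programmatic query.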

Analytical Tools

BLAST

The Basic Local Alignment Search Tool (BLAST) is a cornerstone bioinformatics program developed in 1990 by Stephen F. Altschul and colleagues at the National Center for Biotechnology Information (NCBI) for rapidly identifying regions of local similarity between biological sequences. Designed to approximate optimal alignments more efficiently than exhaustive methods like Smith-Waterman, BLAST has become one of the most widely used tools in bioinformatics, with its original publication garnering over 100,000 citations. It supports comparisons of nucleotide or protein query sequences against large databases, such as nt for nucleotides and nr for proteins, to detect potential homologs or functional similarities.

At its core, BLAST employs a heuristic approach to balance speed and sensitivity. It begins with an indexing phase that scans for short, exact matches known as "seeds" or "words", typically 11 nucleotides for DNA or 3 amino acids for proteins. These seeds serve as starting points for extension: the algorithm extends alignments in both directions from each seed using a scoring matrix (e.g., BLOSUM62 for proteins) and stops once the running score falls a set amount below the best score seen so far, discarding non-significant extensions to avoid exhaustive computation. This seed-and-extend strategy reduces search times from days to seconds for large datasets, though it may occasionally miss distant similarities.

Key variants include BLASTN for nucleotide-to-nucleotide searches, optimized for high-throughput genomic queries, and BLASTP for protein-to-protein comparisons, which incorporate evolutionary substitution matrices for greater sensitivity to conserved regions. BLAST is accessible via a web interface on the NCBI website, where users input sequences and select databases, programs, and parameters such as the E-value threshold. The E-value is a statistical measure of the number of alignments with an equal or better score expected by chance (e.g., E < 0.001 for stringent hits).
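
The seed-and-extend idea can be shown with a deliberately small Python sketch. This is not NCBI's implementation: it handles only ungapped DNA matches, uses an arbitrary word size of 4 and arbitrary match/mismatch scores, and stops extending when the running score falls more than `x_drop` below the best score seen. All function names and parameter values are assumptions made for illustration.

```python
from collections import defaultdict


def index_words(subject, w):
    """Map every length-w word in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def extend(query, subject, qpos, spos, w, match=1, mismatch=-3, x_drop=5):
    """Ungapped extension of an exact seed, first to the right, then to the left.

    Returns (score, query_start, query_end, subject_start), end exclusive.
    """
    best = w * match  # score of the exact seed itself
    run, best_right, i = best, 0, 0
    while qpos + w + i < len(query) and spos + w + i < len(subject):
        run += match if query[qpos + w + i] == subject[spos + w + i] else mismatch
        if run > best:
            best, best_right = run, i + 1
        elif best - run > x_drop:
            break  # score dropped too far below the best: give up
        i += 1
    run, best_left, i = best, 0, 1
    while qpos - i >= 0 and spos - i >= 0:
        run += match if query[qpos - i] == subject[spos - i] else mismatch
        if run > best:
            best, best_left = run, i
        elif best - run > x_drop:
            break
        i += 1
    return best, qpos - best_left, qpos + w + best_right, spos - best_left


def seed_and_extend(query, subject, w=4, min_score=6):
    """Find seeds, extend each one, and keep distinct hits above min_score."""
    index = index_words(subject, w)
    hits = set()
    for qpos in range(len(query) - w + 1):
        for spos in index.get(query[qpos:qpos + w], []):
            hit = extend(query, subject, qpos, spos, w)
            if hit[0] >= min_score:
                hits.add(hit)  # seeds on the same diagonal collapse to one hit
    return sorted(hits, reverse=True)


print(seed_and_extend("CCGATTACACC", "TTTTGATTACAGGGG"))
```

In this toy run, four different seeds all extend to the same shared segment `GATTACA`, which is why storing hits in a set leaves a single result.
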

For advanced use, standalone software such as the BLAST+ suite allows local installations on Unix, Windows, or Mac systems, enabling customized database searches and integration into pipelines. Outputs include alignment visualizations, bit scores, percent identities, and taxonomic breakdowns, aiding interpretation of results.

In practice, BLAST facilitates annotation by aligning unknown sequences to annotated references and inferring functions from high-scoring matches, and it supports phylogenetic studies by identifying homologs for tree construction and evolutionary inference. Notable updates include DELTA-BLAST (2012), which enhances remote homolog detection by first building a position-specific scoring matrix (PSSM) from conserved domain matches before running a standard BLASTP search, improving sensitivity for annotation without sacrificing speed. More recent enhancements include the release of BLAST+ 2.17.0 in July 2025 for improved performance in protein searches and the adoption of ClusteredNR as the default protein database in August 2025, providing faster and more representative results by reducing redundancy. These applications have profoundly influenced biomedical research, from genome assembly to target identification.
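
The relationship between a raw alignment score, the bit score, and the E-value follows the Karlin-Altschul statistics that underlie BLAST: E = K·m·n·e^(−λS), and the bit score is S' = (λS − ln K) / ln 2, so that E = m·n·2^(−S'). The Python sketch below evaluates these formulas. The values of `LAMBDA` and `K` are illustrative constants of the kind reported for gapped protein searches and are an assumption here; real BLAST also applies effective-length corrections that this sketch omits.

```python
import math

# Illustrative statistical parameters (assumed, not taken from this article).
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalize a raw alignment score so it no longer depends on the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    m, n = 300, 50_000_000  # query length and database length (made-up sizes)
    for raw in (40, 80, 120):
        print(raw, round(bit_score(raw), 1), f"{e_value(raw, m, n):.2e}")
```

Because the search-space term m·n appears in the formula, the same alignment yields a larger E-value against a larger database, which is one reason E-values from searches against different databases are not directly comparable.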

Other Computational Tools

In addition to foundational sequence alignment tools like BLAST, the NCBI provides a suite of specialized computational tools for tasks such as PCR primer design, 3D molecular structure visualization, vector contamination screening, and genome analysis. These tools are designed to support diverse aspects of biomedical research, from experimental design to data quality assurance and analysis, and are accessible via web interfaces or downloadable software.

Primer-BLAST is an integrated tool that combines primer design with specificity checking to generate primers tailored for polymerase chain reaction (PCR) amplification of specific DNA targets. It employs the Primer3 algorithm for initial primer selection and incorporates BLAST searches against NCBI nucleotide databases to ensure the primers do not amplify unintended sequences, thereby minimizing off-target effects in experiments. This functionality is particularly useful for researchers designing probes for gene expression studies or cloning, and the tool supports input of template sequences, amplicon size specifications, and organism-specific databases for enhanced precision.

Cn3D, short for "see in 3D," serves as a viewer for molecular structures and sequence alignments derived from NCBI's Molecular Modeling Database (MMDB) and related resources. It enables users to interactively explore 3D models of proteins, nucleic acids, and their complexes alongside aligned sequences, highlighting features like secondary structures, domains, and evolutionary relationships through Vector Alignment Search Tool (VAST) alignments. The software facilitates educational and research applications by allowing rotations, zooms, and annotations of structures, with support for importing PDB files and exporting images or data. Available as a free desktop application for Windows, Cn3D includes tutorials to guide users in retrieving and visualizing structures from NCBI.
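
To make the two-stage idea behind Primer-BLAST concrete (pick candidate primers, then check that they bind only the intended target), here is a minimal Python sketch. It does not use Primer3 or BLAST. As plainly simpler stand-ins, it estimates melting temperature with the Wallace rule (2 °C per A/T, 4 °C per G/C) and checks specificity by exact string matching on both strands. All function names and sequences are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(primer):
    """Fraction of bases that are G or C."""
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees Celsius (Wallace rule, short oligos only)."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def binding_sites(primer, template):
    """Exact-match binding sites as (strand, position) pairs on the template."""
    sites = set()
    for strand, probe in (("+", primer), ("-", reverse_complement(primer))):
        start = template.find(probe)
        while start != -1:
            sites.add((strand, start))
            start = template.find(probe, start + 1)
    return sorted(sites)


def is_specific(primer, target, off_targets):
    """True if the primer binds the target exactly once and no off-target at all."""
    if len(binding_sites(primer, target)) != 1:
        return False
    return not any(binding_sites(primer, seq) for seq in off_targets)


primer = "ATGCGTACGTTAGC"
target = "GGG" + primer + "TTT"
print(gc_fraction(primer), wallace_tm(primer), is_specific(primer, target, ["CCCCCCCC"]))
```

Exact matching is far stricter than a real specificity search, which must also tolerate mismatches; the sketch only shows where such a check sits in the workflow.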

VecScreen is a contamination detection tool that scans sequences for segments originating from vectors, adapters, linkers, or PCR primers, using the UniVec database, a curated, non-redundant collection of common vector elements. It reports potential contaminants with details on match strength, position, and type, helping submitters clean sequences before deposition to GenBank, where VecScreen results are automatically reviewed as part of the validation process. This integration helps maintain the integrity of public sequence repositories by flagging issues like residual vector sequences that could confound downstream analyses. The web-based tool processes queries rapidly and provides guidelines for interpreting and editing results.

The NCBI Genome Workbench was a comprehensive desktop application for genome analysis and visualization, with particular support for next-generation sequencing (NGS) data formats like BAM and FASTQ. It allowed users to perform tasks such as aligning reads to reference genomes, building contigs, and preparing submissions to GenBank through an integrated wizard, making it valuable for large-scale projects. Although it was retired in March 2024 and downloads have ceased, existing installations remain functional, and its features have influenced subsequent NCBI tools for data handling. Tutorials and documentation were provided to assist in workflows involving multiple sequence alignments and graphical views.

All these tools are offered free of charge, with web versions enabling immediate access without installation for most functionalities, while desktop options like Cn3D provide offline capabilities. NCBI maintains extensive tutorials, user guides, and webinars to promote effective use across skill levels, ensuring broad accessibility for academic, clinical, and industrial researchers.
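
The kind of report VecScreen produces (which stretch of a sequence looks like vector) can be imitated with a toy screen. The Python sketch below is a simplification, not VecScreen's method: it flags query positions covered by exact k-mers shared with a small vector library and merges overlapping windows into spans. The library entry `toy_vector`, its sequence, and the function names are hypothetical.

```python
def vector_kmers(vectors, k):
    """Collect every length-k word that occurs in any vector sequence."""
    return {
        seq[i:i + k]
        for seq in vectors.values()
        for i in range(len(seq) - k + 1)
    }


def flag_contamination(query, vectors, k=8):
    """Return merged (start, end) spans of the query covered by vector k-mers.

    Spans use 0-based starts and exclusive ends.
    """
    kmers = vector_kmers(vectors, k)
    spans = []
    for i in range(len(query) - k + 1):
        if query[i:i + k] in kmers:
            if spans and i <= spans[-1][1]:
                spans[-1][1] = i + k  # overlaps or touches the previous span
            else:
                spans.append([i, i + k])
    return [tuple(span) for span in spans]


# Hypothetical vector library and a query with a vector fragment in the middle.
library = {"toy_vector": "GAATTCGAGCTCGGTACC"}
query = "ATATATATAT" + "GAATTCGAGCTCGG" + "CACACACACA"
print(flag_contamination(query, library))
```

A submitter would trim or mask the flagged span before deposition; a real screen would also need to handle partial and inexact matches, which exact k-mers miss.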

Educational and Literature Resources

NCBI Bookshelf

The NCBI Bookshelf is a free online resource providing access to full-text books and documents in the biomedical and life sciences. Launched in 1999 with the third edition of Molecular Biology of the Cell as its inaugural title, it is intended to support education and clinical practice. The collection has expanded significantly since its inception and, as of 2024, encompasses more than 13,000 titles, including peer-reviewed monographs, NCBI-authored guides, reports, clinical guidelines, systematic reviews, reference books, technical reports, and web materials.

Content on the Bookshelf is curated to emphasize high-quality, educational resources, with a focus on open-access materials that are freely available without subscription barriers. Users can search the full text across the entire collection or within specific books, facilitating discovery across a wide range of topics. Notable examples include seminal texts like Molecular Biology of the Cell by Alberts et al., which exemplifies the platform's role in disseminating foundational knowledge in cell biology.

Key features enhance usability and integration with broader scientific workflows: chapters and sections can be downloaded in PDF format for offline access, while built-in annotation tools allow users to highlight and note key passages. The platform integrates with PubMed, enabling direct links from citations in Bookshelf content to related journal articles for expanded literature exploration. This connectivity positions Bookshelf as a bridge between book-length resources and database searches.

The collection continues to grow through annual additions, with content sourced via publisher deposits (45% of new materials), funder deposits (45%), and conversion projects (10%) that digitize legacy resources. This sustained expansion ensures the Bookshelf remains a vital, up-to-date hub for open-access educational materials in the biomedical domain.

Taxonomy and Reference Sequences

The NCBI Taxonomy database serves as a central repository for the nomenclature and classification of organisms represented in public sequence databases, encompassing 2,706,727 taxa across archaea, bacteria, eukaryotes, and viruses as of 2025. This curated resource, maintained in collaboration with the International Nucleotide Sequence Database Collaboration (INSDC), provides standardized names and phylogenetic lineages to support the annotation of all molecular data in NCBI's databases, including GenBank, ensuring uniform identification and organization of biological entities. By assigning unique taxonomy identifiers (TaxIDs) to each entry, the database facilitates precise cross-referencing and integration of genomic, proteomic, and other sequence-based information.

Complementing the Taxonomy database, the Reference Sequence (RefSeq) project offers a non-redundant, curated collection of genomic DNA, transcript, and protein sequences for major organisms, serving as a stable foundation for comparative and functional genomics. Unlike GenBank, which archives all submitted sequences without curation, RefSeq selects and annotates representative sequences to minimize redundancy and enhance reliability, with ongoing updates based on experimental validation and community input. A key component, RefSeqGene, provides standardized genomic sequences for well-characterized human gene loci, aiding in clinical genomics and variant interpretation by defining reference standards for over 20,000 human genes.

Key features of the Taxonomy database include the Taxonomy Browser, an interactive tool for exploring hierarchical structures, viewing detailed lineage reports, and accessing phylogenetic trees derived from molecular data. Users can track annual updates to scientific names, synonyms, and classifications, which reflect evolving taxonomic consensus from expert curators and external authorities like the International Committee on Taxonomy of Viruses (ICTV). These taxonomy assignments link sequences directly to organismal classifications, supporting tools like BLAST.

These resources are essential for maintaining consistency in cross-database linkages within NCBI, enabling navigation between genomic records and analytical tools while accommodating yearly revisions that incorporate new phylogenetic insights. For instance, recent updates have introduced new ranks for prokaryotes and refined viral classifications, preventing discrepancies in data interpretation across global research efforts.
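
Because every taxon has a TaxID and a single parent, a lineage is simply the chain of parent links up to the root. The Python sketch below shows this with a tiny hand-written parent table. The table is a heavily pruned assumption: the TaxIDs are real NCBI identifiers, but intermediate ranks are skipped, so the lineages shown are shorter than the real ones. The function names are illustrative.

```python
# Child TaxID -> parent TaxID. Simplified: intermediate ranks are omitted.
PARENT = {1: 1, 9604: 1, 9605: 9604, 9606: 9605, 9596: 9604, 9598: 9596}

NAMES = {
    1: "root",
    9604: "Hominidae",
    9605: "Homo",
    9606: "Homo sapiens",
    9596: "Pan",
    9598: "Pan troglodytes",
}


def lineage(taxid):
    """Walk parent links from a taxon up to the root (which is its own parent)."""
    path = [taxid]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(a, b):
    """Most specific taxon that appears in the lineages of both a and b."""
    ancestors = set(lineage(a))
    for node in lineage(b):
        if node in ancestors:
            return node
    raise ValueError("taxa do not share a root")


print([NAMES[t] for t in lineage(9606)])
print(NAMES[lowest_common_ancestor(9606, 9598)])
```

The same parent-pointer walk is what lets a tool restrict or group sequence search results by taxon: a hit belongs to a clade if that clade's TaxID appears in the hit's lineage.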
