Entrez
The Entrez (/ɒnˈtreɪ/)[1] Global Query Cross-Database Search System is a federated search engine, or web portal that allows users to search many discrete health sciences databases at the National Center for Biotechnology Information (NCBI) website.[2] The NCBI is a part of the National Library of Medicine (NLM), which is itself a department of the National Institutes of Health (NIH), which in turn is a part of the United States Department of Health and Human Services. The name "Entrez" (a greeting meaning "Come in" in French) was chosen to reflect the spirit of welcoming the public to search the content available from the NLM.
Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query string and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system.
Features
The Entrez front page provides, by default, access to the global query. All databases indexed by Entrez can be searched via a single query string, supporting Boolean operators and search term tags to limit parts of the search statement to particular fields. This returns a unified results page that shows the number of hits for the search in each of the databases, each linked to the actual search results for that particular database.
Entrez also provides a similar interface for searching each particular database and for refining search results. The Limits feature allows the user to narrow a search through a web forms interface. The History feature gives a numbered list of recently performed queries. Results of previous queries can be referred to by number and combined via Boolean operators. Search results can be saved temporarily in a Clipboard. Users with a MyNCBI account can save queries indefinitely, and also choose to have updates with new search results e-mailed for saved queries of most databases. Entrez is widely used in the field of biotechnology as a reference tool for students and professionals alike.
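The query features described above (field tags, Boolean operators, and numbered History references) can be sketched as plain string construction. This is a minimal illustration, not NCBI code: the helper names `tag`, `combine`, and `history_ref` are hypothetical, and the field names used in the demo are chosen only as examples.

```python
def tag(term: str, field: str) -> str:
    """Limit a term to one search field using the term[field] notation."""
    return f"{term}[{field}]"


def combine(operator: str, *parts: str) -> str:
    """Join query parts with an uppercase Boolean operator, grouped in parentheses."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {op} ".join(parts) + ")"


def history_ref(number: int) -> str:
    """Refer to an earlier query by its number in the History list."""
    return f"#{number}"


# Field-limited terms combined with AND (field names are illustrative).
query = combine("AND", tag("BRCA1", "Gene Name"), tag("human", "Organism"))
print(query)  # (BRCA1[Gene Name] AND human[Organism])

# Two earlier History entries combined with OR.
print(combine("OR", history_ref(1), history_ref(2)))  # (#1 OR #2)
```

Wrapping every combination in parentheses makes the grouping explicit, so the result does not depend on the order in which the server evaluates operators.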
Databases
Entrez searches the following databases:
- PubMed: biomedical literature citations and abstracts, including Medline—articles from (mainly medical) journals, often including abstracts. Links to PubMed Central and other full-text resources are provided for articles from the 1990s.
- PubMed Central: free, full-text journal articles
- Site Search: NCBI web and FTP web sites
- Books: online books
- Online Mendelian Inheritance in Man (OMIM)
- Nucleotide: sequence database (GenBank)
- Protein: sequence database (GenPept)
- Genome: whole genome sequences and mapping
- Structure: three-dimensional macromolecular structures
- Taxonomy: organisms in GenBank Taxonomy
- dbSNP: single nucleotide polymorphism
- Gene:[3] gene-centered information
- HomoloGene: eukaryotic homology groups
- PubChem Compound: unique small molecule chemical structures
- PubChem Substance: deposited chemical substance records
- Genome Project: genome project information
- UniGene: gene-oriented clusters of transcript sequences
- CDD: conserved protein domain database
- PopSet: population study data sets (epidemiology)
- GEO Profiles: expression and molecular abundance profiles
- GEO DataSets: experimental sets of GEO data
- Sequence read archive: high-throughput sequencing data
- Cancer Chromosomes: cytogenetic databases
- PubChem BioAssay: bioactivity screens of chemical substances
- Probe: sequence-specific reagents
- NLM Catalog: NLM bibliographic data for over 1.2 million journals, books, audiovisuals, computer software, electronic resources, and other materials resident in LocatorPlus (updated every weekday).
Access
In addition to using the search engine forms to query the data in Entrez, NCBI provides the Entrez Programming Utilities[4] (eUtils) for more direct access to query results. The eUtils are accessed by posting specially formed URLs to the NCBI server, and parsing the XML response. There was also an eUtils SOAP interface which was terminated in July 2015.[5]
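The request-and-parse cycle described above can be sketched with the Python standard library. The example builds an ESearch URL and extracts the record count and identifiers from a response. To stay runnable offline it parses a hand-written sample rather than a live reply: the sample XML and its identifiers are made up for illustration, and the helper names `esearch_url` and `parse_esearch` are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one database and one query string."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def parse_esearch(xml_text: str) -> tuple[int, list[str]]:
    """Return (total hit count, list of UIDs) from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count", default="0"))
    ids = [el.text for el in root.findall("./IdList/Id")]
    return count, ids


# Illustrative response body; the identifiers are placeholders, not real records.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count><RetMax>2</RetMax><RetStart>0</RetStart>
  <IdList><Id>11111111</Id><Id>22222222</Id></IdList>
</eSearchResult>"""

print(esearch_url("pubmed", "entrez[Title]"))
print(parse_esearch(SAMPLE_RESPONSE))
```

In a real script, the URL would be fetched with an HTTP client and the response text passed to `parse_esearch`.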
History
In 1991, Entrez was introduced in CD form. In 1993, a client-server version of the software provided connectivity with the internet. In 1994, NCBI established a website, and Entrez was a part of this initial release. In 2001, Entrez bookshelf was released and in 2003, the Entrez Gene database was developed.[6]
References
- ^ "Definition of 'entrez'". Collins Dictionary [Internet].
- ^ NCBI Resource Coordinators (2012). "Database resources of the National Center for Biotechnology Information". Nucleic Acids Research. 41 (Database issue): D8 – D20. doi:10.1093/nar/gks1189. PMC 3531099. PMID 23193264.
- ^ "Home - Gene - NCBI".
- ^ Entrez Utilities. National Center for Biotechnology Information (US). 2010.
- ^ The E-utility Web Service (SOAP). National Center for Biotechnology Information (US). 23 January 2015.
- ^ Smith, Kent. "A Brief History of NCBI's Formation and Growth". The NCBI Handbook [Internet]. 2nd edition. Retrieved 3 May 2014.
Entrez
Introduction
Purpose and Scope
Entrez serves as the National Center for Biotechnology Information's (NCBI) primary text-based search and retrieval system, designed to integrate diverse biomedical databases for unified querying across literature, molecular sequences, and related resources.[1] Developed by NCBI, a division of the U.S. National Library of Medicine, it enables users to perform cross-database searches that connect disparate data types, such as linking a gene sequence to its associated publications or structural models.[1]
The core purpose of Entrez is to facilitate efficient discovery, retrieval, and interconnection of biomedical information, supporting researchers, clinicians, and educators in navigating complex scientific datasets.[1] By providing a single interface for querying over 30 NCBI databases (including those on DNA and protein sequences, genes, genomes, and genetic variations), it streamlines access to interconnected knowledge without requiring users to switch between isolated tools.[1] This integration addresses the need for cohesive exploration in molecular biology, where related data often spans multiple domains.[4]
Entrez's scope is limited to public-domain biomedical data hosted by NCBI, encompassing molecular, genomic, and literature resources while excluding proprietary or non-biomedical content.[1] It emphasizes free, open access to these resources worldwide, with no subscription barriers as of 2025, ensuring broad availability for global scientific use.[1]
Historically, Entrez was first released in 1991 to resolve the fragmented access to molecular biology databases that characterized the pre-1990s era.[2]
Integration with NCBI Resources
Entrez serves as the primary unified interface for accessing and retrieving data from the National Center for Biotechnology Information (NCBI)'s extensive suite of over 30 interconnected databases and tools, enabling users to perform cross-resource searches without needing to navigate multiple standalone platforms.[1] This integration facilitates seamless transitions from Entrez search results to specialized NCBI tools, such as BLAST for sequence similarity searches, Primer-BLAST for designing PCR primers against specific templates, and ClinVar for exploring clinically relevant genetic variants.[1][5] By linking query outputs directly to these resources, Entrez supports efficient workflows for researchers, clinicians, and educators engaging with biomedical data.[1] A key example of this cross-integration is how Entrez queries can feed into visualization and analysis platforms like the Genome Data Viewer and the NCBI Datasets resource, which resulted from the June 2024 merger of the legacy Entrez Genome and Assembly websites to provide streamlined access to genome assemblies and related metadata.[5] Users can initiate a search in Entrez for a gene or sequence, then transition to Datasets for downloading complete genome datasets or to the Genome Data Viewer for interactive browsing of chromosomal contexts, annotations, and alignments.[5] This interconnected approach ensures that data from sources like the Sequence Read Archive (SRA), which alone exceeds 47 petabytes, is accessible through a single entry point.[5] The benefits of Entrez's integration extend to providing "one-stop" access to NCBI's vast repository, encompassing 4.6 billion records across 31 knowledgebases as of August 2024, while handling the underlying indexing and retrieval processes to simplify use for non-specialized users.[5] Entrez employs controlled vocabularies and ontologies, notably Medical Subject Headings (MeSH) for literature indexing in PubMed and the NCBI Taxonomy for organism 
classification, to enable standardized, precise querying across disparate resources.[1][6] These ontologies promote consistent data linkage and discovery, reducing ambiguity in searches involving biomedical terms or evolutionary relationships.[1]
Supported Databases
Literature and Biomedical Databases
Entrez provides access to several key databases focused on biomedical literature and publications, enabling researchers to search, retrieve, and analyze citations, abstracts, and full-text content. These resources form the backbone of literature-based inquiries in the biomedical sciences, supporting evidence-based research and knowledge synthesis. PubMed serves as the primary literature database within Entrez, containing more than 39 million citations and abstracts of biomedical literature sourced from MEDLINE, life science journals, and online books.[7] It includes links to full-text articles where available and employs Medical Subject Headings (MeSH) for precise indexing and retrieval, facilitating targeted searches across diverse topics in medicine and biology.[8] PubMed's coverage extends to journals from the 1940s onward, with comprehensive indexing beginning in 1966 and retrospective inclusion of earlier citations through OLDMEDLINE for pre-1966 literature.[9] PubMed Central (PMC) functions as an open-access subset of PubMed, offering free full-text access to a growing archive of biomedical and life sciences journal articles deposited by publishers and authors. 
As of 2025, PMC supports compliance with the 2024 NIH Public Access Policy, which requires NIH-funded manuscripts to be made publicly available in PMC upon publication, without an embargo period, effective July 1, 2025, thereby enhancing the dissemination of peer-reviewed content.[10]
The NCBI Bookshelf complements these resources by providing free online access to full-text books, reports, and documents in the biomedical, life sciences, health care, and medical humanities fields.[11] Integrated into Entrez, Bookshelf enables contextual reading alongside journal literature, with searchable content from more than 13,000 titles that include authoritative textbooks, technical reports, and educational materials to support in-depth study and reference.[12]
These databases collectively allow Entrez users to perform unified searches across literature holdings, linking citations to related biomedical data for holistic research exploration. Additional resources in this category include the Online Mendelian Inheritance in Man (OMIM) database, which catalogs genes and genetic phenotypes associated with inherited diseases.[13]
Molecular Sequence and Gene Databases
The Nucleotide database in Entrez serves as a comprehensive repository for DNA and RNA sequences, primarily through its integration with GenBank, the annotated collection of publicly available nucleotide sequences submitted by researchers worldwide. GenBank, established in 1982, contains over 5.9 billion records encompassing 47.01 trillion bases as of release 268.0 in August 2025, covering sequences from viruses, prokaryotes, eukaryotes, and organelles.[14] Each record includes detailed annotations such as gene names, protein products, biological source, and literature references, facilitating functional analysis and comparative genomics. Submission to GenBank follows standardized guidelines outlined by the International Nucleotide Sequence Database Collaboration (INSDC), ensuring data quality through validation tools like the Submission Portal and BankIt, which support formats including FASTA and feature annotations for exons, introns, and regulatory elements. The Protein database in Entrez provides a centralized collection of amino acid sequences derived mainly from the conceptual translations of coding regions in nucleotide records, augmented by curated entries from sources like RefSeq, Swiss-Prot, and PDB. This database enables researchers to perform sequence alignments, homology searches, and functional predictions using integrated tools such as BLAST for identifying similar proteins across species. With a focus on non-redundant representations where possible, it supports applications in structural biology, evolutionary studies, and drug discovery by linking sequences to experimental data like enzymatic activities and post-translational modifications. 
Entrez Protein emphasizes practical utilities, including multiple sequence alignment viewers and prediction algorithms for secondary structure and domains, enhancing its role in proteomics workflows.[15] Entrez Gene offers a gene-centered view of genomic information, aggregating curated records from RefSeq and other sources to provide summaries of gene function, location, expression patterns, and interactions for organisms ranging from bacteria to humans. Each gene record includes details on orthologs across species, genetic variants, pathways, and expression data from sources like GEO, with over 50 million loci documented as of 2025. In 2025, NCBI introduced redesigned Gene pages through the Datasets tool, featuring an intuitive interface for downloading sequences, annotations, and metadata in formats like JSON or TSV, improving accessibility for bulk analysis and visualization of gene models. This update integrates variant information from dbSNP, allowing users to explore SNPs, indels, and their clinical implications directly within gene contexts.[16][17] Key to navigating these databases are Entrez's support for standard data formats and identification systems, such as FASTA for sequence retrieval and display, which simplifies importing data into analysis software like sequence aligners or phylogenetic tools. Accession numbers serve as stable identifiers, with the legacy GI (GenInfo Identifier) system supplemented by unique IDs (UIDs) for versioning and tracking updates, ensuring traceability in publications and databases. Furthermore, Gene records link seamlessly to dbSNP for variant analysis, enabling queries on population frequencies and phenotypic associations without leaving the Entrez environment. The dbSNP database itself catalogs single nucleotide polymorphisms (SNPs), insertions, deletions, and other variants, supporting genetic association studies and population genetics. 
These features, combined with cross-links to PubMed for relevant literature, underscore Entrez's utility in integrating molecular sequence data for comprehensive biological research.
Taxonomy and Structural Databases
The Entrez Taxonomy database provides a curated hierarchical classification and nomenclature system for organisms represented in public sequence databases, encompassing over 2.7 million taxonomic nodes as of 2025. This includes detailed lineage information tracing evolutionary relationships from domains to species, facilitating phylogenetic analysis through an interactive taxonomy browser that displays the tree structure and links to related genomic data. The database covers a broad spectrum of life forms, with approximately 595,000 nodes for bacteria, 15,000 for archaea, 1.8 million for eukaryotes (including major subgroups like metazoa, fungi, and viridiplantae), and 273,000 for viruses, enabling researchers to explore organismal diversity in evolutionary and structural biology contexts.[18][19][20] Recent enhancements, including 2024 updates to prokaryotic classifications and integration with metagenomic data, support taxonomic assignment for uncultured microbial communities by incorporating environmental sequencing projects into the hierarchy. These updates align with the International Committee on Taxonomy of Viruses (ICTV) and other standards, improving resolution for viral and bacterial phylogenies. Additionally, the BioProject database within Entrez offers metadata on sequencing initiatives, such as project scope, organism associations, and assembly details, which link directly to taxonomy entries to contextualize large-scale genomic efforts without delving into raw sequence data from sources like GenBank. 
Following the 2024 merger of Entrez Genome and Assembly resources into NCBI Datasets, taxonomy records now provide streamlined access to genome assemblies, enhancing links between organism classifications and structural assemblies for viral, bacterial, and eukaryotic entries.[21][22][23][24][25]
The Entrez Structure database, centered on the Molecular Modeling Database (MMDB), archives three-dimensional molecular structures derived from the Protein Data Bank (PDB), focusing on proteins, nucleic acids, and complexes to support studies in structural biology and evolution. As of March 2025, MMDB contains over 233,000 structure records, each enhanced with annotations like chemical graphs, secondary structure assignments, and cross-references to sequence data for functional inference. These models enable visualization of evolutionary conservation through domain alignments and superposition tools. Integrated with the Cn3D viewer, users can interactively explore 3D structures alongside phylogenetic lineages from Taxonomy, highlighting structural motifs across related organisms without requiring separate software.[26][27][28][29]
Other notable databases in the taxonomy and structural categories include ClinVar, which aggregates information about genomic variations and their relationship to human health.[30]
Core Features
Search and Query Capabilities
Entrez supports a range of search mechanisms designed to facilitate precise retrieval from its integrated databases. Users can construct queries using Boolean operators such as AND, OR, and NOT, which must be entered in uppercase to ensure proper processing. These operators allow for complex combinations, evaluated from left to right unless parentheses are used to group terms, as in the example "g1p3 AND (response element OR promoter)".[1]
Field-specific searches enhance targeting by restricting terms to particular data elements, using square bracket notation like [field]. For instance, in PubMed, [tiab] limits searches to titles and abstracts, while [au] specifies authors and [organism] denotes species. Advanced filters further refine queries, including date ranges (e.g., "2015/3/1:2016/4/30[Publication Date]") and MeSH terms (e.g., "neoplasms[MeSH Terms]"), enabling users to narrow results by publication date, organism, or other indexed attributes.[1]
The Global Query feature provides a unified entry point, allowing a single search string to span all Entrez databases simultaneously via the NCBI homepage. This returns ranked results across databases, ordered by relevance scoring based on term frequency and proximity, with options to filter by database type for focused exploration.[1]
Search History maintains a record of recent queries for up to eight hours of inactivity, permitting users to revisit, combine, or modify them through the Advanced Search interface. Complementing this, the Clipboard temporarily stores up to 500 search results per database before further actions are taken. Results from either can be exported via the "Send to" menu in formats such as XML or CSV, depending on the database, for offline analysis or integration with other tools.[1]
Linking and Cross-Database Navigation
Entrez employs a sophisticated system of hyperlinks known as links to facilitate navigation between related records within and across its integrated databases, enabling users to discover contextual connections without reformulating searches. These links are categorized into two primary types: hard links and neighbor links. Hard links are direct, predefined connections derived from the inherent data relationships in records, such as a PubMed article linking to the Gene entry it cites or a Protein sequence record connecting to its corresponding three-dimensional structure in the Structure database.[1] Neighbor links, in contrast, are computationally generated associations that identify similarities or co-occurrences, such as linking a nucleotide sequence to its taxonomic lineage in the Taxonomy database or suggesting related articles in PubMed based on shared content.[1] This dual approach allows for both explicit and inferred navigation, enhancing the discovery of biological relationships.[1] A key feature of Entrez's cross-database navigation is the use of neighbor links to generate related searches, which provide suggestions based on patterns like co-citation or sequence similarity. For instance, searching for a specific gene in the Gene database may yield neighbor links to homologous sequences in the Nucleotide or Protein databases, derived from alignment algorithms that detect evolutionary relationships.[31] These suggestions appear as facets or sidebar options in search results, allowing users to pivot seamlessly to pertinent data in other databases, such as from a literature abstract to associated genomic variants in dbSNP.[1] By prioritizing these automated connections, Entrez supports exploratory analysis, where users can trace pathways from molecular data to functional annotations without manual intervention.[1] Entrez's linking system incorporates unique concepts like Related Structures and NCBI Orthologs to represent complex biological networks. 
Related Structures uses the Vector Alignment Search Tool (VAST) to compute neighbor links between protein structures based on three-dimensional similarity, enabling navigation from one structure record to others with analogous folds or functions, such as linking a query enzyme to evolutionarily conserved homologs.[1] Similarly, NCBI Orthologs aggregates orthologous genes across species through automated detection, providing links from a Gene record to 1:1 orthologs in over 100 species, which aids in comparative genomics.[25] These features rely on underlying indexing that groups records by shared attributes, forming conceptual graphs of relatedness.[31]
For efficient large-scale navigation, Entrez supports batch linking via Unique Identifiers (UIDs), allowing programmatic retrieval of connections for multiple records simultaneously through tools like the E-utilities' elink function. This capability is particularly useful for workflows involving high-throughput data, where users can fetch neighbor links for an entire set of PubMed articles to their cited genes or proteins in one operation.[32] Overall, this hyperlink infrastructure transforms static database entries into a dynamic, interconnected knowledge graph, promoting interdisciplinary insights in biomedical research.[1]
Access and Usage
Web-Based Interface
The Entrez web-based interface provides an intuitive browser-accessible entry point for users to search and retrieve data from NCBI's interconnected databases, centered around a prominent search bar located at the top of the NCBI homepage. This search bar allows users to enter queries using natural language terms, phrases, Boolean operators (such as AND, OR, and NOT), wildcards, and field-specific restrictions, with a pull-down menu for selecting from over 30 supported databases. Below the search bar, options for advanced search link to a dedicated builder tool that enables constructing complex queries via indexed fields and maintains a search history for iterative refinement. The overall layout emphasizes simplicity and accessibility, including skip-to-content links and access keys for keyboard navigation, ensuring compliance with web standards for users with disabilities.[1] Upon submitting a search, results appear in a paginated summary view, displaying 20 records per page by default, with adjustable settings via a "Display Options" menu to show 10, 50, 100, or 200 items. The left-hand sidebar features facets for filtering results by attributes like publication date, organism, or availability of full text, allowing users to narrow large result sets efficiently. Individual records can be expanded to full views tailored to the database, revealing detailed metadata, abstracts, or sequences, while a "Send To" dropdown facilitates exporting selections to formats such as CSV, XML, or direct integration with tools like citation managers. Pagination controls at the bottom of result pages enable navigation through thousands of hits, supporting workflows from broad discovery to targeted retrieval.[1] To aid users, the interface integrates comprehensive help resources, including inline tooltips, a searchable help manual with tutorials on query syntax and navigation, and guided examples for common tasks. 
Integration with NCBI Accounts via My NCBI allows registered users to save searches, set up email alerts for new results, and store collections of records for later access, addressing limitations in anonymous sessions by persisting preferences across devices. The interface incorporates a responsive, mobile-first design that adapts to various screen sizes, enhancing usability on tablets and smartphones without requiring separate apps. As of 2025, these features reflect ongoing refinements to streamline the user experience, with no major redesign implemented.[1][33]
Programmatic and API Access
Entrez offers programmatic access primarily through the Entrez Programming Utilities (E-utilities), a suite of nine server-side programs that provide a stable interface for querying and retrieving data from its interconnected databases.[34] These utilities enable developers to perform operations such as searching, fetching records, summarizing data, and linking across databases, supporting output formats including XML and, for select utilities, JSON.[3] Key examples include ESearch, which retrieves unique identifiers (UIDs) matching a query term, and EFetch, which downloads full records based on those UIDs.[35]
To prevent server overload, NCBI imposes rate limits on E-utilities requests: three per second without an API key and ten per second with a registered key obtained via an NCBI account.[36] Developers must adhere to these guidelines, which also recommend batching large jobs by using the WebEnv parameter and History server to store intermediate search results as temporary sessions, allowing subsequent utilities to reference and process them efficiently without repeated full queries.[3] For instance, a workflow might involve EPost to upload a large list of UIDs into a history session, followed by EFetch in batches to retrieve records while respecting limits.
Several programming libraries and tools simplify integration with E-utilities.
In Python, Biopython's Bio.Entrez module wraps the utilities, offering functions like esearch() and efetch() that handle URL construction, XML parsing, and automatic rate limiting.[37] For R users, the rentrez package provides similar functionality, including entrez_search() for querying and entrez_fetch() for data retrieval, with built-in support for API keys and JSON output.[38] On Unix-like systems, Entrez Direct (EDirect) enables command-line scripting through executables like esearch and efetch, facilitating pipeline automation and integration with tools such as awk or sed for data processing; EDirect was updated to version 24.2 on June 20, 2025, with refactored archive paths.[32][39]
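For scripts that call the utilities directly instead of through one of these libraries, the batching and rate-limit guidance above can be sketched as follows. This is a minimal illustration under stated assumptions. The helper names are hypothetical, and the WebEnv value is a placeholder standing in for the session string that a prior ESearch or EPost call would return. The throttle is deliberately simple: it inserts a fixed pause between requests and does not measure elapsed time.

```python
import time
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def min_interval(api_key=None):
    """Seconds between requests: 3 per second without a key, 10 per second with one."""
    return 1 / 10 if api_key else 1 / 3


def efetch_batch_urls(db, webenv, query_key, total, batch_size=200, api_key=None):
    """Build one EFetch URL per batch of a result set stored on the History server."""
    urls = []
    for start in range(0, total, batch_size):
        params = {
            "db": db,
            "WebEnv": webenv,          # session string from a prior ESearch/EPost
            "query_key": query_key,    # which stored result set to read
            "retstart": start,         # offset of the first record in this batch
            "retmax": min(batch_size, total - start),
        }
        if api_key:
            params["api_key"] = api_key
        urls.append(EUTILS_BASE + "efetch.fcgi?" + urlencode(params))
    return urls


def throttled(items, interval, sleep=time.sleep):
    """Yield items one by one, pausing `interval` seconds between consecutive items."""
    for i, item in enumerate(items):
        if i:
            sleep(interval)
        yield item


# Placeholder session values; a real run would take them from an ESearch response.
batch_urls = efetch_batch_urls("pubmed", "WEBENV_PLACEHOLDER", 1, total=450)
for url in throttled(batch_urls, min_interval()):
    print(url)  # replace with an actual HTTP request
```

Passing `sleep` as a parameter keeps the throttle testable without real delays.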
While the web-based interface serves manual exploration, E-utilities and associated libraries are designed for scripted, high-volume access in research workflows, ensuring compliance with NCBI's policies on data usage and attribution. The E-utilities documentation was last updated on March 25, 2025.[3][34]
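Cross-database linking, one of the operations listed above, goes through the ELink utility. The sketch below only constructs the request URL. The helper name `elink_url` and the UIDs in the demo are illustrative assumptions. Repeating the `id` parameter asks for a separate link set per input record, while a single comma-separated `id` value asks for one combined set.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def elink_url(dbfrom, db, uids, per_uid=True):
    """Build an ELink URL that maps UIDs in `dbfrom` to linked records in `db`."""
    params = [("dbfrom", dbfrom), ("db", db)]
    if per_uid:
        # One id parameter per UID: links are reported separately for each record.
        params += [("id", str(uid)) for uid in uids]
    else:
        # One comma-separated id parameter: links are merged into a single set.
        params.append(("id", ",".join(str(uid) for uid in uids)))
    return EUTILS_BASE + "elink.fcgi?" + urlencode(params)


# Placeholder UIDs, used only to show the shape of the URL.
print(elink_url("pubmed", "gene", [11111111, 22222222]))
print(elink_url("pubmed", "gene", [11111111, 22222222], per_uid=False))
```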