
European Bioinformatics Institute

from Wikipedia

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff.[4]


In addition, EMBL-EBI hosts training programs that teach scientists the fundamentals of working with biological data and promote the wide range of bioinformatics tools available for their research, whether or not those tools are developed at EMBL-EBI.

Bioinformatic services


One of the roles of the EMBL-EBI is to index and maintain biological data in a set of databases, including Ensembl (housing whole genome sequence data), UniProt (protein sequence and annotation database) and Protein Data Bank (protein and nucleic acid tertiary structure database). A variety of online services and tools is provided, such as Basic Local Alignment Search Tool (BLAST) or Clustal Omega sequence alignment tool, enabling further data analysis.

BLAST


BLAST[5] is an algorithm for comparing the primary structure of biological macromolecules, most often the nucleotide sequences of DNA/RNA and the amino acid sequences of proteins, between a query sequence and the sequences stored in bioinformatic databases. The algorithm scores the available sequences against the query using a scoring matrix such as BLOSUM62. The highest-scoring sequences represent the closest relatives of the query in terms of functional and evolutionary similarity.[6]
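
The ranking idea can be sketched in a few lines of Python. This is not BLAST itself (there is no seeding, extension, or gap handling); it only scores ungapped, equal-length alignments and sorts the results. The score values are invented for illustration and are not taken from BLOSUM62, and the names `TOY_SIMILAR`, `score_alignment`, and `rank_hits` are hypothetical.

```python
from typing import Dict, List, Tuple

# Illustrative scores only (NOT BLOSUM62): identical residues score highest,
# one chemically similar pair scores slightly positive, all else is penalised.
MATCH = 4
MISMATCH = -1
TOY_SIMILAR: Dict[Tuple[str, str], int] = {("I", "L"): 2, ("L", "I"): 2}


def score_pair(a: str, b: str) -> int:
    """Score one aligned residue pair with the toy matrix."""
    if a == b:
        return MATCH
    return TOY_SIMILAR.get((a, b), MISMATCH)


def score_alignment(query: str, subject: str) -> int:
    """Sum pair scores over an ungapped alignment of equal-length sequences."""
    if len(query) != len(subject):
        raise ValueError("this sketch only handles equal-length sequences")
    return sum(score_pair(a, b) for a, b in zip(query, subject))


def rank_hits(query: str, database: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (name, score) pairs ordered from best to worst score."""
    scored = [(name, score_alignment(query, seq)) for name, seq in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `rank_hits("MKLV", {"a": "MKLV", "b": "MKIV", "c": "AAAA"})` places the identical sequence first and the unrelated one last, which is the behaviour the paragraph above describes.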

A BLAST database search requires input data in a supported format (e.g. FASTA, GenBank, PIR or EMBL format). Users may also designate the specific databases to be searched and select the scoring matrix and other parameters before running the tool. The best hits in the BLAST results are ordered by their calculated E-value (the number of hits with a similar or higher score that would be expected to occur in the database by chance).[7]
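
As a rough illustration of how an E-value relates to a raw alignment score, the sketch below uses the Karlin-Altschul form E = K·m·n·e^(−λS), where m and n are the query and database lengths. The values passed for `lam` and `k` in any call are assumptions chosen by the caller; real values depend on the scoring matrix and gap penalties, and BLAST additionally applies length corrections that are omitted here. The function names are hypothetical.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalise a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, query_len: int, db_len: int, lam: float, k: float) -> float:
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_at_least_one_hit(e: float) -> float:
    """Probability of one or more chance hits, assuming Poisson-distributed hits."""
    return -math.expm1(-e)
```

The last helper shows why an E-value is an expected count rather than a probability: the two only coincide approximately when E is much smaller than 1.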

Clustal Omega


Clustal Omega[8] is a multiple sequence alignment (MSA) tool that finds an optimal alignment of at least three and at most 4000 input DNA or protein sequences.[9] The Clustal Omega algorithm employs two profile hidden Markov models (HMMs) to derive the final alignment of the sequences. The output of Clustal Omega may be visualized as a guide tree (the phylogenetic relationship of the best-pairing sequences) or ordered by the mutual sequence similarity between the queries.[10] The main advantage of Clustal Omega over other MSA tools (Muscle, ProbCons) is its efficiency, which it achieves without a substantial loss of accuracy.

Ensembl


Based at the EMBL-EBI, Ensembl[11] is a database organized around genomic data and maintained by the Ensembl Project. Tasked with the continuous annotation of the genomes of model organisms, Ensembl provides researchers with a comprehensive resource of relevant biological information about each genome. The annotation of the stored reference genomes is automatic and sequence-based. Ensembl encompasses a publicly available genome database that can be accessed via a web browser. Users can interact with the stored data through a graphical UI, which displays data at multiple levels of resolution, from karyotype, through individual genes, to nucleotide sequence.[12]

Originally centered on vertebrate animals, Ensembl has since 2009 also provided annotated data on the genomes of plants, fungi, invertebrates, bacteria and other species through the sister project Ensembl Genomes. As of 2020, the various Ensembl project databases together house over 50,000 reference genomes.[13]

PDB


The Protein Data Bank (PDB)[14] is a database of three-dimensional structures of biological macromolecules, such as proteins and nucleic acids. The data are typically obtained by X-ray crystallography or nuclear magnetic resonance spectroscopy (NMR spectroscopy), and are submitted manually by structural biologists worldwide through the PDB member organizations: PDBe, RCSB, PDBj and BMRB. The database can be accessed through the webpages of its members, including PDBe (housed at the EMBL-EBI). As a member of the Worldwide Protein Data Bank (wwPDB) consortium, PDBe aids in the joint mission of archiving and maintaining macromolecular structure data.[15]

UniProt


UniProt is an online repository of protein sequence and annotation data, distributed across the UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef) and UniProt Archive (UniParc) databases. Its component resources began as separate ventures: EMBL-EBI and the Swiss Institute of Bioinformatics (SIB) together maintained Swiss-Prot and TrEMBL, while the Protein Information Resource (PIR) housed the Protein Sequence Database. The increase in global protein data generation led the three organizations to collaborate in creating UniProt in 2002.[16]

The protein entries stored in UniProt are cataloged by a unique UniProt identifier. The annotation data collected for each entry are organized in logical sections (e.g. protein function, structure, expression, sequence or relevant publications), giving a coordinated overview of the protein of interest. Links to external databases and original sources of data are also provided. In addition to a standard search by protein name or identifier, the UniProt webpage houses tools for BLAST searching, sequence alignment and searching for proteins containing specific peptides.[17]
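
Because entries are keyed by identifier, it is often useful to check that a string looks like a UniProt accession before querying. The sketch below uses the accession pattern as published in the UniProt help pages, reproduced here from memory, so it should be checked against the current documentation before being relied on. The function name `is_uniprot_accession` is hypothetical.

```python
import re

# Accession pattern as documented by UniProt (6- or 10-character accessions).
# Treat as an assumption and confirm against the current UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(text: str) -> bool:
    """Return True if the whole string has the shape of a UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(text) is not None
```

A syntactic match only says the string is well-formed; it does not confirm that the entry exists in the database.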

AlphaFold DB


The AlphaFold Protein Structure Database (AlphaFold DB) is a collaborative project with Google DeepMind to make predicted protein structures from the AlphaFold AI system freely available to the scientific community.[18] The first release of the database was in 2021; as of 2024, AlphaFold DB provides access to over 214 million protein structures.[19]

from Grokipedia
The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental outstation of the European Molecular Biology Laboratory (EMBL), a pan-European research organization founded in 1974 and now comprising 29 member states, dedicated to advancing molecular biology through open data infrastructure and computational tools.[1] Created following a 1992 decision of the EMBL Council and located on the Wellcome Genome Campus in Hinxton, United Kingdom, including the new Thornton Building opened in 2024,[2] EMBL-EBI functions as the world's leading provider of public biomolecular data, enabling global life sciences research by curating, disseminating, and analyzing vast datasets from genomics, proteomics, and structural biology.[3][4]

EMBL-EBI's core mission is to promote scientific discovery by maintaining freely accessible databases and developing bioinformatics resources that support researchers in addressing challenges from basic biology to clinical applications, such as drug discovery, vaccine development, and genomic medicine.[5] Key services include the European Nucleotide Archive (ENA) for raw sequencing data, the European Genome-phenome Archive (EGA) for controlled-access genomic datasets, UniProt for protein sequence and function information, Ensembl for genome annotation, and the Protein Data Bank in Europe (PDBe) for biomolecular structures, along with others such as ChEMBL for chemical biology and the BioImage Archive for microscopy data.[4] EMBL-EBI's resources are widely used in immunoinformatics and support research into peptide- and protein-based vaccines.[6]

In 2023 these resources handled over 15 petabytes of data deposits and served 4.8 million unique users worldwide, while facilitating collaborations such as the AlphaFold Protein Structure Database, which contributed to the 2024 Nobel Prize in Chemistry for protein structure prediction.[4][5] With over 850 staff members from more than 70 nationalities,[7] EMBL-EBI conducts basic research in computational biology, offers extensive training programs (including 21 in-person courses and 37 webinars, with online training resources reaching over 600,000 unique users in 2023), and partners with industry and initiatives like ELIXIR to ensure long-term data preservation and interoperability across Europe.[5] Funded by EMBL member states, the European Commission, Wellcome, UK Research and Innovation, and the US National Institutes of Health, the institute emphasizes open science principles, making its tools, software, and datasets available without restrictions to foster innovation in areas like biodiversity, pathogens, and rare diseases.[4][5]

History

Establishment

The European Bioinformatics Institute (EMBL-EBI) was established on 1 September 1994 in Hinxton, Cambridgeshire, United Kingdom, as the third outstation of the European Molecular Biology Laboratory (EMBL).[8][9][10] This founding followed a decision by the EMBL Council in 1992 to create a dedicated European center for bioinformatics services, marking the transition of core operations from EMBL's headquarters in Heidelberg, Germany.[11] The relocation began that year and involved transferring the EMBL Data Library—Europe's primary repository for nucleotide sequences—and the collaborative SWISS-PROT protein sequence database, which had been maintained jointly with the University of Geneva.[12][13] EMBL-EBI's initial objectives centered on centralizing bioinformatics infrastructure across Europe to facilitate the collection, annotation, analysis, and public dissemination of molecular biology data, thereby enabling collaborative research in the life sciences.[14][8] The institute operated from the outset under the founding leadership of Graham Cameron, who served as its first director from 1994 to 2003 and had previously headed EMBL's Data Library in Heidelberg.[15] As an integral part of EMBL—an intergovernmental organization—EMBL-EBI received funding through contributions from EMBL's member states, which numbered 15 by the mid-1990s.[16][17]

Key Milestones

In the 1990s, EMBL-EBI expanded its core data resources, including the launch of the EMBL Nucleotide Data Bank in 1994 as a predecessor to the modern European Nucleotide Archive (ENA), which served as a central repository for nucleotide sequences.[8] Simultaneously, the integration of the SWISS-PROT protein knowledgebase, initially hosted with around 38,000 entries, laid the foundation for what would evolve into UniProt through subsequent mergers and updates.[8][18]

The 2000s marked substantial growth for EMBL-EBI, building on the Ensembl genome annotation project, which had been initiated in 1999 as a joint effort with the Wellcome Sanger Institute to provide automated annotations for the human genome and other species ahead of the Human Genome Project's completion.[19] The institute's facilities were consolidated at the Wellcome Genome Campus in Hinxton, with expansions enhancing computational infrastructure by the early 2000s.[20] Staff numbers grew rapidly, surpassing 200 by 2005 to support the increasing demands of genomic data management.[21]

During the 2010s, EMBL-EBI advanced its technological capabilities, notably adopting cloud computing solutions for data processing and storage, such as the R Cloud service launched in 2010 to enable remote access to large datasets.[22] In 2014, the institute celebrated its 20th anniversary with events emphasizing the role of big data in biological research, highlighting the exponential growth in sequence and structural information.[15]

The 2020s brought transformative highlights, including EMBL-EBI's rapid response to the COVID-19 pandemic through the launch of the COVID-19 Data Portal in April 2020, which aggregated SARS-CoV-2 genomic, proteomic, and clinical data to accelerate global research efforts.[23] In July 2021, in collaboration with DeepMind, EMBL-EBI released the AlphaFold Protein Structure Database, providing predicted 3D structures for over 350,000 proteins from key organisms and enabling breakthroughs in structural biology.[24] By 2024, the database had expanded to cover over 214 million protein structures, vastly increasing accessible structural predictions. In October 2025, EMBL-EBI and Google DeepMind renewed their partnership, releasing a major update to the AlphaFold Database to better align with UniProt.[25] These developments aligned with EMBL's 2022–2026 "Molecules to Ecosystems" programme, which integrates molecular data with environmental and ecosystem-level analyses to address complex biological challenges.[26]

By 2024, EMBL-EBI's resources handled an average of 123 million daily web requests from 42 million unique IP addresses annually, underscoring their global scale.[17] These services have enabled over 100,000 scientific publications each year, supporting advancements across biomedicine, agriculture, and environmental science.[17]

Organization and Governance

Structure within EMBL

The European Bioinformatics Institute (EMBL-EBI) operates as one of the six sites of the European Molecular Biology Laboratory (EMBL), an intergovernmental organization established in 1974 and headquartered in Heidelberg, Germany.[27] The sites outside Heidelberg are located in Barcelona (Spain), Grenoble (France), Hamburg (Germany), Hinxton (United Kingdom, hosting EMBL-EBI), and Rome (Italy), each focusing on distinct aspects of molecular biology research, services, and training.[28] As of 2025, EMBL comprises 29 full member states, including founding members such as Austria, Denmark, France, Germany, Israel, Italy, the Netherlands, Sweden, Switzerland, and the United Kingdom, along with later accessions like Latvia in 2024.[29][16]

EMBL-EBI reports directly to EMBL's Director General, who is appointed by and accountable to the EMBL Council, the organization's primary governing body composed of representatives from all member states.[30] The EMBL Council meets biannually to oversee strategic direction, financial compliance, and programme approval, ensuring alignment across all sites.[31] Additionally, a Scientific Advisory Committee of independent experts provides strategic advice to the Council on scientific programmes and priorities, including those relevant to bioinformatics infrastructure at EMBL-EBI.[32]

Funding for EMBL-EBI is derived from multiple sources, with EMBL member state contributions accounting for 42% of its operating expenditure in 2024, supplemented by external grants (29%), capital awards (23%), and commercial collaborations (6%).[33] Key external funders include the European Commission through Horizon Europe, the Wellcome Trust, UK Research and Innovation (UKRI), and the US National Institutes of Health, supporting specific projects and infrastructure.[33][4] The total operating expenditure for EMBL-EBI in 2024 was approximately €105.5 million, reflecting its scale in managing global bioinformatics resources.[33]

As a non-profit entity within the EMBL framework, EMBL-EBI adheres to an open-access policy, making its data resources freely available under permissive licenses such as CC0 where applicable, to promote unrestricted reuse by the global research community.[34] This model emphasizes data resilience through robust archiving and adherence to the FAIR principles (data should be Findable, Accessible, Interoperable, and Reusable) to facilitate discovery and collaboration in life sciences.[34][4]

EMBL-EBI serves as the hub for ELIXIR, Europe's distributed bioinformatics infrastructure, coordinating activities across 21 member countries and their national nodes to build sustainable data management capabilities.[35] Through this role, it fosters collaboration between national organizations, such as the Dutch Techcentre for Life Sciences in the Netherlands, to integrate and standardize bioinformatics tools and datasets continent-wide.[35]

Leadership and Staff

The European Bioinformatics Institute (EMBL-EBI) is led by Johanna (Jo) McEntyre as Interim Director since March 2025, following Ewan Birney's transition to the role of Executive Director for the broader European Molecular Biology Laboratory (EMBL).[36][37] Birney, who served as EMBL-EBI Director from 2015 to 2025, has overseen strategic advancements in computational biology, including the integration of large-scale genomic data resources and open science initiatives.[38][39] Associate Directors support specialized functions, such as services, research, and operations. For instance, Cath Brooksbank serves as Head of Training, leading programs that reach tens of thousands of researchers annually through workshops and online resources focused on bioinformatics skills.[40] Historically, Rolf Apweiler held roles as Joint Director from 2015 to 2024 and Associate Director until his retirement in October 2025, contributing to the development of protein databases like UniProt.[41][42] As of 2025, EMBL-EBI employs approximately 700 full-time equivalents, comprising bioinformaticians, data curators, software engineers, and early-career trainees such as PhD students and postdocs.[43] The workforce is highly diverse, drawing from over 60 countries, which fosters interdisciplinary collaboration in molecular biology data management.[43] Recruitment emphasizes interdisciplinary expertise in biology, computer science, and data science, with dedicated programs for early-career researchers including fellowships and training pathways that prioritize inclusivity and no restrictions on nationality, gender, or age.[44][45] Under current leadership, EMBL-EBI has advanced AI integration for data curation, using large language models to streamline annotation processes while ensuring accuracy, as detailed in 2024 institutional reports and ongoing pilots.[33][46]

Facilities and Location

Hinxton Campus

The European Bioinformatics Institute (EMBL-EBI) is situated on the Wellcome Genome Campus in the village of Hinxton, Cambridgeshire, United Kingdom, approximately 10 miles (16 km) south of Cambridge. This 125-acre site, shared with the Wellcome Sanger Institute and other genomics organizations, provides a dedicated hub for bioinformatics research and collaboration.[17][47] EMBL-EBI was established on the Hinxton campus in September 1994, following the relocation of key bioinformatics services from temporary facilities in Heidelberg, Germany, a process that began in 1992. Designed to capitalize on the growing field of genomics, the site transitioned to permanent buildings around 2000, enabling expanded operations and interdisciplinary partnerships. By 2005, further expansions addressed space needs for a growing staff, moving beyond initial temporary accommodations to support over 400 researchers.[12][48] The campus amenities include state-of-the-art laboratories, secure data centers for managing vast biological datasets, and expansive green spaces that promote well-being and informal collaboration. Its location near Cambridge facilitates strong ties with academic institutions, enhancing knowledge exchange in life sciences. Sustainability efforts are prominent, with buildings like the Thornton Building achieving a BREEAM Excellent certification (76% score as of 2024) through energy-efficient designs, including optimized systems for petabyte-scale data storage that minimize environmental impact. In March 2025, the Thornton Building opened as EMBL-EBI's third permanent facility, providing space for collaborative research on topics such as infectious diseases and biodiversity.[49][50] The Hinxton site fosters a vibrant community by hosting public events, guided tours, and educational programs, while integrating with the local ecosystem through initiatives like the Wetlands Nature Reserve for biodiversity monitoring. 
These activities align with EMBL's broader programmes in environmental research and public engagement.[51][52][53]

Infrastructure

The European Bioinformatics Institute (EMBL-EBI) relies on a robust technological infrastructure to manage and disseminate vast biological datasets. Its data storage systems encompass over 300 petabytes of raw capacity, comprising flash, SSD, disk, and tape storage types that house approximately 25 billion files and objects.[54] To ensure resilience and scalability, EMBL-EBI employs a hybrid cloud model integrating public providers such as AWS and Google Cloud alongside private cloud platforms, with replicated storage distributed across three geographically separate locations for high availability and automatic failover.[54][55] This setup supports intense usage, processing 3.5 billion web requests per month from an average of 5.6 million unique IP addresses in 2024.[56] Computing resources at EMBL-EBI include high-performance computing (HPC) clusters optimized for handling large-scale data analysis, particularly in artificial intelligence (AI) and machine learning (ML) applications.[57] These clusters facilitate advanced tasks such as protein structure prediction and genomic sequencing, with recent integrations of large language models (LLMs) enhancing text mining and curation workflows—for instance, LLMs automate annotation of scientific literature to accelerate database updates.[58][59] The open-source BioChatter framework exemplifies this, enabling customizable LLM applications for biomedical research.[60] The software ecosystem supports seamless data interactions through open-source tools for submission and retrieval, including the Job Dispatcher for sequence analysis and DBfetch for efficient data access.[61][62] RESTful APIs provide programmatic interfaces, allowing developers to integrate EMBL-EBI resources like Ensembl and UniProt into external pipelines without restrictions beyond data owner policies.[63] Security measures align with the EU General Data Protection Regulation (GDPR) and FAIR data principles, while open science policies promote unrestricted 
access under machine-readable licenses.[64][65] Long-term preservation is achieved via geo-dispersed backups in public clouds and collaborations with international consortia such as the International Nucleotide Sequence Database Collaboration (INSDC).[65] Recent infrastructure upgrades in 2024 and 2025 have emphasized AI-driven capabilities, including expanded cloud integrations to manage surging demand from resources like the AlphaFold Database, which now hosts over 200 million protein structure predictions and serves millions of users globally.[25] These enhancements, part of a renewed partnership with Google DeepMind, incorporate multiple sequence alignments and isoform support to bolster predictive accuracy and usability amid post-AlphaFold traffic growth.[25]

Bioinformatics Databases

Ensembl

Ensembl is a flagship bioinformatics resource developed by the European Bioinformatics Institute (EMBL-EBI) in collaboration with the Wellcome Sanger Institute, launched in 1999 to provide automated annotation and analysis of large-scale genomic data during the Human Genome Project era.[66] The project initially focused on vertebrate genomes, with its public website debuting in July 2000 to disseminate draft human genome annotations ahead of formal publication.[19] Over the years, Ensembl has evolved into a comprehensive platform integrating sequence data, gene models, and comparative analyses, supporting research across evolutionary biology and medicine. The core functions of Ensembl encompass automated gene annotation, identification of regulatory features such as promoters and enhancers, and variant effect prediction to assess the impact of genetic variations on genomic elements.[67] Central to these capabilities is the Ensembl Variant Effect Predictor (VEP), a tool that annotates variants—including single nucleotide polymorphisms, insertions, deletions, and structural variants—by predicting their consequences on transcripts, proteins, and regulatory regions, incorporating scores like SIFT and PolyPhen for functional impact.[68] These features enable researchers to explore genome architecture and functional elements without relying on manual curation. 
Ensembl integrates diverse data sources to facilitate holistic genomic inquiries, linking genomic sequences from the European Nucleotide Archive (ENA) and protein annotations from UniProt to provide context for gene functions and evolutionary relationships.[69] It supports comparative genomics by aligning sequences across species, highlighting conserved regions and orthologs, particularly among vertebrates but extending to broader eukaryotic and prokaryotic datasets through Ensembl Genomes.[70] This interconnected framework allows users to trace evolutionary changes and identify disease-associated variants in a multi-species context. Ensembl powers genomic research in areas such as human health, disease susceptibility, and evolutionary biology, serving as a foundational resource for projects analyzing genetic diversity and regulatory mechanisms.[67] It receives millions of daily web requests, reflecting its widespread adoption by the global research community.[71] Complementary tools like BLAST enable sequence similarity searches that can be combined with Ensembl's annotation pipelines for deeper analysis. Updates to Ensembl occur through regular releases, approximately every three months, incorporating new genome assemblies, refined annotations, and expanded datasets.[72] In 2024, enhancements included support for long-read sequencing data via the GENCODE Comprehensive Long-read Sequencing project, improving transcript isoform resolution and annotation accuracy for complex genomes.[73] The current release, Ensembl 115 from September 2025, covers 314 vertebrate species in its core database, with Ensembl Genomes extending to over 4,800 eukaryotic and 31,300 prokaryotic genomes for broader comparative studies.[74][75]
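
Ensembl annotations can also be retrieved programmatically. The sketch below targets the public Ensembl REST service at `rest.ensembl.org` and its `lookup/id` endpoint; the helper names are hypothetical, the timeout is an arbitrary choice, and the response fields are not assumed here because they vary by feature type. The identifier used in the usage note is only an example.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(stable_id: str, expand: bool = False) -> str:
    """Build the lookup URL for an Ensembl stable identifier."""
    url = f"{ENSEMBL_REST}/lookup/id/{quote(stable_id, safe='')}"
    # expand=1 asks the service to include child features such as transcripts.
    return url + "?expand=1" if expand else url


def fetch_lookup(stable_id: str, timeout: float = 30.0) -> dict:
    """Fetch the JSON record for one identifier (performs a network request)."""
    request = Request(lookup_url(stable_id), headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `lookup_url("ENSG00000139618")` yields the address for a human gene record, and `fetch_lookup` would download it; the service applies rate limits, so bulk use should be throttled.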

UniProt

UniProt, developed and maintained by the European Bioinformatics Institute (EMBL-EBI) in collaboration with the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR), emerged in 2002 as a unified resource through the merger of the manually curated Swiss-Prot database, the automatically annotated TrEMBL database, and the PIR Protein Sequence Database (PIR-PSD).[76][77][78] This consolidation addressed the growing volume of protein data from genomic projects, creating a centralized hub for high-quality protein information. As of the 2025_04 release on October 15, 2025, UniProt encompasses approximately 199 million protein sequences, following a reduction to focus on high-quality reference proteomes. This release implemented a major update by restricting UniProtKB/TrEMBL to sequences from reference proteomes, removing approximately 54 million redundant entries to enhance data quality and focus on representative sequences.[79][80][81] The core of UniProt is the UniProt Knowledgebase (UniProtKB), divided into two sections: UniProtKB/Swiss-Prot, which provides expertly curated entries with detailed functional annotations for a subset of sequences, and UniProtKB/TrEMBL, which includes computationally predicted annotations for the vast majority of sequences to ensure comprehensive coverage.[82][83] Complementing UniProtKB are UniRef clusters, which reduce redundancy by grouping similar sequences at 100%, 90%, or 50% identity thresholds to facilitate comparative analyses, and UniParc, a non-redundant archive that preserves all protein sequences from public databases without annotations to track historical versions and isoforms.[84][85][86] UniProt annotations focus on protein function and structure, including functional domains, post-translational modifications (PTMs) such as phosphorylation and glycosylation, and molecular interactions like protein-protein binding sites.[87][88] These are standardized using controlled vocabularies, notably the 
Gene Ontology (GO) for molecular function, biological process, and cellular component terms, enabling consistent cross-species comparisons.[89] Curation in UniProt combines manual expert annotation with automated methods, including rule-based systems and artificial intelligence (AI) for propagating information from curated templates to uncharacterized proteins.[90][91][92] The manual process involves literature review from over 500,000 publications, sequence analysis, and family-based curation to ensure accuracy and evidence-based claims, with automatic annotation handling the scale of incoming data.[93][94] UniProt supports critical applications in proteomics by providing reference sequences and functional data for mass spectrometry identification, in drug discovery through insights into protein targets, variants, and interaction networks, and in vaccine research by providing comprehensive protein sequence and functional annotations used to identify vaccine antigens and support immunoinformatics studies for peptide/protein-based vaccines.[95][96] Access is facilitated via a RESTful API for programmatic queries and bulk downloads of datasets in formats like FASTA and XML, allowing integration into workflows and large-scale analyses.[97][98] It also links to genomic resources like Ensembl for contextualizing proteins within whole-genome annotations.
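
A minimal sketch of working with the FASTA downloads mentioned above: one helper builds the per-entry FASTA address on the UniProt REST service, and another parses FASTA text into a dictionary. The function names are hypothetical, the parser assumes well-formed input in which every sequence is preceded by a `>` header, and the record in the usage note is invented for illustration.

```python
from typing import Dict, Iterable, List


def uniprot_fasta_url(accession: str) -> str:
    """Address of a single UniProtKB entry in FASTA format."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header line (without '>') to its concatenated sequence."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            current = line[1:]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)  # sequences may wrap over several lines
    return {header: "".join(parts) for header, parts in chunks.items()}
```

Parsing the two-line wrapped text `">sp|X|EXAMPLE made-up entry\nMKT\nAYI\n"` with `parse_fasta(text.splitlines())` returns a single record whose sequence is `"MKTAYI"`.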

PDBe

The Protein Data Bank in Europe (PDBe), hosted by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), serves as the European portal for the Worldwide Protein Data Bank (wwPDB). As a founding member of the wwPDB established in 2003, PDBe plays a key role in collecting, processing, archiving, and disseminating experimentally determined three-dimensional (3D) structures of biological macromolecules.[99][100] This includes managing depositions through the unified OneDep system and ensuring data quality via rigorous annotation and validation protocols. PDBe specifically handles curation responsibilities for submissions from European and African institutions, processing thousands of entries annually to maintain the integrity of the global archive.[101] The PDBe archive encompasses a diverse array of structural data, primarily derived from X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). These structures cover proteins, nucleic acids, macromolecular complexes, and associated small molecules such as ligands and cofactors. Each entry is accompanied by comprehensive validation reports that assess geometric quality, model-to-data fit, and biological relevance, aiding researchers in interpreting and utilizing the data effectively. As of November 2025, the wwPDB archive, synchronized across PDBe and its partner sites, holds nearly a quarter million such experimentally determined structures, reflecting the rapid growth in structural biology.[102][103] To enhance accessibility and utility, PDBe provides integrated tools for data exploration and analysis. 
The PDBe Knowledge Base (PDBe-KB) aggregates functional and biophysical annotations from multiple specialist resources, offering insights into protein function, evolutionary relationships, ligand interactions, and disease associations for PDB entries.[104] Visualization is supported through the open-source Mol* viewer, which enables interactive 3D rendering, superposition, and analysis of structures directly in web browsers. In 2024, PDBe-KB introduced enhancements for seamless integration of predicted structures from the AlphaFold Database, allowing users to compare experimental and computational models in a unified interface.[105][106] PDBe's contributions significantly impact structural biology and related fields, facilitating drug discovery, protein engineering, rational vaccine design, and fundamental research by providing high-quality, standardized data to a global community. PDBe offers biomolecular structure data essential for rational vaccine design, epitope mapping, and understanding protein structures relevant to peptide/protein vaccines.[107] The wwPDB resources, including PDBe, serve millions of unique users annually, with billions of data downloads underscoring their essential role in advancing biomedical science.[108][109]

AlphaFold Database

The AlphaFold Protein Structure Database is a comprehensive open-access resource developed by the European Bioinformatics Institute (EMBL-EBI) in collaboration with Google DeepMind, providing AI-generated three-dimensional models of proteins to support biological research. Launched in July 2021, the database initially offered predictions for over 365,000 structures across 21 model organism proteomes, marking a significant advancement in making high-accuracy protein structure predictions widely available to the scientific community.[24][110] The database's predictions are generated using the AlphaFold 2 deep learning system, which employs a neural network trained on known protein structures to infer 3D conformations from amino acid sequences. Each model includes a per-residue confidence score known as predicted Local Distance Difference Test (pLDDT), ranging from 0 to 100, where scores above 90 indicate very high reliability and correspond to accurate backbone geometry comparable to experimental methods.[111][112] This scoring system enables users to assess model quality without additional validation, prioritizing regions with high confidence for downstream applications. 
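
The per-residue confidence score described above can be read directly from a downloaded model. The following is a minimal sketch, assuming the model is in PDB format and that pLDDT is stored in the B-factor column of each ATOM record (the convention used by AlphaFold model files). Only the "above 90" threshold comes from the text; the lower band boundaries (70 and 50) and all function names are assumptions added for illustration.

```python
from collections import Counter

# Lower bounds of the confidence bands. Only the >90 band is stated in the
# article; the 70 and 50 cut-offs are assumed here for illustration.
BANDS = [
    (90.0, "very high"),
    (70.0, "confident"),
    (50.0, "low"),
]


def plddt_band(score: float) -> str:
    """Map a pLDDT score (0-100) to a confidence band label."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"pLDDT must lie within 0-100, got {score}")
    for threshold, label in BANDS:
        if score > threshold:
            return label
    return "very low"


def residue_plddt(pdb_text: str) -> dict[tuple[str, int], float]:
    """Read per-residue pLDDT from the B-factor field of C-alpha ATOM records.

    Uses the fixed-column PDB layout: atom name in columns 13-16, chain in
    column 22, residue number in columns 23-26, B-factor in columns 61-66.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            residue_number = int(line[22:26])
            scores[(chain, residue_number)] = float(line[60:66])
    return scores


def band_counts(scores: dict[tuple[str, int], float]) -> dict[str, int]:
    """Count how many residues fall into each confidence band."""
    return dict(Counter(plddt_band(value) for value in scores.values()))
```

Calling `band_counts(residue_plddt(text))` on the contents of a model file gives a quick overview of how much of the chain falls into each band, which can help decide which regions to trust in downstream work.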
Coverage encompasses the complete proteomes of humans and 47 key model organisms relevant to research and global health, such as Escherichia coli, Saccharomyces cerevisiae, and Drosophila melanogaster, alongside predictions for nearly all entries in the UniProt database, the central repository for protein sequences and annotations.[113][114] Integration with UniProt allows seamless linking via unique identifiers, facilitating cross-referencing of sequence data with structural models for enhanced functional annotation.[115]

Access to the database is free and unrestricted under a Creative Commons BY 4.0 license, with bulk downloads available for entire proteomes or subsets like the reviewed Swiss-Prot section of UniProt, and an API enabling programmatic retrieval of metadata and structures.[116][105]

These models have been instrumental in accelerating drug design, such as identifying binding sites for novel therapeutics, and in studying disease mechanisms by revealing previously unknown protein folds.[117][118] Furthermore, the predictions have supported vaccine design by providing accurate models for antigens lacking experimental structures, aiding immunogen development and immunoinformatics.[119]

In 2024, the database was updated to include over 214 million predictions, expanding coverage to align more closely with the full UniProt knowledgebase and incorporating additional data on proteins of global health importance.[105] By late 2025, following a renewed partnership between EMBL-EBI and Google DeepMind, it had grown to encompass approximately 250 million structures, reflecting ongoing synchronization with UniProt's evolving sequence data.[25] Further developments in 2024 integrated support for multimeric complexes and ligand interactions, derived from advancements in the underlying AlphaFold models, enhancing utility for studying protein-protein and protein-small molecule interfaces.[120]

Ethical considerations surrounding the database emphasize responsible use to mitigate potential misuse, such as in designing harmful biomolecules, though analyses indicate that AlphaFold's predictions do not substantially lower barriers to such activities compared to existing experimental techniques.[121] Developers have implemented attribution requirements and licensing terms to promote equitable access while discouraging applications that could pose biosecurity risks.[122]
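
Because entries are linked to UniProt by accession, programmatic retrieval through the API mentioned in this section usually starts from a UniProt identifier. The sketch below is illustrative only: the base URL and the `/prediction/{accession}` path are assumptions about the public API that the article does not spell out, and the function names are hypothetical. The accession pattern is the one published by UniProt.

```python
import json
import re
import urllib.request

# Accession format published by UniProt (6- or 10-character identifiers).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# Assumed base URL of the AlphaFold Database API; check the current
# documentation before relying on it.
API_BASE = "https://alphafold.ebi.ac.uk/api"


def prediction_url(accession: str) -> str:
    """Build the (assumed) metadata URL for one UniProt accession."""
    normalized = accession.strip().upper()
    if not UNIPROT_ACCESSION.fullmatch(normalized):
        raise ValueError(f"not a valid UniProt accession: {accession!r}")
    return f"{API_BASE}/prediction/{normalized}"


def fetch_prediction(accession: str, timeout: float = 30.0):
    """Download and decode the JSON metadata for one accession (network call)."""
    with urllib.request.urlopen(prediction_url(accession), timeout=timeout) as response:
        return json.load(response)
```

Validating the accession before any request is made keeps malformed identifiers from producing confusing HTTP errors; for whole-proteome work the bulk downloads described above are likely a better fit than per-entry requests.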

Bioinformatics Tools

BLAST

The European Bioinformatics Institute (EMBL-EBI) provides a web-based implementation of the Basic Local Alignment Search Tool (BLAST), an algorithm originally developed at the National Center for Biotechnology Information (NCBI) for identifying regions of local similarity between biological sequences, which can reveal functional, structural, or evolutionary relationships. Since the 1990s, EMBL-EBI has hosted this service through its Job Dispatcher framework, offering free, user-friendly access to NCBI BLAST+ software for researchers worldwide, enabling rapid sequence comparisons without local installation.[123] The service supports nucleotide and protein queries against comprehensive databases, including UniProt for proteins and Ensembl for genomic sequences, allowing users to submit sequences in FASTA format or by identifier for similarity searches.[62] Key variants include BLASTN for comparing nucleotide sequences to nucleotide databases, BLASTP for protein-to-protein alignments, and TBLASTN for querying proteins against translated nucleotide databases, with additional options like BLASTX for translated nucleotide queries.[124] These variants facilitate diverse applications, such as identifying homologous genes or predicting protein functions based on sequence conservation.[62] Statistical significance of alignments is evaluated using the expect value (E-value), which estimates the number of hits of similar quality expected by chance in a database of the given size; lower E-values indicate more reliable matches. The E-value is computed as
$$ E = K m n \, e^{-\lambda S} $$
where $ m $ is the length of the query sequence, $ n $ is the effective size of the database, $ S $ is the raw alignment score, and $ K $ and $ \lambda $ are statistical parameters that depend on the scoring matrix and gap penalties used. Database-specific parameters ensure accurate interpretation across different search contexts, such as varying sequence lengths or composition biases.

BLAST at EMBL-EBI enables high-throughput detection of sequence similarities, supporting workflows in genomics, proteomics, and evolutionary biology by quickly scanning large datasets for potential homologs.[62] It integrates with other EMBL-EBI resources, such as UniProt for functional annotation of hits and Ensembl for genomic context, allowing users to chain analyses like retrieving aligned sequences for further visualization or modeling.[62]

Recent enhancements to the underlying Job Dispatcher in 2024 include a redesigned website with interactive result visualizations, streamlined job submission and monitoring, and updated documentation to improve accessibility and performance for large-scale queries.[123] The service processes a substantial volume of searches annually, contributing to the over 100 million jobs handled across EMBL-EBI's sequence analysis tools each year as of 2023.[125]

As a heuristic algorithm, BLAST approximates optimal local alignments to achieve computational efficiency. It trades completeness for speed and can therefore overlook faint similarities that exhaustive dynamic-programming methods such as Smith-Waterman would detect, though it remains highly effective for initial screening in most bioinformatics pipelines.
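
The E-value formula given in this section, E = K·m·n·e^(−λS), can be evaluated directly. The following is a minimal sketch; the values used for K and λ in the usage note are illustrative placeholders rather than parameters taken from the source, and the function names are hypothetical. The bit score shown is the standard normalization S' = (λS − ln K) / ln 2, under which E = m·n·2^(−S').

```python
import math


def evalue(raw_score: float, query_len: int, db_size: int, K: float, lam: float) -> float:
    """Expected number of chance hits scoring at least raw_score: K*m*n*exp(-lambda*S)."""
    return K * query_len * db_size * math.exp(-lam * raw_score)


def bit_score(raw_score: float, K: float, lam: float) -> float:
    """Normalize a raw score so it no longer depends on the scoring system."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def evalue_from_bits(bits: float, query_len: int, db_size: int) -> float:
    """Equivalent E-value computed from the bit score: m*n*2^(-S')."""
    return query_len * db_size * 2.0 ** (-bits)
```

For example, with the illustrative parameters K = 0.041 and λ = 0.267, `evalue(100, 250, 1_000_000, 0.041, 0.267)` and `evalue_from_bits(bit_score(100, 0.041, 0.267), 250, 1_000_000)` return the same number. Doubling the database size doubles the E-value, which is why the same alignment looks less significant in a larger database.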

Clustal Omega

Clustal Omega is a multiple sequence alignment (MSA) program, developed by an international collaboration that included EMBL-EBI staff and offered as a service by the European Bioinformatics Institute (EMBL-EBI), designed for aligning large sets of protein or nucleotide sequences with high accuracy and efficiency. Released in 2011 as a successor to the earlier ClustalW and ClustalX programs, it was created to address the limitations of previous versions in handling very large datasets, enabling alignments of tens of thousands of sequences on standard computing hardware. The tool incorporates techniques such as seeded guide trees and hidden Markov model (HMM) profile-profile alignments to improve both speed and quality, making it a cornerstone of bioinformatics workflows at EMBL-EBI.[126]

The core algorithm of Clustal Omega employs a progressive alignment strategy, beginning with the construction of a guide tree using the mBed method, which embeds sequences into a low-dimensional space for rapid hierarchical clustering. This approach achieves a time complexity of O(N log N) for N sequences during guide tree building, significantly outperforming the O(N²) all-against-all distance calculations required by the traditional guide-tree methods in ClustalW. Sequences are then progressively aligned based on this tree, with optional iterations using HHalign for refining HMM profiles, enhancing accuracy for divergent sequences. The program supports external profile alignments, allowing users to incorporate pre-built HMMs from databases like Pfam to align query sequences against known families.[126][127]

Key features include scalability to thousands or even hundreds of thousands of sequences, with built-in multi-threading for parallel computation of distance matrices and partial parallelization in the progressive alignment phase, enabling efficient processing on multi-core systems. Input sequences can be provided in formats such as FASTA, GenBank, or EMBL, while outputs are available in multiple formats including FASTA, PHYLIP, and Clustal for compatibility with downstream tools. Since its initial release, updates have added support for DNA and RNA alignments, zipped input files, and customizable clustering parameters to optimize performance for very large datasets.[128][127]

Clustal Omega finds wide application in phylogenetics, where its guide trees facilitate evolutionary analysis, and in motif discovery by revealing conserved regions across sequence families. It is integrated into EMBL-EBI's Ensembl platform for genome annotations and comparative genomics, and can process outputs from pairwise similarity searches to build comprehensive alignments. Published benchmarks report accuracy and speed that compare favourably with alternatives like MAFFT and MUSCLE for datasets up to 50,000 sequences, with sum-of-pairs scores around 0.708 on reference alignments. The original describing paper has garnered over 12,000 citations, underscoring its impact in bioinformatics research.[126][127]

Other Analysis Tools

In addition to sequence alignment tools, EMBL-EBI hosts a suite of specialized analysis tools that support diverse aspects of bioinformatics, including protein function prediction, chemical entity annotation, gene expression profiling, pathway mapping, and variant interpretation. These resources enable researchers to perform integrative analyses by querying underlying databases such as UniProt for protein sequences.[129]

InterPro is a key tool for protein domain prediction and functional classification, integrating signatures from multiple databases including Pfam and SMART to identify protein families, domains, and sites. It processes protein sequences to annotate functional elements, aiding in the understanding of protein evolution and interactions. Developed collaboratively and maintained by EMBL-EBI, InterPro is open-source and community-driven, with its 2025 release (version 105.0) incorporating AI-driven improvements for enhanced classification accuracy.[130][131][132]

ChEBI serves as a dictionary for chemical entities of biological interest, providing structured data on small molecules, their roles, and ontologies for cheminformatics applications. Users can search, visualize, and download chemical structures and annotations, facilitating integration with metabolic and drug discovery workflows. As an open-source resource curated by EMBL-EBI, ChEBI's 2025 update (version 2.0) introduced new APIs and data products to improve accessibility and interoperability.[133][134]

The Expression Atlas tool analyzes and visualizes gene and protein expression patterns across species, tissues, and conditions, drawing from RNA-seq, microarray, and proteomics datasets. It supports differential expression analysis and baseline expression queries, helping researchers explore regulatory mechanisms. Maintained as an open, community-contributed resource by EMBL-EBI, it receives regular updates with new datasets and features for single-cell analysis.[135][136]

For pathway analysis, Reactome provides an open-source database and toolset for visualizing, interpreting, and analyzing biological pathways, including signal transduction and metabolic processes. Its enrichment analysis functionality identifies overrepresented pathways in gene lists from high-throughput experiments. Developed through international collaboration and hosted by EMBL-EBI, Reactome emphasizes peer-reviewed curation and supports programmatic access for advanced users.[137][138]

The Variant Effect Predictor (VEP) is a tool for interpreting the functional consequences of genetic variants, predicting impacts on transcripts, proteins, and regulatory regions while incorporating allele frequencies. It processes variant lists to prioritize clinically relevant changes, essential for genomic variant annotation. As an open-source, community-enhanced tool from EMBL-EBI, VEP offers flexible configurations for large-scale analyses.[139][68]

These tools are accessible via intuitive web interfaces, RESTful APIs for programmatic integration, and downloadable software, with many embedded in the Galaxy workflow platform to streamline multi-step analyses. Overall, EMBL-EBI's analysis tools facilitate integrative bioinformatics by combining diverse data types, supporting over 100 million daily web and API requests across resources as of 2025.[129][140][141]
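
The pathway enrichment analysis mentioned for Reactome asks whether a gene list contains more members of a pathway than chance would predict. One common way to answer that question is a one-sided hypergeometric test, sketched below. The source does not state which statistical test Reactome itself applies, so this is an illustration of the general over-representation idea rather than a description of that tool; the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_pvalue(population: int, pathway_size: int, sample_size: int, hits: int) -> float:
    """Probability of seeing at least `hits` pathway genes in a random gene list.

    population   -- number of genes that could have been selected
    pathway_size -- number of those genes annotated to the pathway
    sample_size  -- number of genes in the submitted list
    hits         -- number of submitted genes that belong to the pathway
    """
    if not 0 <= pathway_size <= population or not 0 <= sample_size <= population:
        raise ValueError("pathway_size and sample_size must lie within the population")
    if hits < 0:
        raise ValueError("hits must be non-negative")
    total = comb(population, sample_size)
    upper = min(pathway_size, sample_size)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(pathway_size, i) * comb(population - pathway_size, sample_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total
```

In practice the test is repeated for every pathway, so the resulting p-values need a multiple-testing correction before they are interpreted.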

Research and Training

Research Programs

The European Bioinformatics Institute (EMBL-EBI) conducts computational research to address key biological challenges, aligning with EMBL's 2022–2026 programme, "Molecules to Ecosystems," which emphasizes understanding life in context through molecular mechanisms and ecosystem interactions.[26][142] This programme integrates EMBL-EBI's efforts in data science to explore microbial communities, human health, and environmental dynamics, fostering interdisciplinary approaches to accelerate discoveries in biology.[143]

EMBL-EBI's research focuses on advancing AI applications in biology, including work that extends beyond protein structure prediction tools such as AlphaFold, to enable predictive modeling of biomolecular functions and interactions.[142] In single-cell genomics, EMBL-EBI maintains resources like the Single Cell Expression Atlas for analyzing immune cell responses and lineages, supporting studies in immunology and disease.[144] Microbiome analysis targets ecosystem-level insights, with the Microbiome Informatics team curating sequence data to annotate microbial diversity and functional roles in environmental and host contexts.[145]

Key projects include the COVID-19 Data Portal, launched in 2020 and operational through 2025, which aggregates SARS-CoV-2 datasets for global research on viral evolution and therapeutics.[146][147] Sustainable bioinformatics efforts address climate impacts by providing open data resources for biodiversity monitoring, such as genomic sequences aiding species interaction studies and ecosystem resilience assessments.[148]

Methodologies emphasize machine learning for integrating multi-omics data, enhancing pattern recognition across genomic, proteomic, and phenotypic datasets, alongside computational simulations to model molecular dynamics in biological systems.[59][142] Research outputs encompass high-impact peer-reviewed publications, with EMBL-EBI researchers contributing to hundreds annually, alongside software tools like those integrated into ELIXIR's OpenEBench platform for benchmarking bioinformatics methods.[149][150]

Collaborations drive innovation, including the October 2025 renewal of the partnership with Google DeepMind for AlphaFold updates and EU consortia like the Federated European Genome-phenome Archive, highlighted in 2025 for advancing secure genomic medicine applications.[151][152][153]

Training Initiatives

The European Bioinformatics Institute (EMBL-EBI) delivers a comprehensive training programme designed to equip scientists worldwide with bioinformatics skills, emphasizing free and accessible learning opportunities. Central to this effort is Train Online, an e-learning platform offering on-demand tutorials and live webinars on EMBL-EBI's core resources, such as Ensembl and UniProt, with content tailored for users from beginners to advanced levels. Hands-on workshops and virtual events further support practical application, including sessions on tools like Ensembl for genomic data analysis. The programme reaches tens of thousands of participants annually, with over 67,000 users from 159 countries engaging in the 2024 AlphaFold online course alone, and broader EMBL training attracting 8,246 participants from 101 countries that year.[154][155] All offerings are provided at no cost, often culminating in certificates of completion to recognize skill acquisition.[156]

The curriculum spans key areas of modern bioinformatics, including data analysis techniques for next-generation sequencing, adherence to FAIR (Findable, Accessible, Interoperable, Reusable) principles for data management, and applications of artificial intelligence in biology, such as protein structure prediction.[59] Materials are developed to foster conceptual understanding and hands-on proficiency, with examples like introductory modules on bioinformatics fundamentals and advanced topics in machine learning for life sciences.[156]

EMBL-EBI's training team collaborates closely with the ELIXIR infrastructure to form a pan-European network, coordinating national efforts and integrating resources into a unified platform for bioinformatics education.[157] Looking ahead, expansions in 2025 include the launch of Ada, an AI-driven assistant to enhance personalized learning, alongside initiatives like the Human Ecosystems Retreat focused on modeling biological ecosystems through metagenomics and data integration.[158][159]

These initiatives significantly build global capacity, particularly in low-resource regions, by enabling remote access and partnering with international networks to support underrepresented scientists.[160] A 2024 user survey revealed that 89% of respondents credited EMBL-EBI resources and training with enabling research that would otherwise be impossible, underscoring substantial skill improvement and broader scientific impact.[154]

References
