DNA Data Bank of Japan
The DNA Data Bank of Japan (DDBJ) is a public repository for nucleotide sequence data, established in 1987 at the National Institute of Genetics (NIG) in Mishima, Shizuoka, Japan, as the sole such archive in Asia. As a founding member of the International Nucleotide Sequence Database Collaboration (INSDC), DDBJ collaborates with the National Center for Biotechnology Information (NCBI) in the United States and the European Nucleotide Archive (ENA) at the European Bioinformatics Institute (EBI) in the United Kingdom to ensure synchronized, non-redundant global distribution of annotated DNA and RNA sequences. It accepts submissions primarily from Japanese and Asian researchers, issues unique accession numbers, and provides free public access to the data via web interfaces, FTP, and integrated search tools.

DDBJ's core mission is to serve as a foundational public resource for the life sciences by archiving high-quality, annotated sequences and related metadata. Since its inception, it has expanded beyond traditional archiving to manage additional INSDC components, including the DDBJ Sequence Read Archive (DRA) for raw sequencing reads since 2009, as well as the BioProject and BioSample databases for project and sample metadata. In 2023, DDBJ processed 5,822 sequence submissions (74.5% domestic) and 94,299 DRA runs. In the June 2024 release, which encompassed 4.56 billion sequences and 31.88 trillion base pairs globally, DDBJ contributed 4.21% of the INSDC's total sequences and 2.22% of its base pairs.

Beyond its INSDC duties, the DDBJ at NIG operates specialized databases such as the Japanese Genotype-phenotype Archive (JGA) for controlled-access human genomic data since 2013, in partnership with the National Bioscience Database Center (NBDC) and international equivalents like dbGaP and EGA, and the Genomic Expression Archive (GEA) for functional genomics data. It also maintains MetaboBank for metabolomics datasets, renewed in 2021 to support MAGE-TAB formatting, and provides access to the NIG supercomputer for over 1,700 Japanese investigators annually to analyze large-scale data.

Collaborations extend to regional partners like the Korea Bioinformation Center (KOBIC) and to several patent offices, facilitating data exchange. By August 2024, DRA holdings reached 19.4 petabytes of distributed data, underscoring DDBJ's role in handling the exponential growth of sequencing output. Recent advancements include the DDBJ Group Cloud service for sharing pre-publication data and the TogoVar-repository for human genetic variants, launched in 2024.
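To give the quoted shares a sense of scale, the arithmetic behind the June 2024 figures can be sketched as below. The constants are the rounded values stated above, so the results are approximations; the variable and function names are illustrative only.

```python
# Rounded figures quoted above for the June 2024 INSDC release.
TOTAL_SEQUENCES = 4.56e9      # sequences in the whole INSDC release
TOTAL_BASES = 31.88e12        # base pairs in the whole INSDC release
DDBJ_SEQUENCE_SHARE = 0.0421  # 4.21% of sequences
DDBJ_BASE_SHARE = 0.0222      # 2.22% of base pairs


def contribution(total: float, share: float) -> float:
    """Return the absolute amount corresponding to a fractional share."""
    return total * share


ddbj_sequences = contribution(TOTAL_SEQUENCES, DDBJ_SEQUENCE_SHARE)
ddbj_bases = contribution(TOTAL_BASES, DDBJ_BASE_SHARE)

print(f"DDBJ-submitted sequences: about {ddbj_sequences / 1e6:.0f} million")
print(f"DDBJ-submitted bases: about {ddbj_bases / 1e12:.2f} trillion")
```

With these inputs, the shares correspond to roughly 192 million sequences and about 0.71 trillion base pairs.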

History

Founding and Establishment

The DNA Data Bank of Japan (DDBJ) was established in 1987 at the National Institute of Genetics (NIG) in Mishima, Shizuoka, to manage the burgeoning volume of sequence data generated by advancing genomic research. This initiative responded to the exponential growth in sequence data that followed the technological developments of the mid-1970s, particularly after the 1975 Asilomar Conference, whose safety guidelines accelerated research and data production at a time when no centralized archival system existed, overwhelming existing efforts in Europe and the United States.

Preparatory activities commenced in 1986 through an informal collaboration with the European Molecular Biology Laboratory (EMBL) Data Library and GenBank, focusing on data exchange protocols via magnetic tapes to avoid duplication and ensure global accessibility. DDBJ formally joined this partnership in 1987, forming the foundational framework for the International Nucleotide Sequence Database Collaboration (INSDC). The collaboration addressed the practical challenges of synchronizing databases amid rapidly expanding sequence submissions, with DDBJ emphasizing contributions from Asian researchers while integrating international standards.

DDBJ's inaugural data release occurred in 1987, comprising 66 entries totaling 108,970 base pairs, and introduced an accession numbering system to uniquely identify sequences and facilitate cross-database tracking. Initial operations were supported by funding from Japan's Ministry of Education, which covered staff and computational infrastructure costs, though resource limitations initially slowed progress. This government-backed investment prioritized bioinformatics development to sustain Japan's role in global genomics.

Key Milestones and Developments

A key milestone occurred in 2017, when DDBJ marked the 30th anniversary of its 1987 founding with an international symposium at the National Institute of Genetics titled "Life, Environment, and Evolution Revealed by Genomes," which underscored the organization's pivotal role in global genomics. By this time, DDBJ's periodical release statistics reflected substantial growth, with total sequence entries surpassing 874 million and base pairs exceeding 2.4 trillion, demonstrating the exponential increase in submitted data driven by advances in sequencing technologies. Data volume milestones trace back to the mid-1990s, when DDBJ's releases reached approximately 1 million entries by 1995 amid rising international collaborations, laying the foundation for its expansion as Asia's primary repository.

The 2022 DDBJ update report highlighted the acceptance of new data types such as metagenome-assembled genomes (MAGs), evidenced by a 128% surge in submissions in 2021, including MAGs from the OceanDNA catalog with over 50,000 prokaryotic genomes from marine environments. In response to the COVID-19 pandemic, DDBJ intensified data sharing efforts by rapidly releasing viral genome sequences in collaboration with its International Nucleotide Sequence Database Collaboration partners, facilitating global research and surveillance.

In 2023, DDBJ continued enhancements to MetaboBank for metabolomics data and expanded its collaborations. The 2024 update introduced the DDBJ Group Cloud service, enabling secure sharing of pre-publication data among collaborators. As of June 2025, DDBJ Release 138 reflected continued growth in sequence submissions and database expansions.

Organization and Governance

Location and Infrastructure

The DNA Data Bank of Japan (DDBJ) is primarily located at the National Institute of Genetics (NIG) in Mishima, Shizuoka, Japan, with its main address at 1111 Yata, Mishima, Shizuoka 411-8540. As part of the Research Organization of Information and Systems (ROIS) since 2004, the DDBJ operates within NIG's facilities, which support its role in nucleotide sequence data management. This integration enhances collaborative research infrastructure across ROIS institutions.

DDBJ's operational infrastructure includes high-performance computing resources, such as the NIG supercomputer system, which provides cluster-based computation for the analysis of large-scale genomic datasets. The supercomputer, upgraded to its sixth phase in 2025, serves as a core platform for building archival databases and handling workloads such as next-generation sequencing analysis. Secure data centers at NIG ensure long-term archival storage of sequences, aligning with the need for reliable preservation in a high-volume environment.

NIG facilities supporting DDBJ include venues for international events, such as the symposium halls used for the annual NIG International Symposium, which brings together researchers from around the world. These spaces accommodate visitors and promote international collaboration in genetics research. DDBJ employs backup and redundancy systems compliant with International Nucleotide Sequence Database Collaboration (INSDC) standards, ensuring synchronized data exchange and preservation across partner institutions like NCBI and EMBL-EBI.

In the 2010s, DDBJ expanded its capabilities through cloud integration, notably via the DDBJ Read Annotation Pipeline launched in 2013, which leverages cloud computing for scalable high-throughput analysis of sequencing data. This development, building on earlier DDBJ initiatives from 2011, allows efficient processing without relying solely on local hardware, addressing the growing demands of genomic data volume.

Leadership and Structure

The Bioinformation and DDBJ Center, which encompasses the DNA Data Bank of Japan (DDBJ), is headed by Masanori Arita as of 2025. He oversees operations and reports to the director-general of the National Institute of Genetics (NIG) within the Research Organization of Information and Systems (ROIS). This leadership structure ensures alignment with NIG's broader mission in genetics research and bioinformatics infrastructure. Arita's role involves coordinating data management, international collaborations, and computational resources to support global nucleotide sequence archiving.

The center's administrative organization features specialized divisions focused on core functions. The Database Division, led by Takatomo Fujisawa, handles data management and public release. Complementary divisions include International Affairs under Yasukazu Nakamura for global data exchange and standards, Internal Affairs directed by Tomoya Tanjo for operational and metadata coordination, and a division headed by Osamu Ogasawara for bioinformatics research and analysis support. These units collectively manage sequence data alongside broader metadata resources, with staff comprising curators, bioinformaticians developing analytical tools, and IT specialists maintaining infrastructure.

Governance includes advisory mechanisms to uphold ethical standards and international alignment. Externally, DDBJ engages through the INSDC coordination group to synchronize policies and data sharing with partners like NCBI and ENA. The DNA Database Advisory Committee, an independent NIG panel, provides expert review on operational and strategic matters.

The structure evolved significantly following NIG's 2019 reorganization, when the DDBJ Center was renamed the Bioinformation and DDBJ Center to reflect its expanded scope. The reorganization integrated the data handling and metadata teams, enhancing support for high-throughput sequencing and for data types beyond the traditional archives. This adaptation built on prior developments, such as an earlier DDBJ archive launched in 2011, to accommodate growing data volumes from next-generation technologies.

Core Functions and Services

Data Submission and Collection

Researchers submit DNA sequences and related data to the DNA Data Bank of Japan (DDBJ) through dedicated portals designed to facilitate efficient data intake while ensuring quality and compliance with international standards. The primary web-based tool is the DDBJ Nucleotide Sequence Submission System (NSSS), an interactive platform that allows users to enter assembled sequences along with the required annotations via a web form. For larger-scale submissions, the Mass Submission System (MSS) enables the upload of text files containing multiple sequences and annotations, supporting high-volume data such as complete genomes. Specialized tools like the DDBJ Fast Annotation and Submission Tool (DFAST) assist with annotation and preparation for MSS submission, streamlining the process for prokaryotic genomes.

DDBJ operates under policies that promote open science, providing free and unrestricted public access to all submitted data in alignment with the International Nucleotide Sequence Database Collaboration (INSDC) guidelines. Submissions must include mandatory annotations, such as coding sequence (CDS), ribosomal RNA (rRNA), and transfer RNA (tRNA) features, to be accepted; unannotated raw reads are directed to the Sequence Read Archive (SRA) instead. Upon successful validation, DDBJ issues unique accession numbers in the standard INSDC format, consisting of an alphabetic prefix followed by digits (e.g., the two-letter prefix of AB123456 for standard entries or the four-letter prefix of AAAA01000001 for whole genome shotgun projects), ensuring permanent and internationally recognized identifiers for each record.

The supported data types encompass a wide range of sequences, including individual genes, complete genomes, metagenomes, and transcriptomes, with a particular emphasis on contributions from Japanese and Asian research communities. For instance, metagenome assemblies are accepted through coordinated submissions involving BioProject, BioSample, SRA, and DDBJ, allowing analysis of uncultured microbial communities.

Validation occurs in multiple stages. Initial automated format checks use tools like Utilities for MSS file Error check (UME) to verify file structure, sequence integrity, and syntax. Curator-led reviews then check biological consistency, such as organism taxonomy, feature locations, and adherence to INSDC standards. Only validated submissions proceed to archiving and accession assignment.

In recent years, DDBJ has seen robust growth in submissions. In 2023, for example, DDBJ processed 5,822 submissions, with 74.5% originating from domestic Japanese groups and the remainder from international contributors. This volume contributes to the overall INSDC database, where DDBJ accounts for about 4-5% of total sequences, reflecting a scale of thousands of submissions annually handled under rigorous intake processes.
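The accession formats mentioned above can be illustrated with a small classifier. This is a minimal sketch, not DDBJ's validator: the source only says that an alphabetic prefix is followed by digits, so the exact letter and digit counts in the patterns below are assumptions based on the commonly published INSDC prefix conventions, and the function name is hypothetical.

```python
import re

# Assumed layouts (not stated in the text above):
#   standard entries: 1 letter + 5 digits, 2 letters + 6 digits, or 2 letters + 8 digits
#   WGS entries:      4 letters + 2-digit version + 6 or more digits,
#                     or 6 letters + 2-digit version + 7 or more digits
PATTERNS = {
    "nucleotide": re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})$"),
    "wgs": re.compile(r"^(?:[A-Z]{4}\d{2}\d{6,}|[A-Z]{6}\d{2}\d{7,})$"),
}


def classify_accession(accession: str):
    """Return 'nucleotide', 'wgs', or None for an unrecognized string."""
    # Normalize case and drop an optional sequence version suffix such as ".1".
    base = accession.strip().upper().split(".", 1)[0]
    for kind, pattern in PATTERNS.items():
        if pattern.match(base):
            return kind
    return None


for example in ("AB123456", "AAAA01000001", "AB123456.1", "not-an-accession"):
    print(example, "->", classify_accession(example))
```

A check like this only confirms that a string has a plausible shape; whether the record actually exists can only be confirmed against the database itself.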

Annotation and Curation Processes

The DNA Data Bank of Japan (DDBJ) employs standardized annotation guidelines to ensure the accuracy and interoperability of sequence data, primarily through the adoption of the International Nucleotide Sequence Database Collaboration (INSDC) feature table format. This format uses controlled vocabularies for feature keys and for qualifiers such as /gene, /locus_tag, and /product, which denote functional elements. For instance, the "CDS" feature delineates protein-coding sequences with locations running from the initiation codon to the termination codon. These guidelines mandate the use of precise, case-insensitive terms to avoid ambiguity, with appendices defining vocabularies for base codes, genetic code tables, and modified residues.

Integration of ontologies enhances annotation depth, particularly through the /db_xref qualifier, which links features to external resources including Gene Ontology (GO) terms for molecular function, biological process, and cellular component descriptions (e.g., /db_xref="GO:0003677" for DNA binding). This allows submitters to reference GO identifiers directly within CDS or gene features, promoting semantic consistency across databases. DDBJ's guidelines emphasize minimal yet comprehensive descriptions, requiring at least a source feature with organism taxonomy and optional qualifiers for strain or isolate details, while encouraging evidence-based annotations to support downstream analyses.

The curation workflow at DDBJ combines automated validation with manual review to refine submitted data. Upon submission, automated tools first parse files for syntactic compliance, checking feature locations, qualifier formats, and translation integrity for CDS features to detect inconsistencies like frame shifts or invalid stops. Subsequent similarity checks against existing DDBJ entries identify potential duplicates or overlaps, flagging entries for curator intervention. Curators perform manual reviews of complex annotations, verifying biological plausibility, such as gene-exon alignments in eukaryotic submissions, and resolving ambiguities through correspondence with submitters. This hybrid approach has evolved toward greater automation, reducing processing time while maintaining quality, as evidenced by validators that auto-correct minor format issues before human oversight.

DDBJ has developed specialized tools to support annotation and curation, including the All-round Retrieval of Sequence and Annotation (ARSA) system, which enables efficient searching of reference sequences and annotations to guide manual feature assignment. ARSA facilitates quick retrieval by keywords or similarity, helping curators identify homologous regions for consistent annotation. Complementing this, the DDBJ Sequence Read Archive (DRA) handles raw sequencing reads together with their metadata, archiving alignment information and experiment details in standardized formats like XML, which undergo curation before integration with assembled sequences. Additional utilities, such as Parser for format validation and TransChecker for CDS translation verification, automate preliminary checks.

Error correction and updates follow rigorous protocols to preserve data integrity, with version tracking ensuring traceability of changes. Submitters can request corrections via updated submissions, which generate new entry versions while retaining historical ones accessible through accession numbers; retracted or erroneous records are marked as suppressed in public releases but archived indefinitely for audit purposes. Curators handle retractions by notifying collaborators in the INSDC network, preventing propagation of errors across NCBI and EMBL-EBI, and apply fixes such as sequence revisions or feature relocations only after verification. This system supports ongoing maintenance, with periodic releases reflecting curated updates that eliminate inaccuracies without data loss.

DDBJ ensures compliance with INSDC standards to facilitate global data exchange, enforcing minimal requirements such as a defined locus name, a source feature, and basic qualifiers. For genome assemblies, submissions must include assembly-level metadata like contig counts and gap descriptions, adhering to INSDC guidelines for structured reporting via AGP files. These standards, updated collaboratively, require evidence qualifiers (e.g., /evidence="experimental") and prohibit unsubstantiated claims, with DDBJ validators enforcing them during curation to align with partner databases.
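The feature key and qualifier layout described in this section can be made concrete with a small parser. This is a simplified sketch for illustration, not a DDBJ tool: it assumes each feature key, location, and qualifier fits on a single line, ignores continuation lines and column positions, and uses an invented sample record and a hypothetical function name.

```python
# Invented sample in the general feature-table style: a feature key and
# location, followed by one qualifier per line.
SAMPLE = """\
     source          1..300
                     /organism="Escherichia coli"
     CDS             1..300
                     /gene="abcA"
                     /product="hypothetical protein"
"""


def parse_feature_table(text: str) -> list:
    """Parse feature lines into dicts with key, location, and qualifiers."""
    features = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("/"):
            # A qualifier always belongs to the most recent feature key.
            if not features:
                raise ValueError("qualifier found before any feature key")
            name, _, value = line[1:].partition("=")
            # Qualifiers without a value are stored as True.
            features[-1]["qualifiers"][name] = value.strip('"') if value else True
        else:
            key, location = line.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
    return features


for feature in parse_feature_table(SAMPLE):
    print(feature["key"], feature["location"], feature["qualifiers"])
```

A syntactic pass of this kind corresponds to the first, automated stage of curation; the biological plausibility checks described above still require a curator.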

Databases and Resources

Primary Nucleotide Sequence Database

The Primary Nucleotide Sequence Database of the DNA Data Bank of Japan (DDBJ) serves as the core archival repository for annotated nucleotide sequences, encompassing a wide range of data types such as complete genomes, expressed sequence tags (ESTs), complementary DNAs (cDNAs), and various RNA sequences, including messenger RNA (mRNA), ribosomal RNA (rRNA), and non-coding RNAs. As a member of the International Nucleotide Sequence Database Collaboration (INSDC), DDBJ maintains a comprehensive, non-redundant collection of these sequences submitted by researchers worldwide, ensuring global accessibility and synchronization with partner databases like GenBank and the European Nucleotide Archive (ENA). The database emphasizes high-quality annotations that describe biological features, such as gene locations, protein products, and functional elements, adhering to standardized INSDC feature table definitions.

As of September 2025, the DDBJ nucleotide database (Release 139.0) mirrors the full INSDC archive, containing 5,925,566,790 assembled and annotated entries comprising 47,673,721,557,094 bases, with daily updates to reflect new submissions and incorporate exchanges from INSDC partners. The majority of entries derive from high-throughput sequencing projects, including metagenomes and transcriptomes, while the archive retains historical records dating back to 1987.

Users can retrieve individual entries or batches of data using the GETENTRY tool, which allows querying by accession number and supports programmatic access for efficient bulk downloads. For visual exploration of genomic regions, a browser interface provides an interactive, graphical view of annotations, alignments, and sequence features. Additionally, flat file downloads of the entire release or subsets are available via FTP, enabling offline analysis.

Search functionalities include text-based queries through the DDBJ Search portal, which supports advanced filtering by keywords, organism taxonomy, authors, or publication details across the archive. For sequence similarity analysis, DDBJ integrates BLAST (Basic Local Alignment Search Tool) services, allowing users to perform nucleotide-to-nucleotide or translated searches against the database to identify homologous sequences.

Data are exported in multiple formats to accommodate diverse applications: the standard DDBJ flat file format (which closely resembles the GenBank format), with detailed headers, feature tables, and sequence lines in human-readable form; FASTA format for simplified sequence-only retrieval; and XML (specifically INSD-XML) for structured, machine-readable representation that facilitates programmatic parsing and integration with bioinformatics pipelines. These formats ensure compatibility with global standards while preserving the annotations produced by DDBJ's curation processes.
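Of the export formats listed above, FASTA is the simplest to handle programmatically. The following is a minimal sketch of a FASTA reader using only the Python standard library; the sample records and the function name are made up for illustration, and real pipelines would more likely rely on an established parser.

```python
# Invented two-record sample; the identifiers are placeholders, not real entries.
SAMPLE_FASTA = """\
>seq1 example entry
ACGTACGT
ACGT
>seq2 another example
GGGCCC
"""


def parse_fasta(text: str) -> dict:
    """Map each record identifier (first word of the header) to its sequence."""
    records = {}
    identifier = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if identifier is not None:
                records[identifier] = "".join(chunks)
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            chunks = []
        elif identifier is None:
            raise ValueError("sequence data found before the first header line")
        else:
            chunks.append(line)
    if identifier is not None:
        records[identifier] = "".join(chunks)
    return records


for name, sequence in parse_fasta(SAMPLE_FASTA).items():
    print(name, len(sequence))
```

Because FASTA carries no feature table, annotations have to be taken from the flat file or INSD-XML exports instead.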

Specialized and Supporting Databases

The DNA Data Bank of Japan (DDBJ) maintains several specialized databases that complement its primary nucleotide sequence repository by archiving raw data and metadata essential for comprehensive biological research. These resources support the storage and accessibility of high-throughput sequencing outputs, experimental metadata, and phenotype-linked genetic data, ensuring integration across global standards.

The DDBJ Sequence Read Archive (DRA) serves as the primary repository for raw sequencing reads and alignment data from high-throughput technologies, including formats compatible with the Sequence Read Archive (SRA) schema. Established to handle the surge in next-generation sequencing data, DRA archives petabyte-scale datasets, enabling researchers to access unprocessed reads for reanalysis and validation of genomic studies. Its features include support for diverse sequencing platforms and mandatory linkage to associated metadata, facilitating traceability in large-scale projects.

Complementing DRA, the Genomic Expression Archive (GEA), formerly known as the DDBJ Omics Archive (DOR), is a public repository for functional genomics data, encompassing gene expression profiles, epigenomic modifications, and SNP genotyping arrays from microarray and sequencing experiments. Launched in 2018 to succeed DOR, GEA adheres to MIAME-compliant standards for data submission, promoting standardized archiving of quantitative datasets that reveal regulatory mechanisms in biological systems. It emphasizes integration with Japanese research initiatives, hosting collections from domestic multi-omics studies.

DDBJ also curates supporting metadata databases in collaboration with international partners, including BioProject and BioSample, which catalog experimental designs and biological sample descriptions, respectively. BioProject assigns unique identifiers to sequencing projects, while BioSample provides detailed attributes such as collection conditions, giving the context needed for data interpretation. These resources synchronize daily with their counterparts at the European Nucleotide Archive (ENA) and NCBI, maintaining a unified global metadata framework.

For legacy Sanger sequencing, DDBJ formerly contributed to the INSDC-shared Trace Archive of chromatogram traces, base calls, and quality scores from single-pass reads in large-scale projects. These data are now preserved in DRA, allowing verification of early genomic assemblies and supporting comparative analyses with modern high-throughput outputs.

A distinctive Japanese-focused resource is the Japanese Genotype-phenotype Archive (JGA), a controlled-access database for individual-level genetic variants and associated phenotypic information, particularly from population studies involving Japanese cohorts. JGA ensures ethical data sharing through review by the National Bioscience Database Center (NBDC), granting access only to approved researchers. It hosts datasets on population-specific genomic variation, such as variants linked to disease susceptibility in Japanese populations, without compromising participant privacy.

MetaboBank is a public repository for metabolomics data obtained from mass spectrometry (MS), nuclear magnetic resonance (NMR), and imaging MS techniques. It supports standardized submissions in mzTab-M format and was redesigned in 2023 to enhance metadata handling and compatibility with global metabolomics standards.

These specialized databases integrate with external curated resources, such as NCBI's RefSeq, by providing hyperlinks from DDBJ entries to non-redundant reference sequences, enhancing cross-database navigation for researchers. Overall, DDBJ's supporting infrastructure underscores its role in managing diverse data types within the International Nucleotide Sequence Database Collaboration (INSDC).

International Collaborations

Role in the International Nucleotide Sequence Database Collaboration

The International Nucleotide Sequence Database Collaboration (INSDC) originated from a tripartite agreement formalized in 1987 among the DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA, formerly the EMBL Nucleotide Sequence Database), and GenBank at the National Center for Biotechnology Information (NCBI), with the goal of establishing non-redundant, comprehensive global coverage of nucleotide sequence data through coordinated archiving efforts. As the Asian partner in this collaboration, DDBJ serves as the primary repository for submissions from researchers in the Asian and Pacific regions, thereby ensuring equitable representation of international scientific contributions and preventing data silos across continents.

A cornerstone of DDBJ's role in INSDC is the daily synchronization of data via flat file exchanges with GenBank and ENA, which guarantees that newly released sequences and annotations are identically mirrored across all three databases within a short timeframe, typically one to two days. This process relies on shared standards, such as the common feature table format, enabling a unified global resource for researchers worldwide. DDBJ's adherence to this exchange protocol has been integral since its inception, supporting the INSDC's mission of preserving and disseminating public-domain data without duplication.

DDBJ plays a key role in shaping INSDC-wide policies, including standardized accessioning practices that assign unique, stable identifiers to sequences upon submission, ensuring citability. It also contributes to privacy protections, such as implementing controlled-access mechanisms for human data to safeguard sensitive genomic information while complying with ethical guidelines like those from the Global Alliance for Genomics and Health. Additionally, DDBJ helps formulate joint release policies, including "hold-until-published" options that allow submitters to delay public availability until the associated work is published, balanced against timely access for the broader community. These policies are developed collaboratively to maintain fairness and compliance across the partnership.

DDBJ actively participates in INSDC's annual coordination meetings, where it works with ENA and NCBI representatives to address operational challenges, refine data models, and incorporate technological updates, such as enhancements for metagenomic submissions. For example, DDBJ hosted the 36th INSDC meeting in Mishima, Japan, in May 2023, facilitating discussions on improving data integration standards. DDBJ also participated in the 37th INSDC meeting in June 2024 in Bethesda, USA, and the 38th in June 2025 in the UK, continuing discussions on data standards and updates. These gatherings ensure ongoing alignment and adaptability within the collaboration.

Through its INSDC contributions, DDBJ has helped build a robust global archive, accounting for approximately 4-5% of total sequences in recent releases (e.g., 4.21% in the June 2024 release), with a notable emphasis on viral and microbial sequences that reflect regional priorities. This share underscores DDBJ's impact in diversifying the database with data from underrepresented geographic areas, enhancing the INSDC's overall comprehensiveness for fields such as infectious disease research.

Partnerships with Global Institutions

DDBJ maintains bilateral agreements with the National Center for Biotechnology Information (NCBI) in the United States, supporting joint workshops on bioinformatics and data management to foster knowledge exchange among researchers. Similarly, DDBJ collaborates with the European Bioinformatics Institute (EBI) on tool development initiatives, enabling the creation and refinement of software that benefits global users. These efforts complement the core data exchange framework while addressing specific regional and technical needs in bioinformatics infrastructure.

In Asian networks, DDBJ has forged partnerships with the National Genomics Data Center (NGDC) in China and the Korean Bioinformation Center (KOBIC) in Korea to promote regional data sharing and collaborative projects. A key example is the memorandum of understanding (MOU) signed with KOBIC in 2021, which facilitates personnel exchanges, joint research, and educational programs for depositing and analyzing data from national initiatives. DDBJ and NGDC co-organize the annual Asian Bioinformatics Collaboration (ABC) Symposium, alongside KOBIC, to discuss advancements in bioinformatics across Asia. These networks enhance cross-border access to genomic resources and support harmonized standards for regional research.

DDBJ accepts raw read data into its Sequence Read Archive (DRA) from various next-generation sequencing platforms, including those from providers like Illumina. This integration supports reproducibility in large-scale genomic studies by standardizing data formats and metadata for datasets generated on commercial platforms.

Educational initiatives by DDBJ include bioinformatics training programs conducted in partnership with universities in Japan and abroad through international workshops. Notable past examples include the Japan-Korea-China Bioinformatics Training Course (2002–2014), which provided hands-on instruction in sequence data submission and database utilization for participants from the three countries. These programs aim to build bioinformatics skills among early-career researchers and promote global standards in data handling.

Funding partnerships support DDBJ-related activities through joint grants from the Japan Society for the Promotion of Science (JSPS) and international bodies like the U.S. National Science Foundation (NSF). The JSPS-NSF Partnerships program funds collaborative projects between Japanese and U.S. researchers, often involving DDBJ resources for data archiving and analysis in areas such as multi-omics integration. Additionally, JSPS's KAKENHI grants, including those under the Platform for Advanced Genome Science (PAGS), provide resources for bioinformatics tool development and large-scale data analysis.

Research and Innovations

Information Biology Initiatives

Information biology at the DNA Data Bank of Japan (DDBJ) integrates computational methods with the biological sciences to analyze and interpret large-scale genomic data. Its initiatives expanded formally through the Center for Information Biology, established in 1995, and have continued since 2000 under what is now the Bioinformation and DDBJ Center. This scope emphasizes using nucleotide sequence archives to advance data-driven biological insights, and it differs from pure informatics in prioritizing biological applications such as gene regulation and evolutionary patterns.

Key projects include the development of ontologies for sequence annotation, such as the DDBJ annotated sequence ontology, which semantically represents genomic features in RDF format to enhance annotation accuracy. Additionally, DDBJ has initiated AI-driven activities, exemplified by the DDBJ Data Analysis Challenge, a competition that uses DDBJ resources to predict features and gene functions from DNA sequences.

Research outputs from these initiatives include publications on data-driven research, such as phylogenomic analyses of bacterial genera based on DDBJ-deposited genomes. For instance, studies have used DDBJ data to propose refined phylogenies, revealing insights into strain-specific adaptations.

DDBJ supports training and education through workshops and programs like D-STEP (DDBJ Supercomputer Training & Educational Program), which teach researchers to apply DDBJ resources for hypothesis generation in evolutionary studies. These sessions, often held in collaboration with international partners, focus on practical bioinformatics skills for biological discovery.

Ethical considerations in DDBJ's information biology research address data privacy in genomic studies via the Japanese Genotype-phenotype Archive (JGA), which implements controlled-access policies aligned with the NBDC Human Data Sharing Guidelines to protect participant rights while enabling secure sharing. These standards ensure compliance with Japan's Ethical Guidelines for Medical and Biological Research Involving Human Subjects, mirroring international frameworks like GDPR in restricting re-identification risks.

Technological Advancements and Updates

The DNA Data Bank of Japan (DDBJ) has continually advanced its software tools to make data submission, retrieval, and analysis more efficient. A key development is the Web API for Biology (WABI), a REST-based application programming interface introduced in 2007 and subsequently enhanced to enable programmatic access to DDBJ's search systems, homology search tools such as BLAST, and data retrieval services. The API supports integration with external workflows, allowing researchers to automate queries and analyses without manual web interaction. Complementing this, DFAST (DDBJ Fast Annotation and Submission Tool) provides an automated pipeline for prokaryotic genome annotation, generating submission-ready files for DDBJ while incorporating structural and functional predictions. DFAST was later extended with DFAST_VRL, a specialized version that annotates viral genomes using NCBI's VADR (Viral Annotation DefineR), released as an open-source standalone tool to speed up viral genome annotation.
System updates in recent years have focused on scaling for high-throughput data, particularly metagenomic sequences. The DDBJ Sequence Read Archive (DRA) was bolstered in 2022 to handle a 128% increase in Third Party Annotation (TPA) entries for metagenome-assembled genomes, enabling bulk submissions and improved metadata integration for raw sequencing data. As of August 2024, DRA provides 19.4 petabytes of sequencing data. Concurrently, the Mass Submission System (MSS) launched an online application form in June 2022, streamlining the intake of large-scale datasets via text files and reducing submission bottlenecks for voluminous genomic projects. For enhanced querying, DDBJ Search was refactored in November 2021, expanding coverage to include DRA, BioProject, and BioSample metadata and supporting faster federated searches across integrated resources.
Additionally, the beta release of the DDBJ Workflow Execution Service (WES) in 2022 adheres to Global Alliance for Genomics and Health (GA4GH) standards, allowing remote execution of analysis pipelines on the NIG supercomputer, including metagenomic assembly tools.
In September 2024, DDBJ introduced the TogoVar-repository for archiving genetic variants, issuing accession numbers for short (<50 bp) and large (>50 bp) variants in collaboration with NCBI's dbSNP and dbVar; initial datasets include structural variants (dstd1) and 200 million SNVs/INDELs from 9,287 individuals (dstd2). The DDBJ Group Cloud service expanded in April 2024 to facilitate sharing of pre-publication microbial and multi-omics datasets through the GteX Program, and BioSample validation was enhanced with a new 'plant package'. MetaboBank was updated to support the mzB format, including free viewers and libraries. These enhancements align with INSDC's March 2023 spatiotemporal metadata standards and April 2023 missing value reporting guidelines to improve data FAIRness. Effective January 2025, INSDC discontinued new submissions of the TPA-experimental and TPA-inferential types, continuing only TPA-assembly to streamline annotation processes.
Adoption of cloud computing and machine learning technologies has modernized DDBJ's infrastructure for automated analysis. The DDBJ Read Annotation Pipeline, a cloud computing-based system operational since 2013, processes next-generation sequencing data through decentralized high-throughput analysis on the NIG supercomputer, with updates integrating over 2,000 software containers by 2022. This facilitates automated workflows and reduces manual curation time for large datasets. In terms of machine learning integration, DDBJ's 2020 Data Challenge demonstrated the application of ML models to predict chromatin feature annotations from DNA sequences, engaging participants in crowdsourced development of predictive tools that inform automated annotation strategies.
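For the Workflow Execution Service described above, the shape of a run request under the GA4GH WES 1.x API can be sketched as follows. This is a minimal illustration, not DDBJ's client code: the base URL is a placeholder (the article does not give the real endpoint), the helper names `wes_url`, `build_run_form`, and `is_finished` are hypothetical, and the workflow URL and parameters are made up. The form field names and run states are those defined by the GA4GH WES specification.

```python
import json
from urllib.parse import urljoin

# Placeholder base URL (assumption); the actual DDBJ WES endpoint is not stated here.
WES_BASE = "https://wes.example.org/ga4gh/wes/v1/"

# Run states after which a WES run will not change any further.
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def wes_url(path: str, base: str = WES_BASE) -> str:
    """Join a WES resource path such as 'runs' onto the base URL."""
    return urljoin(base, path.lstrip("/"))


def build_run_form(workflow_url: str, workflow_type: str,
                   workflow_type_version: str, params: dict) -> dict:
    """Build the form fields sent as multipart/form-data to POST /runs."""
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,                  # e.g. "CWL" or "WDL"
        "workflow_type_version": workflow_type_version,
        "workflow_params": json.dumps(params, sort_keys=True),  # JSON-encoded string
    }


def is_finished(state: str) -> bool:
    """Return True once a polled run state is terminal."""
    return state in TERMINAL_STATES


# Hypothetical workflow and input, for illustration only.
form = build_run_form(
    "https://example.org/workflows/assembly.cwl", "CWL", "v1.0",
    {"reads": "sample.fastq"},
)
submit_to = wes_url("runs")                 # POST the form here
poll_at = wes_url("runs/RUN_ID/status")     # GET this until is_finished(state)
```

A client would POST `form` to `submit_to`, read the returned `run_id`, and poll the status resource until `is_finished` returns true. Authentication is omitted because it depends on the deployment.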
Performance metrics from infrastructure upgrades highlight these gains: the 2019 NIG supercomputer replacement increased storage capacity to petabyte scale and boosted transfer speeds via 10 Gbps Aspera servers, improving query response times for large genomic datasets from minutes to seconds in homology searches. An earlier refactoring in 2012 had increased CPU throughput more than fivefold for standalone operations.
In February 2025, NIG announced a major supercomputer upgrade, with the current system shutting down on February 21, 2025, and the next-generation system launching on March 1, 2025. Enhancements include doubled CPU core performance, new GPU nodes (L40S available April 2025, B200 in June 2025), a 100 Gbps network connection via SINET6, and a switch from Grid Engine to the Slurm job scheduler, running on Ubuntu Linux 24.04. This renewal, including a planned database storage replacement to handle 60–80 PB, supports over 1,700 investigators annually in analyzing large-scale data for DDBJ's research initiatives.
Looking ahead, DDBJ continues to expand the unified platform, introduced in 2020 and since integrated into tools such as DFAST and MSS, to support secure single-sign-on access across services, paving the way for broader cloud-based collaborations.
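To give a rough sense of what moving from a 10 Gbps to a 100 Gbps link means, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate under stated assumptions: it uses the nominal line rate, ignores protocol overhead, disk speed, and contention, and the 1 TB dataset size and the function name `transfer_seconds` are illustrative rather than taken from DDBJ documentation.

```python
def transfer_seconds(size_bytes: float, link_gbps: float) -> float:
    """Ideal transfer time: total bits divided by the link rate in bits per second."""
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9)


ONE_TB = 1e12  # illustrative dataset size in bytes (decimal terabyte)

for gbps in (10, 100):
    secs = transfer_seconds(ONE_TB, gbps)
    print(f"{gbps} Gbps: {secs:.0f} s for 1 TB")
```

Under these idealized assumptions a tenfold faster link cuts transfer time tenfold; real-world throughput will be lower and depends on the transfer tool and storage back end.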
