Gene transfer format
The Gene transfer format (GTF) is a file format used to hold information about gene structure. It is a tab-delimited text format based on the general feature format (GFF), but contains some additional conventions specific to gene information. A significant feature of GTF is that it can be validated: given a sequence and a GTF file, one can check that the format is correct. This significantly reduces problems with the interchange of data between groups.
GTF is identical to GFF, version 2.[1]
References
[1] GFF/GTF info, from Ensembl
Gene transfer format
GTF uses the gene_id and transcript_id attributes to hierarchically link features like coding sequences, start/stop codons, and UTRs, ensuring complete gene models can be reconstructed.[1]
GTF files are widely adopted in major genomic resources, such as Ensembl and the UCSC Genome Browser, for distributing annotation datasets across species like humans and mice, and they support validation against reference sequences to minimize data exchange errors.[5][3] Unlike broader GFF variants, GTF emphasizes gene-specific features—limiting UTR annotations to mRNA genes and excluding stop codons from terminal CDS regions—and allows optional comments or track lines for visualization in genome browsers.[1] This format's simplicity and extensibility have made it a cornerstone for tools in gene prediction, RNA-seq analysis, and comparative genomics.[2]
Introduction
Definition and Purpose
The Gene Transfer Format (GTF) is a tab-delimited text file format specifically designed for representing gene annotations in genomics, including key structural elements such as exons, coding sequences (CDS), and untranslated regions (UTRs).[1] It builds upon the General Feature Format (GFF) by incorporating a restricted vocabulary and mandatory attributes tailored to gene-centric data, ensuring compatibility while adding specificity for hierarchical features like transcripts and their components.[1]
The primary purpose of GTF is to enable the standardized storage, exchange, and computational analysis of gene structure information, allowing bioinformatics software and databases to process and integrate annotations efficiently without loss of detail.[1] This format supports the representation of complex gene models, such as alternative splicing variants, by linking individual features through unique identifiers, which facilitates tasks like genome visualization, expression quantification, and variant annotation in pipelines such as those used by Ensembl.[6]
GTF offers key advantages in reducing interchange errors through its machine-readable structure, which permits validation of features against reference genomes, and its focus on gene hierarchies that streamline parsing for downstream applications like RNA-seq alignment and differential expression analysis.[1] By prioritizing gene-specific details over general genomic features, it enhances interoperability across tools while minimizing ambiguity in data interpretation.[2] Emerging in the early 2000s to meet Ensembl's annotation needs, GTF addressed limitations in broader formats by providing a more precise mechanism for gene transfer between collaborators and systems.[1]
Relation to GFF
The Gene Transfer Format (GTF) is an extension of the General Feature Format version 2 (GFF2), retaining the core nine-column tab-delimited structure while introducing specialized conventions tailored for representing hierarchical gene and transcript structures in genomic annotations. Developed around 2000 during early large-scale genome projects such as the Human Genome Project and Drosophila annotation efforts, GTF builds directly on GFF2's foundational design, which was proposed in 2000 by the Sanger Institute to standardize feature descriptions across sequences. This extension was necessary to address the limitations of GFF2's broad applicability, which allowed diverse feature types and flexible attribute handling that often resulted in inconsistent representations of gene models across collaborating research groups.[1][4]
Key modifications in GTF from GFF2 include restrictions on the third column (feature type) to a limited set of gene-centric terms, such as "gene," "transcript," "exon," and "CDS," to enforce uniformity in describing transcript architectures, and the standardization of the ninth column (attributes) with mandatory key-value pairs like "gene_id" and "transcript_id" for explicitly linking parent-child relationships between features. These changes enable precise modeling of multi-transcript genes without relying on GFF2's more permissive group field, which could lead to ambiguous hierarchies. For instance, while GFF2 permitted arbitrary feature ontologies and unstructured notes, GTF mandates quoted attribute values and prohibits extraneous fields to streamline parsing for gene-focused analyses.[1][7][4]
GTF maintains backward compatibility with GFF2 parsers, as its files conform to the same columnar layout and can be interpreted as generic feature sets, though this sacrifices GTF's specialized gene-linking semantics and may introduce parsing errors in tools expecting broader GFF flexibility.
In contrast, the subsequent GFF3 format, released in 2004, extends beyond GTF by incorporating controlled vocabularies from the Sequence Ontology, support for arbitrary-depth feature hierarchies via parent-child ID references, and multi-file dataset coordination, features that GTF does not natively include to preserve its simplicity for transcript-level annotations. This evolutionary progression highlights GTF's role as a targeted refinement of GFF2 for gene-centric tasks, while GFF3 addressed broader annotation needs in subsequent projects.[5][7][4]
History
Development
The Gene Transfer Format (GTF) was developed around 2000-2001 by researchers at the Wellcome Sanger Institute through the Ensembl project to provide a standardized method for representing gene annotations in eukaryotic genomes such as the human genome.[8][9] This initiative addressed the pressing need for consistent data exchange amid the rapid expansion of genomic data during the Human Genome Project era, where diverse ad-hoc annotation files hindered interoperability across sequence analysis pipelines and computational tools.[9][10]
The initial specification of GTF built directly on the GFF2 format, adapting its nine-column structure to emphasize gene-centric features like exons, introns, and transcripts while introducing stricter conventions for attributes to facilitate automated processing.[1] First informal usage appeared in early Ensembl database releases, enabling users to download annotations in this format from the project's website, though no formal Request for Comments (RFC) process was undertaken; instead, the format was primarily documented in associated software manuals and early bioinformatics tool descriptions.[9][1]
Development was driven by the Ensembl team at the Sanger Institute, with key indirect influences from Michael Brent's group at Washington University in St. Louis, whose work on gene prediction tools like TWINSCAN helped shape early adoption and later refinements, culminating in their formalization of the GTF 2.2 specification.[9][1] This foundational effort prioritized practical utility for genome browsers and annotation pipelines over exhaustive generality, laying the groundwork for widespread use in eukaryotic genomics without initial emphasis on versioning or extensibility.[4]
Versions and Evolution
The Gene Transfer Format (GTF) originated from the Ensembl project at the Wellcome Trust Sanger Institute, building on the General Feature Format (GFF) to better represent gene structures. GTF emerged around 2000 in the context of the Drosophila melanogaster genome annotation efforts, adapting GFF for gene structures.[4] Early versions of GTF were adapted by projects such as WormBase for nematode genome annotations, highlighting its utility in handling complex eukaryotic gene models.[11] In 2003, GTF2.2 was formalized by Michael Brent's laboratory at Washington University in St. Louis, introducing stricter rules for the attributes field to enhance consistency and usability.[1] This version mandated attributes like gene_id and transcript_id for all features, ensuring hierarchical linking of exons, introns, and coding sequences, while adding frame information (0, 1, or 2) to the CDS features to indicate the reading frame offset for codon alignment.[1] These changes evolved from the initial GTF by incorporating refinements to robustly manage alternative splicing and multi-transcript genes, addressing limitations in representing transcript variants within a single gene locus.[1] GTF2.2 gained widespread adoption, notably in the UCSC Genome Browser, where it remains the standard for displaying and exporting gene annotations due to its compatibility with GFF2 and support for transcript grouping.[12]
Following 2010, no major new versions of GTF emerged, but adaptations continued through project-specific extensions. Ensembl adopted a variant known as "Ensembl GTF," which includes extended attributes such as gene_biotype and transcript_biotype to classify genes and transcripts (e.g., protein-coding, pseudogene, or lncRNA), facilitating more detailed functional annotations.[13] This variant maintains backward compatibility with core GTF2.2 while allowing integration with broader genomic data; converters like gffread enable seamless transformation to and from GFF3 for handling non-gene features.[13][14]
As of 2025, the GTF format remains stable and is the primary output for major annotation consortia like GENCODE and Ensembl, with files distributed for human and mouse reference genomes.[15] It is often supplemented by GFF3 for comprehensive feature sets beyond genes, and minor refinements in the score column have supported probabilistic annotations, such as confidence values from machine learning-based predictors (e.g., 0-1000 scale for splice site reliability in tools like SpliceAI).[15][16]
Format Details
File Structure
The Gene Transfer Format (GTF) is a plain text file format designed for representing genomic features, consisting of one line per feature with exactly nine tab-delimited columns.[5] This structure extends the nine-column model of the General Feature Format (GFF) version 2, adapting it specifically for gene annotations while maintaining compatibility for parsing tools.[1] Files typically include optional header lines at the beginning, which start with a "#" character to denote metadata such as version information, source details, or track definitions, allowing for basic configuration without affecting the core data.[13]
Each line in a GTF file represents a single genomic feature, such as a gene, transcript, or exon, and lines are conventionally ordered by sequence name (e.g., chromosome) followed by genomic position to facilitate efficient sequential parsing and indexing in bioinformatics workflows.[5] Coordinates within the file use 1-based integer indexing, with each start position less than or equal to its end position.[1] Data types across the columns include strings for identifiers and sources, integers for positions, optional floating-point values or a "." placeholder for scores, and single characters for strand ("+" or "-") and frame (0, 1, 2, or "."); the final column contains a structured attribute string.[13]
GTF files adhere to strict conventions for robustness: they must contain no blank lines, employ UTF-8 encoding for character compatibility, and use single-tab separation without extra whitespace between fields.[1] In the attributes column, each key-value pair ends with a semicolon, and each semicolon is followed by exactly one space before the next key (if any), so the line ends with a semicolon after the last pair. No tabs or additional spaces are allowed within the attributes field.
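The line layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference parser: the function names `parse_attributes` and `iter_features` are hypothetical, and the attribute splitter assumes that quoted values never contain a semicolon.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_attributes(field: str) -> Dict[str, str]:
    """Parse the ninth column, e.g. 'gene_id "geneA"; transcript_id "tA";'.

    Assumption: values do not contain ';', so a plain split is enough.
    """
    attrs: Dict[str, str] = {}
    for pair in field.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue  # the trailing semicolon leaves an empty last item
        key, _, value = pair.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def iter_features(lines: Iterable[str]) -> Iterator[Tuple[List[str], Dict[str, str]]]:
    """Yield (first eight columns, attribute dict) for each feature line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip header/comment lines (and tolerate blank ones)
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 tab-delimited columns, found {len(cols)}")
        yield cols[:8], parse_attributes(cols[8])


demo = [
    "#!genome-build example\n",
    'chr1\texample\texon\t1000\t2000\t.\t+\t.\tgene_id "geneA"; transcript_id "transcriptA";\n',
]
for columns, attributes in iter_features(demo):
    print(columns[2], attributes["gene_id"])
```

The sketch keeps the first eight columns as strings; converting coordinates and scores to numbers is a separate step that depends on how strict the consumer needs to be.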
For validation, files should be checked for basic integrity, including coordinate ordering (start ≤ end on each line) and hierarchical consistency, such as ensuring child features (e.g., exons) fall within the bounds of their parent features (e.g., transcripts), often using dedicated tools like validate_gtf.pl from the Eval package.[13] These conventions promote interoperability across genome browsers and annotation pipelines, such as those in Ensembl and GENCODE projects.[1]
Column Specifications
The Gene Transfer Format (GTF) consists of nine tab-delimited columns per line, each specifying essential aspects of genomic features in a gene-centric annotation.[1] These columns provide a structured representation of sequence landmarks, allowing for precise mapping and analysis of genes, transcripts, and related elements.[13]
| Column | Name | Type | Description and Usage |
|---|---|---|---|
| 1 | seqname | String | Identifies the reference sequence, such as a chromosome (e.g., "chr1") or scaffold ID; it must align with the nomenclature of the reference genome assembly to ensure coordinate consistency.[1][13] Coordinates for features are unique within each seqname across the annotation set.[1] |
| 2 | source | String | Denotes the origin of the annotation, typically the name of a database, consortium, or prediction tool (e.g., "Ensembl" or "HAVANA"); this field serves as a unique identifier for the annotation provider.[1][13] In practice, it distinguishes between manual curation and computational predictions.[17] |
| 3 | feature | String | Specifies the type of genomic feature using a controlled vocabulary, such as "gene", "transcript", "exon", "CDS", "start_codon", or "stop_codon"; the gene-centric emphasis, including features like UTRs and introns specific to transcripts, differentiates GTF from the more general GFF format.[1][13] Feature types are case-sensitive and focus on elements relevant to gene structure and expression.[1] |
| 4 | start | Integer | Represents the 1-based genomic starting position of the feature; it must be less than or equal to the end position, with numbering beginning at position 1 of the sequence.[1][13] This inclusive coordinate system facilitates accurate feature localization.[17] |
| 5 | end | Integer | Indicates the 1-based genomic ending position of the feature, which is inclusive of the endpoint; it must be greater than or equal to the start position.[1][13] Extensions beyond the sequence boundaries are generally avoided to maintain annotation integrity.[1] |
| 6 | score | Float or '.' | Provides an optional numerical confidence score for the feature (e.g., ranging from 0 to 1000), or a period ('.') if no score is available; scores are relative to the source and can be used by tools for filtering or prioritization.[1][17] In many gene annotation datasets, this field is left as '.' due to the lack of standardized scoring.[13] |
| 7 | strand | String | Specifies the strand orientation of the feature as '+' for the forward strand, '-' for the reverse strand, or '.' for unstranded features (which is uncommon in gene annotations); this determines the directionality for interpreting coordinates and frames.[1][13] For reverse-strand features, the 5' end corresponds to the higher coordinate.[1] |
| 8 | frame | Integer or '.' | For coding sequence (CDS) features, indicates the reading frame as 0 (codon aligned such that the first base starts a codon), 1 (one base offset), or 2 (two bases offset); otherwise, it is '.'; this aids in calculating the correct translation frame from genomic coordinates.[1][13] On the reverse strand, frame calculation considers the end coordinate as the 5' reference.[1] |
| 9 | attributes | String | Contains semicolon-separated key-value pairs for additional metadata (for example: gene_id "value"; gene_type "protein_coding";); values are typically enclosed in double quotes, and this column stores essential identifiers and properties without tabs or extra spaces between pairs.[1][13] Attributes are mandatory and include at least gene_id and transcript_id. |
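The column types in the table map naturally onto a typed record. The following sketch, with the hypothetical names `GtfRecord`, `parse_record`, and `column_problems`, converts one line into typed fields and checks the per-column constraints listed above. It leaves the attribute string unparsed and does not check anything that spans several lines.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GtfRecord:
    seqname: str
    source: str
    feature: str
    start: int               # 1-based, inclusive
    end: int                 # 1-based, inclusive
    score: Optional[float]   # None when the column holds "."
    strand: str              # "+", "-" or "."
    frame: Optional[int]     # 0, 1, 2, or None when the column holds "."
    attributes: str          # raw ninth column


def parse_record(line: str) -> GtfRecord:
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = cols
    return GtfRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if frame == "." else int(frame),
        attrs,
    )


def column_problems(rec: GtfRecord) -> List[str]:
    """Return human-readable violations of the per-column rules."""
    problems: List[str] = []
    if rec.start < 1:
        problems.append("start must be >= 1 (coordinates are 1-based)")
    if rec.start > rec.end:
        problems.append("start must be <= end")
    if rec.strand not in {"+", "-", "."}:
        problems.append(f"unexpected strand {rec.strand!r}")
    if rec.frame is not None and rec.frame not in (0, 1, 2):
        problems.append(f"unexpected frame {rec.frame!r}")
    return problems


rec = parse_record('chr1\texample\tCDS\t1200\t1800\t.\t+\t0\tgene_id "geneA"; transcript_id "transcriptA";')
print(rec.feature, rec.start, rec.end, column_problems(rec))
```

Returning a list of problems rather than raising on the first one is a design choice for this sketch; it lets a caller report every issue on a line at once.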
Attributes and Conventions
Standard Attributes
The standard attributes in the Gene Transfer Format (GTF) are key-value pairs residing in the ninth column of each feature line, with gene_id and transcript_id being the core mandatory attributes designed to hierarchically link genomic elements such as genes, transcripts, and exons. These provide essential identifiers that enable the reconstruction of gene models from flat file data, while additional optional attributes are conventionally used in major annotations for enhanced metadata. GTF builds on the General Feature Format (GFF) by establishing these gene-focused conventions to promote interoperability across bioinformatics tools, though attributes remain flexible, as in GFF version 2.[5][13]
The gene_id attribute serves as a unique identifier for the gene locus, required for all features associated with a gene, such as transcripts and exons; it typically follows a stable naming convention like "ENSG000001" in Ensembl annotations. This attribute allows grouping of all downstream features under a single gene entity.[13][5]
The transcript_id attribute provides a unique identifier for each transcript isoform within a gene, mandatory for features like exons and coding sequences (CDS) to link them to their parent transcript; examples include "ENST000003" in GENCODE files. It facilitates the distinction between alternative splicing variants of the same gene.[13][12]
The gene_name attribute offers a human-readable symbol for the gene, such as "TP53", and is optional but widely included in annotations for intuitive reference and display purposes. Similarly, the transcript_name attribute assigns a readable name to the transcript, often formatted as "TP53-001", adhering to conventions that extend the gene_name for clarity. These names enhance accessibility without altering the functional uniqueness provided by IDs.[13][5]
The exon_number attribute specifies the sequential position of an exon within its transcript, using integers starting from 1 at the 5' end (e.g., "1" for the first exon); it is optional in the core GTF but required in annotations like GENCODE for exon features to maintain order during model assembly. This numbering supports accurate splicing reconstruction.[13]
GTF attributes are case-sensitive, meaning tags like "gene_id" must match exactly to avoid parsing errors. Values containing spaces, semicolons, or special characters must be enclosed in double quotes, and each attribute-value pair ends with a semicolon followed by a space; duplicate tags are not permitted within a single line to prevent ambiguity. These rules, placed in the ninth column, ensure consistent parsing across tools.[5][13][12]
Collectively, these attributes enable bioinformatics pipelines to build comprehensive gene models by grouping and ordering features based on shared gene_id and transcript_id values, supporting applications from visualization to variant annotation.[13][5]
Gene-Specific Conventions
In the Gene Transfer Format (GTF), gene-specific conventions extend the core structural attributes to incorporate biological classifications and annotations tailored to gene and transcript functionality. The gene_type or gene_biotype attribute classifies the overall function of a gene, using standardized terms such as "protein_coding" for genes encoding proteins, "lncRNA" for long non-coding RNAs longer than 200 nucleotides without protein-coding potential, and "pseudogene" for genomic sequences homologous to functional genes but inactivated by mutations like frameshifts or premature stop codons.[13][18] These terms, originally defined in Ensembl annotations, have become widely adopted across GTF implementations to facilitate consistent categorization of genetic elements.[18]
Similarly, the transcript_type or transcript_biotype attribute applies to transcript-level features, mirroring gene classifications with examples including "mRNA" for mature messenger RNAs and "nonsense_mediated_decay" for transcripts containing premature termination codons more than 50 nucleotides upstream of the last exon-exon junction, which are targeted for degradation.[13][18] This distinction allows for nuanced representation of transcript variants within a single gene, such as alternative splicing products that differ in biotype.[13]
Additional biological tags provide linkages to external resources and manual annotations. The havana_gene attribute, sourced from the Havana manual curation project, assigns a unique identifier in the format OTTHUMGXXXXXXXXXXX.X to genes with expert-reviewed structures, appearing optionally in the attributes field.[13] For coding sequences, the ccds_id links to the Consensus CDS (CCDS) project with identifiers like CCDS45890.1, ensuring alignment across species for conserved protein-coding regions.[13] The protein_id attribute connects transcripts to UniProt entries using Ensembl protein identifiers such as ENSP00000328677.4, enabling cross-referencing to protein sequence databases.[13][5]
Splicing-related conventions in GTF emphasize accurate representation of mature transcripts. Untranslated regions (UTRs) are annotated as separate features—five_prime_utr (or "5UTR") for sequences upstream of the start codon and three_prime_utr (or "3UTR") for those downstream of the stop codon—sharing the same transcript_id and gene_id as associated coding sequences (CDS) and exons to maintain hierarchy.[1][13] These UTR features are defined by genomic coordinates within exons and do not include introns, with the frame field indicating codon positioning (0 for complete codons, 1 or 2 for partial). Alternative start or end codons are handled by specifying multiple CDS features per transcript, ensuring the reading frame aligns correctly across splice junctions without overlapping genomic intervals.[1]
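How the frame value carries across splice junctions can be made concrete with a short calculation. The sketch below assumes that the first CDS segment (in 5' to 3' order) starts with a complete codon, so its frame is 0, and derives the frame of each following segment from the number of bases already consumed. The function name `cds_frames` and the coordinates are hypothetical.

```python
from typing import List, Tuple


def cds_frames(segments: List[Tuple[int, int]], strand: str) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) for each CDS segment in 5'->3' order.

    frame = number of bases at the 5' end of the segment that belong to a
    codon begun in the previous segment (0, 1, or 2).
    """
    # On the reverse strand the 5' end is the higher coordinate,
    # so segments are visited from high to low positions.
    ordered = sorted(segments, reverse=(strand == "-"))
    result: List[Tuple[int, int, int]] = []
    frame = 0  # assumption: translation starts at the first base of segment 1
    for start, end in ordered:
        result.append((start, end, frame))
        length = end - start + 1  # coordinates are 1-based and inclusive
        frame = (3 - (length - frame) % 3) % 3
    return result


# Two hypothetical CDS segments of 10 and 11 bases on the forward strand.
print(cds_frames([(100, 109), (200, 210)], "+"))
# The same segments on the reverse strand are visited in the opposite order.
print(cds_frames([(100, 109), (200, 210)], "-"))
```

On the forward strand the 10-base first segment ends one base into a codon, so the second segment begins with the two bases that complete it and receives frame 2. Real annotations can start with an incomplete codon, in which case the initial frame would not be 0.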
Hierarchy enforcement is a core convention to reflect biological nesting: genes encompass transcripts via shared gene_id, transcripts contain exons and CDS via transcript_id, and all features must be non-overlapping within the same chromosome while respecting strand orientation.[1][13] This structure prevents ambiguity in parsing gene models, with IDs unique across the file to avoid cross-chromosome conflicts.[1]
In practice, GTF files often include non-core extensions for enhanced annotation, though these are not part of the original specification. The level attribute denotes annotation confidence, with values like 1 for manually verified genes, 2 for predicted models, and 3 for automated low-confidence entries, as implemented in GENCODE releases.[13] The ont attribute may link features to Gene Ontology (GO) terms, such as PGO:0000004 for specific biological processes, allowing optional integration of functional data without altering the core format.[13] These extensions promote interoperability while preserving backward compatibility with the GTF 2.2 standard.[1]
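The nesting rule described above (genes contain transcripts, transcripts contain exons and CDS) can be checked mechanically. The following is a minimal sketch that assumes a file carrying explicit gene and transcript lines, as Ensembl- and GENCODE-style files do; the `Feature` tuple, the `hierarchy_problems` function, and all identifiers are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Feature(NamedTuple):
    kind: str                            # column 3: "gene", "transcript", "exon", "CDS", ...
    start: int
    end: int
    gene_id: str
    transcript_id: Optional[str] = None  # absent on gene lines


def hierarchy_problems(features: List[Feature]) -> List[str]:
    """Check that every feature lies inside its parent and shares its gene_id."""
    genes = {f.gene_id: f for f in features if f.kind == "gene"}
    transcripts = {f.transcript_id: f for f in features if f.kind == "transcript"}
    problems: List[str] = []
    for f in features:
        if f.kind == "gene":
            continue
        # Transcripts are looked up by gene_id, everything else by transcript_id.
        parent = genes.get(f.gene_id) if f.kind == "transcript" else transcripts.get(f.transcript_id)
        label = f"{f.kind} {f.start}-{f.end}"
        if parent is None:
            problems.append(f"{label}: no parent line found")
        elif parent.gene_id != f.gene_id:
            problems.append(f"{label}: gene_id differs from parent")
        elif f.start < parent.start or f.end > parent.end:
            problems.append(f"{label}: outside parent bounds {parent.start}-{parent.end}")
    return problems


model = [
    Feature("gene", 1000, 3000, "geneX"),
    Feature("transcript", 1000, 3000, "geneX", "tx1"),
    Feature("exon", 1000, 1500, "geneX", "tx1"),
    Feature("CDS", 1100, 1400, "geneX", "tx1"),
    Feature("exon", 2000, 3000, "geneX", "tx1"),
]
print(hierarchy_problems(model))
print(hierarchy_problems(model + [Feature("exon", 2900, 3100, "geneX", "tx1")]))
```

Files that follow the older convention of listing only exon, CDS, and codon features have no parent lines to compare against, so a check like this would need to derive transcript bounds from the exons instead.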
Examples
Basic Gene Annotation Example
To illustrate the core structure of a Gene Transfer Format (GTF) file, consider a hypothetical single-exon gene located on chromosome 1 (chr1). This example includes the essential features: a gene entry defining the overall locus, a transcript entry representing the single transcript, an exon entry for the untranslated and translated regions, and a CDS (coding sequence) entry specifying the protein-coding portion.[13][1] The following tab-delimited lines represent this annotation in GTF format, where the source is labeled as "example" for simplicity:
chr1 example gene 1000 2000 . + . gene_id "geneA"; gene_name "GENEA";
chr1 example transcript 1000 2000 . + . gene_id "geneA"; transcript_id "transcriptA"; gene_name "GENEA";
chr1 example exon 1000 2000 . + . gene_id "geneA"; transcript_id "transcriptA";
chr1 example CDS 1200 1800 . + 0 gene_id "geneA"; transcript_id "transcriptA";
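As a quick check on the coordinates in this example, the sketch below reads the four lines (written with explicit tab characters) and derives the parts of the exon that flank the CDS. The helper names are hypothetical. Because the example has no stop_codon line, the sketch does not separate a stop codon from the downstream flank, which a real annotation following the GTF stop-codon convention would require.

```python
EXAMPLE = (
    'chr1\texample\tgene\t1000\t2000\t.\t+\t.\tgene_id "geneA"; gene_name "GENEA";\n'
    'chr1\texample\ttranscript\t1000\t2000\t.\t+\t.\tgene_id "geneA"; transcript_id "transcriptA"; gene_name "GENEA";\n'
    'chr1\texample\texon\t1000\t2000\t.\t+\t.\tgene_id "geneA"; transcript_id "transcriptA";\n'
    'chr1\texample\tCDS\t1200\t1800\t.\t+\t0\tgene_id "geneA"; transcript_id "transcriptA";\n'
)


def feature_span(text, feature):
    """Return (start, end) of the first line whose third column equals `feature`."""
    for line in text.splitlines():
        cols = line.split("\t")
        if cols[2] == feature:
            return int(cols[3]), int(cols[4])
    raise ValueError(f"no {feature} line found")


def length(span):
    """Length of a 1-based, inclusive interval."""
    return span[1] - span[0] + 1


exon = feature_span(EXAMPLE, "exon")
cds = feature_span(EXAMPLE, "CDS")
upstream = (exon[0], cds[0] - 1)    # exon bases before the CDS (5' side on "+")
downstream = (cds[1] + 1, exon[1])  # exon bases after the CDS (3' side on "+")

print(upstream, downstream)  # (1000, 1199) (1801, 2000)
print(length(exon), length(cds), length(upstream), length(downstream))  # 1001 601 200 200
```

The three parts add up to the exon length (200 + 601 + 200 = 1001), which is what inclusive 1-based coordinates imply.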
Multi-Transcript Gene Example
In the Gene Transfer Format (GTF), genes with multiple transcripts, such as those arising from alternative splicing, are represented by a single gene feature entry followed by separate transcript features, each linked via the shared gene_id attribute but distinguished by unique transcript_id attributes.[5][13] Consider an illustrative example of a protein-coding gene named "geneB" located on chromosome 1 (chr1), spanning positions 1000 to 3000. This gene produces two transcripts: transcript "t1", an mRNA with two exons (indicating splicing) and a coding sequence (CDS), and transcript "t2", a nonsense-mediated decay (NMD) transcript with a single exon.[1]
The following excerpt shows the relevant tab-delimited GTF lines for this multi-transcript gene (source labeled as "example" for illustration):
chr1 example gene 1000 3000 . + . gene_id "geneB"; gene_type "protein_coding";
chr1 example transcript 1000 3000 . + . gene_id "geneB"; transcript_id "t1"; transcript_type "mRNA";
chr1 example exon 1000 1500 . + . gene_id "geneB"; transcript_id "t1"; exon_number "1";
chr1 example CDS 1100 1400 . + 0 gene_id "geneB"; transcript_id "t1";
chr1 example exon 2000 3000 . + . gene_id "geneB"; transcript_id "t1"; exon_number "2";
chr1 example transcript 1000 2500 . + . gene_id "geneB"; transcript_id "t2"; transcript_type "nonsense_mediated_decay";
chr1 example exon 1000 2500 . + . gene_id "geneB"; transcript_id "t2"; exon_number "1";
Distinct transcript_id values allow different exon combinations for each isoform while maintaining the overarching gene_id.[5] The exon_number attribute orders exons sequentially from the 5' end within each transcript, aiding in reconstruction of the mature mRNA.[13] Transcript biotypes, such as "mRNA" for protein-coding or "nonsense_mediated_decay" for NMD-prone variants, are specified via transcript_type to classify functional implications.[1]
Common pitfalls in constructing such entries include reusing the same transcript_id value across different genes, which can cause parsing errors in downstream tools, and failing to keep exon coordinates within a transcript non-overlapping, which is needed to define introns accurately.[5] Conventions for biotypes follow standardized lists from sources like GENCODE.[13]
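Both pitfalls can be detected with a small amount of grouping logic. The sketch below uses the exon coordinates of the geneB example as plain tuples; the function names `introns_by_transcript` and `reused_transcript_ids` are hypothetical, and the code is an illustration rather than a replacement for a dedicated validator.

```python
from collections import defaultdict

# (feature, start, end, gene_id, transcript_id), taken from the geneB example
ROWS = [
    ("exon", 1000, 1500, "geneB", "t1"),
    ("exon", 2000, 3000, "geneB", "t1"),
    ("exon", 1000, 2500, "geneB", "t2"),
]


def introns_by_transcript(rows):
    """Group exons by transcript_id and return the gaps between them."""
    exons = defaultdict(list)
    for feature, start, end, _gene_id, transcript_id in rows:
        if feature == "exon":
            exons[transcript_id].append((start, end))
    introns = {}
    for transcript_id, spans in exons.items():
        spans.sort()
        gaps = []
        for (_s1, e1), (s2, _e2) in zip(spans, spans[1:]):
            if s2 <= e1:
                raise ValueError(f"overlapping exons in transcript {transcript_id}")
            if s2 > e1 + 1:
                gaps.append((e1 + 1, s2 - 1))  # 1-based, inclusive intron
        introns[transcript_id] = gaps
    return introns


def reused_transcript_ids(rows):
    """Return transcript_id values that appear under more than one gene_id."""
    genes_for = defaultdict(set)
    for _feature, _start, _end, gene_id, transcript_id in rows:
        genes_for[transcript_id].add(gene_id)
    return sorted(t for t, genes in genes_for.items() if len(genes) > 1)


print(introns_by_transcript(ROWS))  # {'t1': [(1501, 1999)], 't2': []}
print(reused_transcript_ids(ROWS))  # []
```

Transcript t1 yields a single intron between its two exons, while the single-exon transcript t2 has none.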
